Nico Schlömer

2017-03-02 10:27:45 UTC

Hi everyone,

When trying to speed up my code, I noticed that simply by reordering my

data I could get more than twice as fast for the simplest operations:

```

import numpy

a = numpy.random.rand(50, 50, 50)

%timeit a[0] + a[1]

1000000 loops, best of 3: 1.7 µs per loop

%timeit a[:, 0] + a[:, 1]

100000 loops, best of 3: 4.42 µs per loop

%timeit a[..., 0] + a[..., 1]

100000 loops, best of 3: 5.99 µs per loop

```

This makes sense: by default, NumPy uses C-style (row-major) memory layout, so the last index varies fastest. The blocks added in `a[0] + a[1]` are contiguous in memory, so the cache is used effectively. By contrast, the blocks added in `a[:, 0] + a[:, 1]` are not contiguous, and those in `a[..., 0] + a[..., 1]` even less so; hence the slowdown. Would that be the correct explanation?
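(The slices' `flags` and `strides` seem to back this up; a quick check of my own, with the stride values assuming 8-byte floats:)

```python
import numpy

a = numpy.random.rand(50, 50, 50)

# a[0] is a contiguous view: its rows follow one another in memory.
print(a[0].flags['C_CONTIGUOUS'])       # True
print(a[0].strides)                     # (400, 8)

# a[:, 0] has contiguous rows, but skips a whole 50x50 block
# between consecutive rows, so the array as a whole is not contiguous.
print(a[:, 0].flags['C_CONTIGUOUS'])    # False
print(a[:, 0].strides)                  # (20000, 8)

# a[..., 0] has no contiguous run at all: even neighbouring
# elements are 50 * 8 = 400 bytes apart.
print(a[..., 0].flags['C_CONTIGUOUS'])  # False
print(a[..., 0].strides)                # (20000, 400)
```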

If so, I'm wondering why most numpy.linalg methods, when vectorized, put the stacking index up front. E.g., to mass-compute determinants, one has to do

```

a = numpy.random.rand(777, 3, 3)

numpy.linalg.det(a)

```

This way, each 3x3 matrix forms a contiguous memory block, so computing `det` block by block is fine. However, vectorized operations (like the `+` above) will be slower than necessary.

Any background on this?

(I came across this when I had to rearrange my data (swapaxes, rollaxis) from shape (3, 3, 777), which allows for fast vectorized operations in the rest of the code, to (777, 3, 3) to use NumPy's svd.)
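For the record, the rearrangement I mean looks like this (here with `moveaxis`; `swapaxes`/`rollaxis` work just as well):

```python
import numpy

b = numpy.random.rand(3, 3, 777)   # layout that is fast for elementwise ops

# numpy.linalg.svd expects the stacked matrices on the leading axes,
# i.e. shape (777, 3, 3). moveaxis gives a view with the right shape;
# ascontiguousarray then makes each 3x3 block contiguous again.
c = numpy.ascontiguousarray(numpy.moveaxis(b, -1, 0))
print(c.shape)                     # (777, 3, 3)

u, s, vt = numpy.linalg.svd(c)
print(u.shape, s.shape, vt.shape)  # (777, 3, 3) (777, 3) (777, 3, 3)
```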

Cheers,

Nico
