On Mon, Nov 23, 2015 at 2:09 PM, Sebastian Berg wrote:
Post by Eli Bendersky
Hello,

I'm trying to understand the buffering done by the NumPy iterator
interface (the new post-1.6 one) when running ufuncs on arrays that
require broadcasting:

```
In [35]: m = np.arange(16).reshape(4,4)

In [37]: n = np.arange(4)

In [39]: m + n
Out[39]:
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])
```
At first sight this seems true. However, there is one other point to
consider here: the inner ufunc loop can only handle a single stride. The
contiguous array `n` has to be iterated as if it had the strides
`(0, 8)`, which does not match the contiguous array `m`, whose
iteration can be "unrolled" to 1-D. Those effective strides are thus
not contiguous for the inner ufunc loop, and the operation cannot be
unrolled into a single ufunc call!
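The effective strides can be checked directly (a small sketch; `np.broadcast_to` is used here only to illustrate what the iterator sees, it is not what the ufunc machinery calls internally):

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# m is C-contiguous: all 16 elements can be walked with one constant
# stride, so the 2-D iteration can be unrolled to a single 1-D loop.
assert m.strides == (4 * m.itemsize, m.itemsize)

# Broadcasting n against m's shape gives effective strides (0, itemsize):
# no single constant stride covers all 16 elements, so the inner ufunc
# loop cannot be unrolled the same way.
b = np.broadcast_to(n, (4, 4))
assert b.strides == (0, n.itemsize)
assert np.shares_memory(n, b)  # a view with a zero stride, not a copy
```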
The optimization (which might kick in a bit more broadly, maybe) is thus
that the number of inner loop calls is minimized. Whether that is worth
it, I am not sure; it may well be that there is some worthwhile
optimization possible here.
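The difference in inner-loop calls can be made visible with `np.nditer` (an illustrative sketch; the exact chunk sizes are an implementation detail and may vary across NumPy versions):

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# Without buffering, the zero-stride broadcast operand stops the
# iterator from coalescing the two axes, so the external loop hands
# out one chunk per row: four inner-loop calls of 4 elements each.
chunks_nobuf = [len(a) for a, b in np.nditer([m, n], flags=['external_loop'])]
print(chunks_nobuf)

# With buffering, n is copied (repeated) into a temporary buffer whose
# stride matches m, so the iterator can hand out fewer, larger chunks
# (typically a single chunk of all 16 elements here).
chunks_buf = [len(a) for a, b in
              np.nditer([m, n], flags=['external_loop', 'buffered'])]
print(chunks_buf)
```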
Note, however, that this does not occur for large inner loop sizes:
```
In [18]: n = np.arange(40000)
In [19]: m = np.arange(160000).reshape(4,40000)
In [20]: o = m + n
Iterator: Checking casting for operand 0
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 1
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('int64')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
Iterator: Copying inputs to buffers
Iterator: Expanding inner loop size from 8192 to 40000 since buffering wasn't needed
Any buffering needed: 0
Iterator: Finished copying inputs to buffers (buffered size is 40000)
```
The heuristic in the code says that if we can use a single stride and
the resulting loop size is larger than the buffer size (which I assume
is the default buffer size, and can be changed), then it's
"is_onestride" and no buffering is done.
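For reference, that buffer size is exposed at the Python level (a small sketch; the 8192-element default matches the threshold seen in the trace above):

```python
import numpy as np

# Default ufunc buffer size, in elements (8192 unless changed).
print(np.getbufsize())

# It can be changed at runtime; here we double it and then restore it.
old = np.getbufsize()
np.setbufsize(2 * old)
assert np.getbufsize() == 2 * old
np.setbufsize(old)
```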
So this led me to explore around this threshold (8192 items by default):

```
In [51]: %%timeit n = np.arange(8192); m = np.arange(8192*40).reshape(40,8192)
   ....: o = m + n
1000 loops, best of 3: 274 µs per loop

In [52]: %%timeit n = np.arange(8292); m = np.arange(8292*40).reshape(40,8292)
   ....: o = m + n
1000 loops, best of 3: 229 µs per loop
```
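The same comparison as a standalone script, a sketch using the `timeit` module (absolute timings will of course differ by machine and NumPy version):

```python
import timeit
import numpy as np

def bench(k, number=1000):
    """Time m + n, where n has k elements, around the buffer threshold."""
    n = np.arange(k)
    m = np.arange(k * 40).reshape(40, k)
    return min(timeit.repeat(lambda: m + n, number=number, repeat=3)) / number

# 8192 matches the default buffer size; 8292 is just past it, so the
# single-stride (unbuffered) path should be taken for the larger size.
for k in (8192, 8292):
    print(f"k={k}: {bench(k) * 1e6:.0f} us per add")
```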
So, given this, it's not very clear why the "optimization" kicks in.
Buffering for small sizes seems like a mistake.
I am pretty sure it is not generally a mistake. Consider the case of an
operation with very short inner loops (though, I am not sure if this is
done usually): without buffering you could have 10000x overhead from the
extra inner loop calls. Doing the buffering (which I believe has some
optimizations of its own) minimizes those calls, in this case at least.
Even calling a C function (and looping) for an inner loop that short is,
my guess, more overhead than the buffering.
Eli
Anyway, feel free to have a look ;). The code is not the most readable
one in NumPy, and it would not surprise me a lot if you can find
something to tweak.
- Sebastian
Post by Eli Bendersky
If I instrument NumPy (setting NPY_IT_DBG_TRACING and such), I see
that when the add() ufunc is called, 'n' is copied into a temporary
buffer by the iterator. The ufunc then gets the buffer as its data.

My question is: why is this buffering needed? It seems wasteful, since
no casting is required here, there are no special alignment problems,
and 'n' is contiguously laid out in memory. It seems that it would be
more efficient to just use 'n' in the ufunc instead of passing in the
buffer. What am I missing?

Thanks in advance,
Eli
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion