Discussion:
[Numpy-discussion] understanding buffering done when broadcasting
Eli Bendersky
2015-11-23 21:31:36 UTC
Hello,

I'm trying to understand the buffering done by the Numpy iterator interface
(the new post 1.6-one) when running ufuncs on arrays that require
broadcasting. Consider this simple case:

In [35]: m = np.arange(16).reshape(4,4)
In [37]: n = np.arange(4)
In [39]: m + n
Out[39]:
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])

If I instrument Numpy (setting NPY_IT_DBG_TRACING and such), I see that
when the add() ufunc is called, 'n' is copied into a temporary buffer by
the iterator. The ufunc then gets the buffer as its data.

My question is: why is this buffering needed? It seems wasteful, since
no casting is required here, there are no special alignment problems,
and 'n' is laid out contiguously in memory. It seems that it would be
more efficient to just use 'n' in the ufunc instead of passing in the
buffer. What am I missing?

Thanks in advance,
Eli
Sebastian Berg
2015-11-23 22:09:53 UTC
Post by Eli Bendersky
Hello,
I'm trying to understand the buffering done by the Numpy iterator
interface (the new post 1.6-one) when running ufuncs on arrays that
require broadcasting. Consider this simple case:
In [35]: m = np.arange(16).reshape(4,4)
In [37]: n = np.arange(4)
In [39]: m + n
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])
On first sight this seems true. However, there is one other point to
consider here. The inner ufunc loop can only handle a single stride. The
contiguous array `n` has to be iterated as if it had the strides
`(0, 8)`, which do not match the strides of the contiguous array `m`,
which can be "unrolled" to 1-D. Those effective strides are thus not
contiguous for the inner ufunc loop, so the operation cannot be unrolled
into a single ufunc call.
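From Python, these effective strides can be inspected with `np.broadcast_to` (a minimal sketch; the concrete stride values assume the default 64-bit integer dtype):

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# Broadcasting `n` against `m` produces a view whose first-axis stride
# is 0, i.e. every row re-reads the same 4 elements.
nb = np.broadcast_to(n, m.shape)
print(m.strides)   # (32, 8) for int64: contiguous, unrollable to 1-D
print(nb.strides)  # (0, 8): not a single stride over all 16 elements
```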

The optimization (which might perhaps kick in a bit more broadly) is
thus that buffering minimizes the number of inner loop calls. Whether
that is worth it, I am not sure; it may well be that there is some
worthwhile optimization possible here.
Note, however, that this buffering does not occur for large inner loop
sizes (though I think you can find some "bad" sizes):

```
In [18]: n = np.arange(40000)

In [19]: m = np.arange(160000).reshape(4,40000)

In [20]: o = m + n
Iterator: Checking casting for operand 0
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 1
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('int64')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
Iterator: Copying inputs to buffers
Iterator: Expanding inner loop size from 8192 to 40000 since buffering
wasn't needed
Any buffering needed: 0
Iterator: Finished copying inputs to buffers (buffered size is 40000)
```
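The same behavior is visible from pure Python via `np.nditer`, which exposes the iterator's external loop (a sketch; the single 16-element buffered chunk assumes the default buffer size, which is far larger than this array):

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# Without buffering, the broadcast `n` (stride 0 on axis 0) limits each
# inner-loop chunk to one row:
plain = [a.shape[0] for a, b in np.nditer([m, n], flags=['external_loop'])]
print(plain)     # [4, 4, 4, 4]

# With buffering, the iterator copies `n` into a temporary buffer so the
# whole operation fits in a single inner-loop call:
buffered = [a.shape[0] for a, b in np.nditer(
    [m, n], flags=['external_loop', 'buffered'])]
print(buffered)  # [16]
```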

Anyway, feel free to have a look ;). The code is not the most readable
in NumPy, and it would not surprise me a lot if you can find something
to tweak.

- Sebastian
Post by Eli Bendersky
If I instrument Numpy (setting NPY_IT_DBG_TRACING and such), I see
that when the add() ufunc is called, 'n' is copied into a temporary
buffer by the iterator. The ufunc then gets the buffer as its data.
My question is: why is this buffering needed? It seems wasteful, since
no casting is required here, no special alignment problems and also
'n' is contiguously laid out in memory. It seems that it would be more
efficient to just use 'n' in the ufunc instead of passing in the
buffer. What am I missing?
Thanks in advance,
Eli
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Eli Bendersky
2015-11-25 00:49:44 UTC
Post by Sebastian Berg
Post by Eli Bendersky
Hello,
I'm trying to understand the buffering done by the Numpy iterator
interface (the new post 1.6-one) when running ufuncs on arrays that
In [35]: m = np.arange(16).reshape(4,4)
In [37]: n = np.arange(4)
In [39]: m + n
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])
On first sight this seems true. However, there is one other point to
consider here. The inner ufunc loop can only handle a single stride. The
contiguous array `n` has to be iterated as if it had the strides
`(0, 8)`, which is not the strides of the contiguous array `m` which can
be "unrolled" to 1-D. Those effective strides are thus not contiguous
for the inner ufunc loop and cannot be unrolled into a single ufunc
call!
The optimization (which might kick in a bit more broadly maybe), is thus
that the number of inner loop calls is minimized, whether that is worth
it, I am not sure, it may well be that there is some worthy optimization
possible here.
Note however, that this does not occur for large inner loop sizes
```
In [18]: n = np.arange(40000)
In [19]: m = np.arange(160000).reshape(4,40000)
In [20]: o = m + n
Iterator: Checking casting for operand 0
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 1
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('int64')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
Iterator: Copying inputs to buffers
Iterator: Expanding inner loop size from 8192 to 40000 since buffering
wasn't needed
Any buffering needed: 0
Iterator: Finished copying inputs to buffers (buffered size is 40000)
```
The heuristic in the code says that if we can use a single stride and
the inner loop is larger than the buffer size (which I assume is the
default buffer size, and can change), then it's "is_onestride" and no
buffering is done.

So this led me to explore around this threshold (8192 items by default
on my machine), and indeed we can notice some funny behavior:

In [51]: %%timeit n = np.arange(8192); m = np.arange(8192*40).reshape(40,8192)
o = m + n
   ....:
1000 loops, best of 3: 274 µs per loop

In [52]: %%timeit n = np.arange(8292); m = np.arange(8292*40).reshape(40,8292)
o = m + n
   ....:
1000 loops, best of 3: 229 µs per loop

So, given this, it's not very clear why the "optimization" kicks in.
Buffering for small sizes seems like a mistake.
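For reference, the comparison can be reproduced with a small benchmark; `bench` here is a hypothetical helper, and the absolute timings (and whether the crossover shows up at all) depend on the machine and NumPy version:

```python
import timeit
import numpy as np

def bench(k, cols=40):
    # Hypothetical helper: best-of-5 time for a broadcast add at inner
    # loop width k.
    n = np.arange(k)
    m = np.arange(k * cols).reshape(cols, k)
    return min(timeit.repeat(lambda: m + n, number=100, repeat=5))

at_threshold = bench(8192)    # inner loop == default buffer size: buffered
past_threshold = bench(8292)  # inner loop > buffer size: no buffering
print(at_threshold, past_threshold)
```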

Eli
Post by Sebastian Berg
Anyway, feel free to have a look ;). The code is not the most read one
in NumPy, and it would not surprise me a lot if you can find something
to tweak.
- Sebastian
Post by Eli Bendersky
If I instrument Numpy (setting NPY_IT_DBG_TRACING and such), I see
that when the add() ufunc is called, 'n' is copied into a temporary
buffer by the iterator. The ufunc then gets the buffer as its data.
My question is: why is this buffering needed? It seems wasteful, since
no casting is required here, no special alignment problems and also
'n' is contiguously laid out in memory. It seems that it would be more
efficient to just use 'n' in the ufunc instead of passing in the
buffer. What am I missing?
Thanks in advance,
Eli
Sebastian Berg
2015-11-25 08:21:25 UTC
On Mon, Nov 23, 2015 at 2:09 PM, Sebastian Berg
Post by Eli Bendersky
Hello,
I'm trying to understand the buffering done by the Numpy iterator
interface (the new post 1.6-one) when running ufuncs on arrays that
In [35]: m = np.arange(16).reshape(4,4)
In [37]: n = np.arange(4)
In [39]: m + n
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])
On first sight this seems true. However, there is one other point to
consider here. The inner ufunc loop can only handle a single stride. The
contiguous array `n` has to be iterated as if it had the strides
`(0, 8)`, which is not the strides of the contiguous array `m` which can
be "unrolled" to 1-D. Those effective strides are thus not contiguous
for the inner ufunc loop and cannot be unrolled into a single ufunc
call!
The optimization (which might kick in a bit more broadly maybe), is thus
that the number of inner loop calls is minimized, whether that is worth
it, I am not sure, it may well be that there is some worthy optimization
possible here.
Note however, that this does not occur for large inner loop sizes
```
In [18]: n = np.arange(40000)
In [19]: m = np.arange(160000).reshape(4,40000)
In [20]: o = m + n
Iterator: Checking casting for operand 0
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 1
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('int64')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
Iterator: Copying inputs to buffers
Iterator: Expanding inner loop size from 8192 to 40000 since buffering
wasn't needed
Any buffering needed: 0
Iterator: Finished copying inputs to buffers (buffered size is 40000)
```
The heuristic in the code says that if we can use a single stride and
that's larger than the buffer size (which I assume is the default
buffer size, and can change) then it's "is_onestride" and no buffering
is done.
So this led me to explore around this threshold (8192 items by default
on my machine), and indeed we can notice funny behavior:
In [51]: %%timeit n = np.arange(8192); m = np.arange(8192*40).reshape(40,8192)
o = m + n
1000 loops, best of 3: 274 µs per loop
In [52]: %%timeit n = np.arange(8292); m = np.arange(8292*40).reshape(40,8292)
o = m + n
1000 loops, best of 3: 229 µs per loop
So, given this, it's not very clear why the "optimization" kicks in.
Buffering for small sizes seems like a mistake.
I am pretty sure it is not generally a mistake. Consider the case of a
10000x3 array (note that shrinking the buffer can have a great
advantage, though; I am not sure if this is usually done).
If you have (10000, 3) + (3,) arrays, then the ufunc outer loop would
have 10000x overhead. Doing the buffering (which I believe has some
extra code to make it faster) will lower this to a few ufunc inner loop
calls.
I have not timed it, but I would be a bit surprised if it was not faster
in this case at least. Even calling a C function (and looping) for an
inner loop of 3 elements should be quite a bit of overhead, and my guess
is more overhead than the buffering.
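One way to make this concrete is to count inner-loop chunks with `np.nditer` (a sketch; the exact buffered chunk count depends on the default buffer size):

```python
import numpy as np

m = np.arange(30000).reshape(10000, 3)
n = np.arange(3)

# Without buffering, the broadcast `n` (stride 0 on axis 0) forces one
# inner-loop call per row: 10000 calls of length 3.
unbuffered = sum(1 for _ in np.nditer([m, n], flags=['external_loop']))

# With buffering, `n` is copied into a buffer so many rows are handled
# per inner-loop call, cutting the count to a handful.
buffered = sum(1 for _ in np.nditer(
    [m, n], flags=['external_loop', 'buffered']))
print(unbuffered, buffered)
```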

- Sebastian
Eli
Anyway, feel free to have a look ;). The code is not the most read one
in NumPy, and it would not surprise me a lot if you can find something
to tweak.
- Sebastian
Post by Eli Bendersky
If I instrument Numpy (setting NPY_IT_DBG_TRACING and such), I see
that when the add() ufunc is called, 'n' is copied into a temporary
buffer by the iterator. The ufunc then gets the buffer as its data.
My question is: why is this buffering needed? It seems wasteful, since
no casting is required here, no special alignment problems and also
'n' is contiguously laid out in memory. It seems that it would be more
efficient to just use 'n' in the ufunc instead of passing in the
buffer. What am I missing?
Thanks in advance,
Eli
Eli Bendersky
2015-11-27 16:13:52 UTC
Post by Sebastian Berg
On Mon, Nov 23, 2015 at 2:09 PM, Sebastian Berg
Post by Eli Bendersky
Hello,
I'm trying to understand the buffering done by the Numpy iterator
interface (the new post 1.6-one) when running ufuncs on arrays that
In [35]: m = np.arange(16).reshape(4,4)
In [37]: n = np.arange(4)
In [39]: m + n
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])
On first sight this seems true. However, there is one other point to
consider here. The inner ufunc loop can only handle a single stride. The
contiguous array `n` has to be iterated as if it had the strides
`(0, 8)`, which is not the strides of the contiguous array `m` which can
be "unrolled" to 1-D. Those effective strides are thus not contiguous
for the inner ufunc loop and cannot be unrolled into a single ufunc
call!
The optimization (which might kick in a bit more broadly maybe), is thus
that the number of inner loop calls is minimized, whether that is worth
it, I am not sure, it may well be that there is some worthy optimization
possible here.
Note however, that this does not occur for large inner loop sizes
```
In [18]: n = np.arange(40000)
In [19]: m = np.arange(160000).reshape(4,40000)
In [20]: o = m + n
Iterator: Checking casting for operand 0
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 1
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('int64')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
Iterator: Copying inputs to buffers
Iterator: Expanding inner loop size from 8192 to 40000 since buffering
wasn't needed
Any buffering needed: 0
Iterator: Finished copying inputs to buffers (buffered size is 40000)
```
The heuristic in the code says that if we can use a single stride and
that's larger than the buffer size (which I assume is the default
buffer size, and can change) then it's "is_onestride" and no buffering
is done.
So this led me to explore around this threshold (8192 items by default
on my machine), and indeed we can notice funny behavior:
In [51]: %%timeit n = np.arange(8192); m = np.arange(8192*40).reshape(40,8192)
o = m + n
1000 loops, best of 3: 274 µs per loop
In [52]: %%timeit n = np.arange(8292); m = np.arange(8292*40).reshape(40,8292)
o = m + n
1000 loops, best of 3: 229 µs per loop
So, given this, it's not very clear why the "optimization" kicks in.
Buffering for small sizes seems like a mistake.
I am pretty sure it is not generally a mistake. Consider the case of an
10000x3 array (note that shrinking the buffer can have great advantage
though, I am not sure if this is done usually).
If you have (10000, 3) + (3,) arrays, then the ufunc outer loop would
have 10000x overhead. Doing the buffering (which I believe has some
extra code to be faster), will lower this to a few ufunc inner loop
calls.
I have not timed it, but I would be a bit surprised if it was not faster
in this case at least. Even calling a C function (and looping) for an
inner loop of 3 elements, should be quite a bit of overhead, and my
guess is more overhead than the buffering.
Yes, that's a good point for arrays shaped like this. I guess all this
leaves us with is the realization that the heuristic *could* be tuned
somewhat for arrays where the inner dimension is large: as the case I
demonstrated above shows, it's nonsensical for a computation to get 20%
faster when the array size increases past an arbitrary threshold.

I'll see if I can find some time to dig more into this and figure out
where the knobs for tweaking the heuristic are.

Thanks for the enlightening discussion, Sebastian

Eli
Post by Sebastian Berg
- Sebastian
Eli
Anyway, feel free to have a look ;). The code is not the most read one
in NumPy, and it would not surprise me a lot if you can find something
to tweak.
- Sebastian
Post by Eli Bendersky
If I instrument Numpy (setting NPY_IT_DBG_TRACING and such), I see
that when the add() ufunc is called, 'n' is copied into a temporary
buffer by the iterator. The ufunc then gets the buffer as its data.
My question is: why is this buffering needed? It seems wasteful, since
no casting is required here, no special alignment problems and also
'n' is contiguously laid out in memory. It seems that it would be more
efficient to just use 'n' in the ufunc instead of passing in the
buffer. What am I missing?
Thanks in advance,
Eli