On Mon, Nov 23, 2015 at 2:09 PM, Sebastian Berg wrote:
Post by Eli Bendersky
Hello,

I'm trying to understand the buffering done by the NumPy iterator
interface (the new post-1.6 one) when running ufuncs on arrays that
require broadcasting:

```
In [35]: m = np.arange(16).reshape(4,4)

In [37]: n = np.arange(4)

In [39]: m + n
Out[39]:
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])
```
At first sight this seems true. However, there is one other point to
consider here: the inner ufunc loop can only handle a single stride. The
contiguous array `n` has to be iterated as if it had the strides
`(0, 8)`, which does not match the contiguous array `m`, whose
iteration can be "unrolled" to 1-D. Those effective strides are thus
not contiguous for the inner ufunc loop, and the operation cannot be
unrolled into a single ufunc call!
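The effective strides can be checked directly (a small sketch; `np.broadcast_to` is used here only to illustrate what the iterator sees, it is not what the ufunc machinery calls internally):

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# m is C-contiguous: all 16 elements can be walked with one constant
# stride, so the 2-D iteration can be unrolled to a single 1-D loop.
assert m.strides == (4 * m.itemsize, m.itemsize)

# Broadcasting n against m's shape gives effective strides (0, itemsize):
# no single constant stride covers all 16 elements, so the inner ufunc
# loop cannot be unrolled the same way.
b = np.broadcast_to(n, (4, 4))
assert b.strides == (0, n.itemsize)
assert np.shares_memory(n, b)  # a view with a zero stride, not a copy
```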
The optimization (which might kick in a bit more broadly, maybe) is thus
that the number of inner loop calls is minimized. Whether that is worth
it, I am not sure; it may well be that there is some worthwhile
optimization possible here.
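The difference in inner-loop calls can be made visible with `np.nditer` (an illustrative sketch; the exact chunk sizes are an implementation detail and may vary across NumPy versions):

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# Without buffering, the zero-stride broadcast operand stops the
# iterator from coalescing the two axes, so the external loop hands
# out one chunk per row: four inner-loop calls of 4 elements each.
chunks_nobuf = [len(a) for a, b in np.nditer([m, n], flags=['external_loop'])]
print(chunks_nobuf)

# With buffering, n is copied (repeated) into a temporary buffer whose
# stride matches m, so the iterator can hand out fewer, larger chunks
# (typically a single chunk of all 16 elements here).
chunks_buf = [len(a) for a, b in
              np.nditer([m, n], flags=['external_loop', 'buffered'])]
print(chunks_buf)
```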
Note, however, that this does not occur for large inner loop sizes:
```
In [18]: n = np.arange(40000)
In [19]: m = np.arange(160000).reshape(4,40000)
In [20]: o = m + n
Iterator: Checking casting for operand 0
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 1
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('int64')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
Iterator: Copying inputs to buffers
Iterator: Expanding inner loop size from 8192 to 40000 since buffering wasn't needed
Any buffering needed: 0
Iterator: Finished copying inputs to buffers (buffered size is 40000)
```
The heuristic in the code says that if we can use a single stride and
the resulting loop size is larger than the buffer size (which I assume
is the default buffer size, and can be changed), then it's
"is_onestride" and no buffering is done.
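For reference, that buffer size is exposed at the Python level (a small sketch; the 8192-element default matches the threshold seen in the trace above):

```python
import numpy as np

# Default ufunc buffer size, in elements (8192 unless changed).
print(np.getbufsize())

# It can be changed at runtime; here we double it and then restore it.
old = np.getbufsize()
np.setbufsize(2 * old)
assert np.getbufsize() == 2 * old
np.setbufsize(old)
```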
So this led me to explore around this threshold (8192 items by default):

```
In [51]: %%timeit n = np.arange(8192); m = np.arange(8192*40).reshape(40,8192)
   ....: o = m + n
1000 loops, best of 3: 274 µs per loop

In [52]: %%timeit n = np.arange(8292); m = np.arange(8292*40).reshape(40,8292)
   ....: o = m + n
1000 loops, best of 3: 229 µs per loop
```
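The same comparison as a standalone script, a sketch using the `timeit` module (absolute timings will of course differ by machine and NumPy version):

```python
import timeit
import numpy as np

def bench(k, number=1000):
    """Time m + n, where n has k elements, around the buffer threshold."""
    n = np.arange(k)
    m = np.arange(k * 40).reshape(40, k)
    return min(timeit.repeat(lambda: m + n, number=number, repeat=3)) / number

# 8192 matches the default buffer size; 8292 is just past it, so the
# single-stride (unbuffered) path should be taken for the larger size.
for k in (8192, 8292):
    print(f"k={k}: {bench(k) * 1e6:.0f} us per add")
```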
So, given this, it's not very clear why the "optimization" kicks in.
Buffering for small sizes seems like a mistake.
I am pretty sure it is not generally a mistake. Consider the case of an
operation with very short inner loops (though, I am not sure if this is
done usually): without buffering you could have 10000x overhead from the
extra inner loop calls. Doing the buffering (which I believe has some
optimizations of its own) minimizes those calls, in this case at least.
Even calling a C function (and looping) for an inner loop that short is,
my guess, more overhead than the buffering.
Eli
Anyway, feel free to have a look ;). The code is not the most readable
one in NumPy, and it would not surprise me a lot if you can find
something to tweak.
- Sebastian
Post by Eli Bendersky
If I instrument NumPy (setting NPY_IT_DBG_TRACING and such), I see
that when the add() ufunc is called, 'n' is copied into a temporary
buffer by the iterator. The ufunc then gets the buffer as its data.

My question is: why is this buffering needed? It seems wasteful, since
no casting is required here, there are no special alignment problems,
and 'n' is contiguously laid out in memory. It seems that it would be
more efficient to just use 'n' in the ufunc instead of passing in the
buffer. What am I missing?

Thanks in advance,
Eli
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion