[Numpy-discussion] ufunc reduceat behavior on empty slices
Erik Brinkman
2016-07-29 18:42:23 UTC
Hi,

The behavior of a ufunc's reduceat on empty slices seems a little strange,
and I wonder if there's a reason behind it / if there's a route to
potentially changing it. First, I'll go into a little background.

I've been making a lot of use of ufuncs' reduceat functionality on staggered
arrays. In general, I'll have "n" arrays, where the i-th has size "s[i]", and
I'll store them concatenated in one array "x", such that "s.sum() == x.size".
reduceat is great because I use

ufunc.reduceat(x, np.insert(s[:-1].cumsum(), 0, 0))

to get some summary information about each array. However, reduceat seems
to behave strangely for empty slices. To make things concrete, let's assume:

import numpy as np
s = np.array([3, 0, 2])
x = np.arange(s.sum())
inds = np.insert(s[:-1].cumsum(), 0, 0)
# [0, 3, 3]
np.add.reduceat(x, inds)
# [3, 3, 7] not [3, 0, 7]
# This is distinct from
np.fromiter(map(np.add.reduce, np.array_split(x, inds[1:])), x.dtype, s.size)
# [3, 0, 7] what I wanted

The current documentation
<http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduceat.html>
on reduceat first states:

For i in range(len(indices)), reduceat computes
ufunc.reduce(a[indices[i]:indices[i+1]])

That would suggest the outcome I expected. However, the examples section
gives a number of examples that contradict that statement and instead
suggest that the actual algorithm is more akin to:

ufunc.reduce(a[indices[i]:indices[i+1]]) if indices[i+1] > indices[i] else
a[indices[i]]
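
For what it's worth, here's a pure-Python sketch of my own (not anything
official) that reproduces what reduceat actually appears to do:

import numpy as np

def reduceat_current(ufunc, a, indices):
    # Pair each start index with the next one (or len(a) for the last slice)
    ends = list(indices[1:]) + [len(a)]
    # An empty slice yields a[indices[i]] instead of the identity
    return np.array([ufunc.reduce(a[i:j]) if j > i else a[i]
                     for i, j in zip(indices, ends)])

reduceat_current(np.add, np.arange(5), [0, 3, 3])
# array([3, 3, 7]), same as np.add.reduceat(np.arange(5), [0, 3, 3])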

Looking at the source
<https://github.com/numpy/numpy/blob/2af06c804931aae4b30bb3349bc60271b0b65381/numpy/core/src/umath/ufunc_object.c#L3697>,
it seems like it's copying a[indices[i]] and then, while there are more
elements in the slice, it keeps reducing, which produces this unexpected
behavior. It seems like the proper thing to do would be to start with
ufunc.identity and then reduce. This is slightly less performant than
what's implemented, but more "correct." There could, of course, just be a
switch to copy the identity only when the slice is empty.
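
Concretely, what I'd expect is just the docstring formula applied
literally; since ufunc.reduce on an empty slice already returns
ufunc.identity (for ufuncs that have an identity), a sketch of the
expected behavior is:

def reduceat_expected(ufunc, a, indices):
    # Same as reduceat_current above, but empty slices fall through to
    # ufunc.reduce, which returns ufunc.identity for zero-length input
    ends = list(indices[1:]) + [len(a)]
    return np.array([ufunc.reduce(a[i:j]) for i, j in zip(indices, ends)])

reduceat_expected(np.add, np.arange(5), [0, 3, 3])
# array([3, 0, 7])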

Is there a reason it's implemented like this? Is it just for performance,
or is this strange behavior *useful* somewhere? It seems like "fixing" this
could be bad, because it would change somewhat-documented functionality in a
backwards-incompatible way. What would the best approach to "fixing" this
be? Add another function "reduceat_"? Add a flag to reduceat to do the
proper thing for empty slices?

Finally, is there a good way to work around this? For now, I think I'm just
going to mask out the empty slices and use insert to add them back in (a
sketch of that workaround is below), but if I'm missing an obvious solution,
I'll look at that too. I need to mask them out because np.add.reduceat(x, [5])
would ideally return 0, but instead it throws an error, since the index 5 is
out of range for an array of size 5...
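
In case it's useful, here's one way to implement that masking workaround
(using boolean assignment rather than np.insert, though the idea is the
same); it assumes the ufunc has an identity, like np.add:

import numpy as np

s = np.array([3, 0, 2])
x = np.arange(s.sum())
inds = np.insert(s[:-1].cumsum(), 0, 0)

nonempty = s > 0
# Fill the output with the ufunc's identity (0 for np.add), then run
# reduceat over the start indices of the non-empty slices only
out = np.full(s.size, np.add.identity, dtype=x.dtype)
out[nonempty] = np.add.reduceat(x, inds[nonempty])
# out is [3, 0, 7], and a trailing empty slice no longer errors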

Thanks for indulging my curiosity,
Erik
Stephan Hoyer
2016-07-29 20:14:03 UTC
Jaime brought up the same issue recently, along with some other issues for
ufunc.reduceat:
https://mail.scipy.org/pipermail/numpy-discussion/2016-March/075199.html

I completely agree with both of you that the current behavior for empty
slices is very strange and should be changed to remove the special case.
Nathaniel Smith voiced the same opinion on the GitHub issue [1].

I think the path forward here (as Nathaniel writes) is pretty clear:
1. Start issuing a FutureWarning about a future change.
2. Fix this in a release or two.

[1] https://github.com/numpy/numpy/issues/834
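
To illustrate step 1, a user-level shim along these lines could mimic the
kind of warning NumPy might emit (a hypothetical sketch, not the actual
NumPy change):

import warnings
import numpy as np

def reduceat_warn(ufunc, a, indices):
    # Hypothetical shim: warn whenever an empty slice would hit the
    # current a[indices[i]] special case
    indices = np.asarray(indices)
    ends = np.append(indices[1:], len(a))
    if np.any(ends <= indices):
        warnings.warn("reduceat on empty slices will change to return "
                      "ufunc.identity in a future release", FutureWarning)
    return ufunc.reduceat(a, indices)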

Post by Erik Brinkman
The behavior of a ufunc's reduceat on empty slices seems a little strange,
and I wonder if there's a reason behind it / if there's a route to
potentially changing it. [...]