[Numpy-discussion] Bug in np.nonzero / Should index returning functions return ndarray subclasses?

Discussion:

Jaime Fernández del Río

2015-05-09 17:48:46 UTC

There is a reported bug (issue #5837
<https://github.com/numpy/numpy/issues/5837>) regarding different returns
from np.nonzero with 1-D vs higher dimensional arrays. A full summary of

>>> import numpy as np
>>> class C(np.ndarray): pass
...
>>> a = np.arange(6).view(C)
>>> b = np.arange(6).reshape(2, 3).view(C)
>>> anz = a.nonzero()
>>> bnz = b.nonzero()
>>> type(anz[0])

>>> anz[0].flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
>>> anz[0].base
>>> type(bnz[0])

>>> bnz[0].flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  UPDATEIFCOPY : False
>>> bnz[0].base
array([[0, 1],
       [0, 2],
       [1, 0],
       [1, 1],
       [1, 2]])

The original bug report was only concerned with the non-writeability of
higher dimensional array returns, but there are more differences: 1-D
always returns an ndarray that owns its memory and is writeable, but higher
dimensional arrays return views, of the type of the original array, that
are non-writeable.
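
For anyone who wants to check this on their own install, here is a minimal sketch that collects the properties compared above into one report. The helper name describe is made up for illustration, and the values it prints depend on the NumPy version, so nothing here should be read as the expected output.

```python
import numpy as np


class C(np.ndarray):
    """Trivial ndarray subclass, as in the session above."""


def describe(index_arrays):
    """Summarise each array in the tuple returned by nonzero().

    Hypothetical helper, not part of NumPy.
    """
    return [
        {
            "type": type(ix).__name__,
            "owndata": bool(ix.flags.owndata),
            "writeable": bool(ix.flags.writeable),
            "has_base": ix.base is not None,
        }
        for ix in index_arrays
    ]


a = np.arange(6).view(C)
b = np.arange(6).reshape(2, 3).view(C)

report_1d = describe(a.nonzero())
report_2d = describe(b.nonzero())

# Print one row per returned index array; values vary across NumPy versions.
for label, report in (("1-D", report_1d), ("2-D", report_2d)):
    for axis, row in enumerate(report):
        print(label, "axis", axis, row)
```

If the 1-D and 2-D rows differ in type, owndata or writeable, the install still shows the inconsistency described in this post.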

I have a branch that attempts to fix this by making both 1-D and n-D arrays:

1. return a view, never the base array,
2. return an ndarray, never a subclass, and
3. return a writeable view.

I guess the most controversial choice is #2, and in fact making that change
breaks a few tests. I nevertheless think that all of the index returning
functions (nonzero, argsort, argmin, argmax, argpartition) should always
return a bare ndarray, not a subclass. I'd be happy to be corrected, but I
can't think of any situation in which preserving the subclass would be
needed for these functions.
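
As a rough user-level illustration of the proposed behavior (not the code in the branch), the sketch below coerces index results to plain, writeable ndarrays. The helper as_plain_indices is a hypothetical name introduced only for this example.

```python
import numpy as np


class C(np.ndarray):
    """Trivial ndarray subclass used only for illustration."""


def as_plain_indices(result):
    """Hypothetical helper: return base-class, writeable copies of index results."""
    if isinstance(result, tuple):
        return tuple(as_plain_indices(r) for r in result)
    # subok=False drops the subclass; copy=True gives an array that owns its data.
    return np.array(result, copy=True, subok=False)


data = np.array([3, 0, 2, 0, 1]).view(C)

nz = as_plain_indices(data.nonzero())
order = as_plain_indices(data.argsort(kind="stable"))

print(type(nz[0]).__name__, nz[0].flags.writeable)
print(type(order).__name__, order.flags.writeable)
```

Whatever the installed NumPy returns from nonzero or argsort, the wrapped results are bare ndarrays that can be modified in place.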

Since we are changing the returns of a few other functions in 1.10
(diagonal, diag, ravel), it may be a good moment to revisit the behavior
for these other functions. Any thoughts?

Jaime

--
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his
plans for world domination.

Nathaniel Smith

2015-05-09 18:42:50 UTC

Post by Jaime Fernández del Río
1. return a view, never the base array,

This doesn't matter, does it? "View" isn't a thing, only "view of" is
meaningful. And in this case, none of the returned arrays share any memory
with any other arrays that the user has access to... so whether they were
created as a view or not should be an implementation detail that's
transparent to the user?
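
One way to check the "shares memory with what?" question directly is np.shares_memory. This is only a sketch on a plain 2-D array; whether the index arrays have a base is printed but should be treated as an implementation detail.

```python
import numpy as np

b = np.arange(6).reshape(2, 3)
rows, cols = np.nonzero(b)

# The index arrays are freshly computed, so they cannot alias the input data.
shares_with_input = [np.shares_memory(b, ix) for ix in (rows, cols)]
print(shares_with_input)

# Whether each index array is itself a view of some hidden buffer is not
# something user code should rely on.
print(rows.base is None, cols.base is None)
```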

Post by Jaime Fernández del Río
2. return an ndarray, never a subclass, and
3. return a writeable view.
I guess the most controversial choice is #2, and in fact making that
change breaks a few tests. I nevertheless think that all of the index
returning functions (nonzero, argsort, argmin, argmax, argpartition) should
always return a bare ndarray, not a subclass. I'd be happy to be corrected,
but I can't think of any situation in which preserving the subclass would
be needed for these functions.

I also can't see any logical reason why the return type of these functions
has anything to do with the type of the inputs. You can index me with my
phone number but my phone number is not a person. OTOH logic and ndarray
subclassing don't have much to do with each other; the practical effect is
probably more important. Looking at the subclasses I know about (masked
arrays, np.matrix, and astropy quantities), though, I also can't see much
benefit in copying the subclass of the input, and the fact that we were
never consistent about this suggests that people probably aren't depending
on it too much.

So in summary my feeling is: +1 to making them writable, no objection to
the view thing (though I don't see how it matters), and provisional +1 to
consistently returning ndarray (to be revised if the people who use the
subclassing functionality disagree).
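
To make the phone-number point concrete, here is a small sketch, using a throwaway subclass C, showing that bare ndarray indices lose nothing: indexing the subclass with them still gives back the subclass.

```python
import numpy as np


class C(np.ndarray):
    """Trivial ndarray subclass used only for illustration."""


b = np.arange(6).reshape(2, 3).view(C)

# Force the index arrays to base-class ndarrays, whatever nonzero() returned.
idx = tuple(np.asarray(ix) for ix in b.nonzero())

# Advanced indexing of the subclass keeps the subclass on the *data*.
selected = b[idx]
print(type(selected).__name__, np.asarray(selected).tolist())
```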

-n

Benjamin Root

2015-05-09 19:53:31 UTC

Absolutely, it should be writable. As for subclassing, that might be messy.
Consider the following:

inds = np.where(data > 5)

In that case, I'd expect a normal, bog-standard ndarray because that is
what you use for indexing (although pandas might have a good argument for
having it return one of their special indexing types if "data" was a pandas
array...). Next:

foobar = np.where(data > 5, 1, 2)

Again, I'd expect a normal, bog-standard ndarray because the scalar
elements are very simple. This question gets very complicated when
considering array arguments. Consider:

merged_data = np.where(data > 5, data, data2)

So, what should "merged_data" be? If both "data" and "data2" are the same
types, then it would be reasonable to return the same type, if possible.
But what if they aren't the same? Maybe use array_priority to determine the
return type? Or, perhaps it does make sense to say "sod it all" and always
return an ndarray?
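
For reference, the two three-argument calls above look like this on plain ndarrays. The sample values for data and data2 are made up, and the sketch says nothing about what subclasses should get back.

```python
import numpy as np

# Made-up sample data, only for illustration.
data = np.array([2, 7, 4, 9])
data2 = np.array([20, 70, 40, 90])
mask = data > 5

# Scalars as the two branches: pick 1 where the mask holds, else 2.
foobar = np.where(mask, 1, 2)

# Arrays as the two branches: take from data where the mask holds, else data2.
merged_data = np.where(mask, data, data2)

print(foobar.tolist())
print(merged_data.tolist())
```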

I don't know the answer. I do find it interesting that the result from a
multi-dimensional array is not writable. I don't know why I have never
encountered that.

Ben Root

_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Nathaniel Smith

2015-05-09 20:03:07 UTC

Post by Benjamin Root
Absolutely, it should be writable. As for subclassing, that might be
messy. Consider the following:
inds = np.where(data > 5)
In that case, I'd expect a normal, bog-standard ndarray because that is
what you use for indexing (although pandas might have a good argument for
having it return one of their special indexing types if "data" was a pandas
array...).

Pandas doesn't subclass ndarray (anymore), so they're irrelevant to this
particular discussion :-). Of course they're an argument for having a
cleaner more general way of allowing non-ndarray array-like objects, but
the legacy subclassing system will never be that.

Post by Benjamin Root
foobar = np.where(data > 5, 1, 2)
Again, I'd expect a normal, bog-standard ndarray because the scalar
elements are very simple. This question gets very complicated when
considering array arguments. Consider:
merged_data = np.where(data > 5, data, data2)
So, what should "merged_data" be? If both "data" and "data2" are the same
types, then it would be reasonable to return the same type, if possible.
But what if they aren't the same? Maybe use array_priority to determine the
return type? Or, perhaps it does make sense to say "sod it all" and always
return an ndarray?

Not sure what this has to do with Jaime's post about nonzero? There is
indeed a potential question about what 3-argument where() should do with
subclasses, but that's effectively a different operation entirely and to
discuss it we'd need to know things like what it historically has done and
why that was causing problems.

-n

Benjamin Root

2015-05-09 20:27:03 UTC

Post by Nathaniel Smith
Not sure what this has to do with Jaime's post about nonzero? There is
indeed a potential question about what 3-argument where() should do with
subclasses, but that's effectively a different operation entirely and to
discuss it we'd need to know things like what it historically has done and
why that was causing problems.

Because my train of thought started at np.nonzero(), which I have always
just mentally mapped to np.where(), and then... squirrel!

Indeed, np.where() has no bearing here.

Ben Root

Nathaniel Smith

2015-05-09 20:56:24 UTC

Post by Benjamin Root

Because my train of thought started at np.nonzero(), which I have always
just mentally mapped to np.where(), and then... squirrel!
Indeed, np.where() has no bearing here.

Ah, gotcha :-).

There is an argument that we should try to reduce this confusion by
nudging people to use np.nonzero() consistently instead of np.where(),
via the documentation and/or a warning message...
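
The source of the confusion is easy to show: with a single argument, np.where(cond) gives the same index tuple as np.nonzero(cond). A short sketch with made-up data:

```python
import numpy as np

# Made-up sample data, only for illustration.
data = np.array([[0, 6, 3],
                 [8, 0, 7]])
cond = data > 5

via_where = np.where(cond)      # one-argument form
via_nonzero = np.nonzero(cond)  # the spelling suggested above

same = all(np.array_equal(w, n) for w, n in zip(via_where, via_nonzero))
print(same, [ix.tolist() for ix in via_nonzero])
```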

--
Nathaniel J. Smith -- http://vorpus.org

Stephan Hoyer

2015-05-10 01:53:42 UTC

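With regards to np.where -- shouldn't where be a ufunc, so subclasses or
other array-likes can control its behavior with __numpy_ufunc__?

As for the other indexing functions, I don't have a strong opinion about
how they should handle subclasses. But it is certainly tricky to attempt to
handle arbitrary subclasses. I would agree that the least error
prone thing to do is usually to return base ndarrays. Better to force
subclasses to override methods explicitly.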

Marten van Kerkwijk

2015-05-12 15:49:07 UTC

Agreed that indexing functions should return bare `ndarray`. Note that in
Jaime's PR one can override it anyway by defining __nonzero__. -- Marten

Post by Stephan Hoyer
With regards to np.where -- shouldn't where be a ufunc, so subclasses or
other array-likes can control its behavior with __numpy_ufunc__?
As for the other indexing functions, I don't have a strong opinion about
how they should handle subclasses. But it is certainly tricky to attempt to
handle arbitrary subclasses. I would agree that the least error
prone thing to do is usually to return base ndarrays. Better to force
subclasses to override methods explicitly.
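
A minimal sketch of what an explicit override could look like, assuming the hook in question is the nonzero method on the subclass (that reading, and the class name Tagged, are assumptions for illustration and are not taken from the PR):

```python
import numpy as np


class Tagged(np.ndarray):
    """Hypothetical subclass that wants its own type back from nonzero()."""

    def nonzero(self):
        # Compute the indices on a base-class view, then re-wrap explicitly.
        plain = np.asarray(self).nonzero()
        return tuple(ix.view(Tagged) for ix in plain)


t = np.array([0, 4, 0, 5]).view(Tagged)
indices = t.nonzero()
print(type(indices[0]).__name__, np.asarray(indices[0]).tolist())
```

The subclass opts in to getting its own type back, while the default for everyone else stays a bare ndarray.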