[Numpy-discussion] subclassing ndarray and keeping same ufunc behavior
Stuart Reynolds
2016-11-15 17:37:42 UTC
I'm trying to subclass an ndarray so that I can add some additional fields.
When I do this, however, I get odd new behavior when my object is passed to
a variety of numpy functions. For example, nanmin now returns an
object of the type of my new array class, whereas previously I'd get a
float64. Why? Is this a bug with nanmin or my class?

import numpy as np

class NDArrayWithColumns(np.ndarray):
    def __new__(cls, obj, columns=None):
        obj = obj.view(cls)
        obj.columns = tuple(columns)
        return obj

    def __array_finalize__(self, obj):
        if obj is None: return
        self.columns = getattr(obj, 'columns', None)

NAN = float("nan")
r = np.array([1.,0.,1.,0.,1.,0.,1.,0.,NAN, 1., 1.])
print "MIN", np.nanmin(r), type(np.nanmin(r))

gives:

MIN 0.0 <type 'numpy.float64'>

but
>>> r = NDArrayWithColumns(r, ["a"])
>>> print "MIN", np.nanmin(r), type(np.nanmin(r))
MIN 0.0 <class '__main__.NDArrayWithColumns'>
>>> print r.shape # ?!
(11,)

Note the change in type, and also that str(np.nanmin(r)) shows 1 field, not
11 as indicated by its shape. This seems wrong. Is there a way to get my
subclass to behave more like an ndarray?

I realize from the docs that I can override __array_wrap__, but it's not
clear to me how to use it to solve this issue, or whether it's the right
tool.

In case you're interested, I'm subclassing because I'd like to track column
names in matrices of a single type. This is a pretty common wish in scikit
pipelines. Structured arrays and record type arrays allow for varying type.
Pandas provides this functionality, but dealing with numpy arrays is easier
(and more efficient) when writing cython extensions. Also, I think the
structured arrays and record types are unlikely to play nice with cython
because they're more freely typed -- I want to deal exclusively with arrays
of doubles.

Any thoughts on how to subclass ndarray and keep original behavior in
ufuncs?
Marten van Kerkwijk
2016-11-15 18:48:11 UTC
Hi Stuart,

It certainly seems correct behaviour to return the subclass you
created: after all, you might want to keep the information on
`columns` (e.g., consider doing nanmin along a given axis). Indeed, we
certainly want to keep the unit in astropy's Quantity (which also is a
subclass of ndarray).

On the shape: shouldn't that be print(np.nanmin(r).shape)??
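
To make the shape point concrete, here is a minimal sketch (written for Python 3, with a shortened input array and a `None` guard on `columns` that are not in the original post). It inspects the result of the reduction instead of the input, and shows two ways to get a plain scalar back: `float()` on the result, or viewing the input as a base ndarray with `np.asarray` before calling nanmin. It is an illustration, not a statement of how the subclass ought to be designed.

```python
import numpy as np


class NDArrayWithColumns(np.ndarray):
    """The subclass from the question, with a None guard added (assumption)."""

    def __new__(cls, obj, columns=None):
        obj = np.asarray(obj).view(cls)
        obj.columns = tuple(columns) if columns is not None else None
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.columns = getattr(obj, "columns", None)


# Shortened example data (assumption: six elements instead of eleven).
r = NDArrayWithColumns(np.array([1.0, 0.0, 1.0, 0.0, np.nan, 1.0]), ["a"])

result = np.nanmin(r)

# r.shape describes the input; the reduction result has no dimensions.
print(r.shape)          # (6,)
print(np.ndim(result))  # 0

# Option 1: convert the result to a plain Python float.
as_float = float(result)

# Option 2: hand nanmin a base-class view, which drops the subclass
# (and its columns attribute) before the reduction runs.
plain_min = np.nanmin(np.asarray(r))
print(type(plain_min))  # <class 'numpy.float64'>
```

The second option shares memory with `r` and does not copy, but the `columns` attribute is not carried over to anything computed from the plain view.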

Overall, I think it is worth considering very carefully what exactly
you try to accomplish; if different elements along a given axis have
different meaning, I'm not sure it makes all that much sense to treat
them as a single array (e.g., np.sin might be useful for one column,
but not another). Even if pandas is slower, the advantage in clarity
of what is happening might well be more important in the long run.

All the best,

Marten

p.s. nanmin is not a ufunc; you can find it in numpy/lib/nanfunctions.py
Nathan Goldbaum
2016-11-15 18:52:35 UTC
You might also want to consider writing a wrapper object that contains an
ndarray as a (possibly private) attribute and then presents different views
or interpretations of that array.

Subclassing ndarray is a pit of snakes, it's best to avoid it if you can (I
say as the author and maintainer of an ndarray subclass).
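
The wrapper approach suggested above might look roughly like the sketch below. The class name `ColumnLabelledArray` and its methods `column` and `nanmin` are hypothetical names chosen for illustration; none of them come from numpy or from this thread. The array is kept as a plain float64 ndarray, so numpy functions applied to it behave exactly as they do for any ndarray.

```python
import numpy as np


class ColumnLabelledArray:
    """Hypothetical wrapper: a plain 2-D float64 ndarray plus column names."""

    def __init__(self, data, columns):
        data = np.asarray(data, dtype=np.float64)
        if data.ndim != 2:
            raise ValueError("data must be 2-dimensional")
        columns = tuple(columns)
        if len(columns) != data.shape[1]:
            raise ValueError("need exactly one name per column")
        self._data = data
        self.columns = columns

    @property
    def values(self):
        # The underlying plain ndarray, e.g. for numpy or Cython code.
        return self._data

    def column(self, name):
        # A 1-D view of the column with the given name.
        return self._data[:, self.columns.index(name)]

    def nanmin(self):
        # Per-column minimum ignoring NaNs, keyed by column name.
        return dict(zip(self.columns, np.nanmin(self._data, axis=0)))


table = ColumnLabelledArray([[1.0, np.nan], [0.0, 2.0]], ["a", "b"])

overall = np.nanmin(table.values)  # plain numpy.float64, as for any ndarray
per_column = table.nanmin()        # column names stay attached to the results
```

The trade-off is that the wrapper cannot be passed directly to functions expecting an array; callers go through `table.values`, which makes the point where the column names are dropped explicit.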

_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Stuart Reynolds
2016-11-15 19:01:45 UTC
Doh! Thanks for that.
