[Numpy-discussion] np.sign and object comparisons
Jaime Fernández del Río
2015-08-31 04:09:20 UTC
There's been some work going on recently on Py2 vs Py3 object comparisons.
If you want all the background, see gh-6265
<https://github.com/numpy/numpy/issues/6265> and follow the links there.

There is a half-baked PR in the works, gh-6269
<https://github.com/numpy/numpy/pull/6269>, that tries to unify behavior
and fix some bugs along the way, by replacing all 2.x uses of
PyObject_Compare with several calls to PyObject_RichCompareBool, which is
available on 2.6, the oldest Python version we support.

The poster child for this example is computing np.sign on an object array
that has an np.nan entry. 2.x will just make up an answer for the comparison:
>>> cmp(np.nan, 0)
-1
>>> np.nan < 0
False
>>> np.nan > 0
False
>>> np.nan == 0
False

The current 3.x is buggy, so the fact that it produces the same made up
>>> np.sign(np.array([np.nan], 'O'))
array([-1], dtype=object)

Looking at the code, it seems that the original intention was for the
answer to be `0`, which is equally made up but perhaps makes a little more
sense.
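
To make the point above concrete, here is a minimal sketch in pure Python of a
cmp-style result built only from the rich comparisons < and >. It is not the C
code from the PR; the helper name three_way is made up for illustration. With
this construction a nan operand lands on 0, because both comparisons are False.

```python
def three_way(a, b):
    """Emulate a cmp-style result using only the rich comparisons > and <."""
    # bool is a subclass of int, so True - False == 1 and False - True == -1.
    return (a > b) - (a < b)


nan = float("nan")
print(three_way(3, 0))    # 1
print(three_way(-3, 0))   # -1
print(three_way(0, 0))    # 0
print(three_way(nan, 0))  # 0, since nan > 0 and nan < 0 are both False
```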

There are three ways of fixing this that I see:

1. Arbitrarily choose a value to set the return to. This is equivalent
to choosing a default return for `cmp` for comparisons. This preserves
behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans returns
nan for those values, return e.g. None for these cases. This is my
preferred option.
3. Raise an error, along the lines of the TypeError: unorderable types
that 3.x produces for some comparisons.
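
The three options above can be sketched in pure Python as one function with a
policy switch. This is only an illustration under stated assumptions: the name
object_sign and the parameters unordered and default are hypothetical and are
not part of NumPy or of the PR.

```python
def object_sign(x, unordered="raise", default=0):
    """Sign of x via rich comparisons, with a policy for unordered values.

    `unordered` and `default` are hypothetical parameters for illustration.
    """
    if x > 0:
        return 1
    if x < 0:
        return -1
    if x == 0:
        return 0
    # None of >, <, == held, for example because x is nan.
    if unordered == "default":
        return default        # option 1: arbitrary fixed value
    if unordered == "none":
        return None           # option 2: a "no answer" marker
    # option 3: refuse to guess
    raise TypeError("unorderable value for sign: %r" % (x,))


nan = float("nan")
print(object_sign(nan, unordered="default"))  # 0
print(object_sign(nan, unordered="none"))     # None
```

Calling object_sign(nan) with the default policy raises TypeError, which is the
behavior option 3 describes.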

Thoughts anyone?

Jaime
--
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his plans
for world domination.
Sebastian Berg
2015-08-31 08:23:15 UTC
Post by Jaime Fernández del Río
There's been some work going on recently on Py2 vs Py3 object
comparisons. If you want all the background, see gh-6265 and follow
the links there.
There is a half baked PR in the works, gh-6269, that tries to unify
behavior and fix some bugs along the way, by replacing all 2.x uses of
PyObject_Compare with several calls to PyObject_RichCompareBool, which
is available on 2.6, the oldest Python version we support.
The poster child for this example is computing np.sign on an object
array that has an np.nan entry. 2.x will just make up an answer for
>>> cmp(np.nan, 0)
-1
>>> np.nan < 0
False
>>> np.nan > 0
False
>>> np.nan == 0
False
The current 3.x is buggy, so the fact that it produces the same made
>>> np.sign(np.array([np.nan], 'O'))
array([-1], dtype=object)
Looking at the code, it seems that the original intention was for the
answer to be `0`, which is equally made up but perhaps makes a little
more sense.
1. Arbitrarily choose a value to set the return to. This is
equivalent to choosing a default return for `cmp` for
comparisons. This preserves behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans
returns nan for those values, return e.g. None for these
cases. This is my preferred option.
That would be my gut feeling as well. Returning `NaN` could also make
sense, but I guess we run into problems since we do not know the input
type. So `None` seems like the only option here I can think of right
now.

- Sebastian
Post by Jaime Fernández del Río
1. Raise an error, along the lines of the TypeError: unorderable
types that 3.x produces for some comparisons.
Thoughts anyone?
Jaime
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Stephan Hoyer
2015-08-31 17:23:10 UTC
Post by Sebastian Berg
That would be my gut feeling as well. Returning `NaN` could also make
sense, but I guess we run into problems since we do not know the input
type. So `None` seems like the only option here I can think of right
now.
My inclination is that returning NaN would be the appropriate choice. It's
certainly consistent with the behavior for float dtypes -- my expectation
for object dtype behavior is that it works exactly like applying the
np.sign ufunc to each element of the array individually.
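
A minimal sketch of that expectation, in pure Python so it does not depend on
any particular NumPy version: apply a scalar sign to each element on its own
and let a float nan pass through unchanged. The helper names scalar_sign and
elementwise_sign are made up for illustration, and the results are plain
Python ints rather than NumPy scalars.

```python
import math


def scalar_sign(x):
    """Sign of one number; a float nan is returned as nan."""
    if isinstance(x, float) and math.isnan(x):
        return x
    return (x > 0) - (x < 0)


def elementwise_sign(values):
    """Apply scalar_sign to each element individually."""
    return [scalar_sign(v) for v in values]


result = elementwise_sign([3, -2.5, 0, float("nan")])
print(result)  # [1, -1, 0, nan]
```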

On the other hand, I suppose there are other ways in which an object can
fail all those comparisons (e.g., NaT?), so I suppose we could return None.
But it would still be a weird outcome for the most common case. Ideally, I
suppose, np.sign would return an array with int-NA dtype, but that's a
whole different can of worms...

Stephan
Antoine Pitrou
2015-08-31 17:31:06 UTC
On Mon, 31 Aug 2015 10:23:10 -0700
Post by Stephan Hoyer
My inclination is that returning NaN would be the appropriate choice. It's
certainly consistent with the behavior for float dtypes -- my expectation
for object dtype behavior is that it works exactly like applying the
np.sign ufunc to each element of the array individually.
On the other hand, I suppose there are other ways in which an object can
fail all those comparisons (e.g., NaT?), so I suppose we could return None.
>>> np.sign(np.timedelta64('nat'))
numpy.timedelta64(-1)

... probably because NaT is -2**63 under the hood. But in this case
returning NaT would sound better.
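
The suspected mechanism can be sketched in pure Python. The sketch assumes, as
guessed above, that NaT is stored as the raw integer -2**63; the names NAT,
naive_sign and nat_aware_sign are hypothetical and do not correspond to NumPy
internals.

```python
NAT = -2**63  # assumed sentinel: the smallest int64 value


def naive_sign(raw):
    """Sign of the raw integer, unaware of the sentinel."""
    return (raw > 0) - (raw < 0)


def nat_aware_sign(raw):
    """Sign that propagates the sentinel instead of treating it as negative."""
    if raw == NAT:
        return NAT
    return naive_sign(raw)


print(naive_sign(NAT))             # -1, the surprising result
print(nat_aware_sign(NAT) == NAT)  # True
print(nat_aware_sign(42))          # 1
```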

Regards

Antoine.
Nathaniel Smith
2015-09-01 06:45:26 UTC
Post by Antoine Pitrou
On Mon, 31 Aug 2015 10:23:10 -0700
Post by Stephan Hoyer
My inclination is that returning NaN would be the appropriate choice. It's
certainly consistent with the behavior for float dtypes -- my expectation
for object dtype behavior is that it works exactly like applying the
np.sign ufunc to each element of the array individually.
On the other hand, I suppose there are other ways in which an object can
fail all those comparisons (e.g., NaT?), so I suppose we could return None.
>>> np.sign(np.timedelta64('nat'))
numpy.timedelta64(-1)
... probably because NaT is -2**63 under the hood. But in this case
returning NaT would sound better.
I think this is going through the np.sign timedelta64 loop, and thus
is an unrelated issue? It does look like a bug though.

-n
--
Nathaniel J. Smith -- http://vorpus.org
Sebastian Berg
2015-08-31 18:06:12 UTC
On Mon, Aug 31, 2015 at 1:23 AM, Sebastian Berg
That would be my gut feeling as well. Returning `NaN` could also make
sense, but I guess we run into problems since we do not know the input
type. So `None` seems like the only option here I can think of right
now.
My inclination is that returning NaN would be the appropriate choice.
It's certainly consistent with the behavior for float dtypes -- my
expectation for object dtype behavior is that it works exactly like
applying the np.sign ufunc to each element of the array individually.
I was wondering a bit if returning the original object could make sense.
It would work for NaN (and also decimal versions of NaN, etc.). But I am
not sure in general.
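
A rough sketch of the "return the original object" idea, assuming that a
NaN-like object can be recognized by not being equal to itself. That detection
rule and the name sign_or_self are assumptions made for this illustration, not
something NumPy does.

```python
from decimal import Decimal


def sign_or_self(x):
    """Return x itself when it is NaN-like, otherwise its sign as an int."""
    # Assumption: a NaN-like value is one that is not equal to itself.
    if x != x:
        return x
    return (x > 0) - (x < 0)


float_nan = float("nan")
decimal_nan = Decimal("NaN")
print(sign_or_self(float_nan) is float_nan)      # True
print(sign_or_self(decimal_nan) is decimal_nan)  # True
print(sign_or_self(Decimal("-2")))               # -1
```

The self-inequality check runs before any ordering comparison on purpose: as
far as I know, ordering a Decimal NaN with < or > raises InvalidOperation under
the default decimal context, so the order of the checks matters.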

- Sebastian
On the other hand, I suppose there are other ways in which an object
can fail all those comparisons (e.g., NaT?), so I suppose we could
return None. But it would still be a weird outcome for the most common
case. Ideally, I suppose, np.sign would return an array with int-NA
dtype, but that's a whole different can of worms...
Stephan
Nathaniel Smith
2015-09-01 06:49:50 UTC
On Sun, Aug 30, 2015 at 9:09 PM, Jaime Fernández del Río <
Post by Jaime Fernández del Río
1. Arbitrarily choose a value to set the return to. This is equivalent
to choosing a default return for `cmp` for comparisons. This preserves
behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans
returns nan for those values, return e.g. None for these cases. This
is my preferred option.
3. Raise an error, along the lines of the TypeError: unorderable types
that 3.x produces for some comparisons.
Having read the other replies so far -- given that no-one seems to have
any clear intuition or use cases, I guess I find option 3 somewhat
tempting... it keeps our options open until someone who actually cares
comes along with a use case to hone our intuition on, and is very safe in
the meantime.

(This was noticed in the course of routine code cleanups, right, not an
external bug report? For all we know right now, no actual user has ever
even tried to apply np.sign to an object array?)

-n
--
Nathaniel J. Smith -- http://vorpus.org
Jaime Fernández del Río
2015-09-02 04:26:53 UTC
Post by Nathaniel Smith
On Sun, Aug 30, 2015 at 9:09 PM, Jaime Fernández del Río <
Post by Jaime Fernández del Río
1. Arbitrarily choose a value to set the return to. This is
equivalent to choosing a default return for `cmp` for comparisons. This
preserves behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans
returns nan for those values, return e.g. None for these cases. This
is my preferred option.
3. Raise an error, along the lines of the TypeError: unorderable types
that 3.x produces for some comparisons.
Having read the other replies so far -- given that no-one seems to have
any clear intuition or use cases, I guess I find option 3 somewhat
tempting... it keeps our options open until someone who actually cares
comes along with a use case to hone our intuition on, and is very safe in
the meantime.
(This was noticed in the course of routine code cleanups, right, not an
external bug report? For all we know right now, no actual user has ever
even tried to apply np.sign to an object array?)
We do have a user that tried np.sign on an object array, and discovered
that our Py3K object comparison was crap:
https://github.com/numpy/numpy/issues/6229

No report of anyone trying np.sign on anything other than numbers that we
know of, though.

Given the lack of agreement, I'm starting to think I am going to agree
with you that raising an error may be the better option, because it's the
least likely to break people's code if we later find we need to change it.

Jaime
Allan Haldane
2015-09-02 21:24:47 UTC
Post by Jaime Fernández del Río
1. Arbitrarily choose a value to set the return to. This is equivalent
to choosing a default return for `cmp` for comparisons. This
preserves behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans returns
nan for those values, return e.g. None for these cases. This is my
preferred option.
3. Raise an error, along the lines of the TypeError: unorderable types
that 3.x produces for some comparisons.
I think np.sign on nan object arrays should raise the error

AttributeError: 'float' object has no attribute 'sign'


If I've understood correctly, currently object arrays work like this:

If a ufunc has an equivalent pure-python func (eg, PyNumber_Add for
np.add, PyNumber_Absolute for np.abs, > for np.greater) then numpy
calls that for objects. Otherwise, if the object defines a method with
the same name as the ufunc, numpy calls that method. For example, arccos
is a ufunc that has no pure python equivalent, so you get the following
behavior
>>> a = np.array([-1], dtype='O')
>>> np.abs(a)
array([1], dtype=object)
>>> np.arccos(a)
AttributeError: 'int' object has no attribute 'arccos'
>>> class MyClass:
...     def arccos(self):
...         return 1
>>> b = np.array([MyClass()], dtype='O')
>>> np.arccos(b)
array([1], dtype=object)
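
The dispatch rule described above can be imitated in pure Python. This is a
sketch of the described behavior, not NumPy's actual implementation; the table
PYTHON_EQUIVALENTS and the function call_on_object are hypothetical names.

```python
import operator

# Hypothetical table of ufunc names that have a pure-Python equivalent.
PYTHON_EQUIVALENTS = {
    "absolute": operator.abs,
    "negative": operator.neg,
}


def call_on_object(ufunc_name, obj):
    """Use the Python equivalent if there is one, else a same-named method."""
    func = PYTHON_EQUIVALENTS.get(ufunc_name)
    if func is not None:
        return func(obj)
    # No equivalent: look up a method named after the ufunc on the object.
    return getattr(obj, ufunc_name)()


class MyClass:
    def arccos(self):
        return 1


print(call_on_object("absolute", -1))       # 1
print(call_on_object("arccos", MyClass()))  # 1
try:
    call_on_object("arccos", -1)
except AttributeError as exc:
    print(exc)  # the int has no 'arccos' attribute
```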

Now, most comparison operators (e.g., greater) are treated a little
specially in loops.c. For some reason, sign is treated just like the
other comparison operators, even though technically there is no
pure-python equivalent to sign.

I think that because there is no pure-python 'sign', numpy should
attempt to call obj.sign, and in most cases this should fail with the
error above. See also
http://stackoverflow.com/questions/1986152/why-doesnt-python-have-a-sign-function
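
What the proposal would mean for users can be sketched in pure Python: the
object must bring its own sign method, and a plain float fails with the
AttributeError quoted earlier. The names sign_via_method and Signed are made
up for this illustration.

```python
def sign_via_method(obj):
    """Call the object's own `sign` method, as the proposal suggests."""
    return obj.sign()


class Signed:
    """Hypothetical user type that knows how to report its own sign."""

    def __init__(self, value):
        self.value = value

    def sign(self):
        return (self.value > 0) - (self.value < 0)


print(sign_via_method(Signed(-4)))  # -1
try:
    sign_via_method(float("nan"))
except AttributeError as exc:
    print(exc)  # 'float' object has no attribute 'sign'
```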

I think the fix for sign is that the 'sign' ufunc in generate_umath.py
should look more like the arccos one, and we should get rid of
OBJECT_sign in loops.c. I'm not 100% sure about this since I haven't
followed all of how generate_umath.py works yet.

-------

By the way, based on some comments I saw somewhere (apologies, I forget
who by!) I wrote up a vision for how ufuncs could work for objects,
here: https://gist.github.com/ahaldane/c3f9bcf1f62d898be7c7
I'm a little unsure whether the ideas there are good ones, since they might
be made obsolete by the big dtype subclassing improvements being discussed
in the numpy roadmap thread.

Allan
