Discussion:
extract elements of an array that are contained in another array?
Ning Sean
2009-06-04 00:29:31 UTC
Hi, I want to extract elements of an array (say, a) that are contained in
another array (say, b). That is, if a=array([1,1,2,3,3,4]), b=array([1,4]),
then I want array([1,1,4]).

I did the following but the speed is very slow (maybe because a is very
long):

c=array([])
for x in b:
    c=append(c,a[a==x])

any way to speed it up?

Thanks!
-Ning
j***@gmail.com
2009-06-04 00:45:10 UTC
Post by Ning Sean
Hi, I want to extract elements of an array (say, a) that are contained in
another array (say, b). That is, if a=array([1,1,2,3,3,4]), b=array([1,4]),
then I want array([1,1,4]).
I did the following but the speed is very slow (maybe because a is very
c=array([])
   c=append(c,a[a==x])
any way to speed it up?
Thanks!
-Ning
It's waiting in Trac for inclusion in numpy
http://projects.scipy.org/numpy/ticket/1036
The current version only handles arrays with unique elements.

You can copy the ticket attachment, the version there is very fast.

Josef
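For readers on a current NumPy (an assumption, since the thread predates it): a membership test of this kind is available as np.isin, which returns a boolean mask that can index a directly. A minimal sketch using the arrays from the original question:

```python
import numpy as np

a = np.array([1, 1, 2, 3, 3, 4])
b = np.array([1, 4])

# Boolean mask over a: True where the element also occurs in b.
mask = np.isin(a, b)

# Indexing with the mask keeps repeats and the original order of a.
result = a[mask]
print(result)  # [1 1 4]
```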
Ning Sean
2009-06-04 04:32:31 UTC
Thanks! Tried it and it is about twice as fast as my approach.

-Ning
_______________________________________________
Numpy-discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Alan G Isaac
2009-06-04 12:23:43 UTC
a[(a==b[:,None]).sum(axis=0,dtype=bool)]

hth,
Alan Isaac
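The one-liner relies on broadcasting: b[:, None] has shape (len(b), 1), so comparing it with a yields a len(b)-by-len(a) table of booleans, which is then collapsed along the b axis. A small sketch of the intermediate steps (the variable names are only for illustration, and .any(axis=0) is used here in place of the boolean sum):

```python
import numpy as np

a = np.array([1, 1, 2, 3, 3, 4])
b = np.array([1, 4])

# Row i answers "which elements of a equal b[i]?"; shape is (len(b), len(a)).
table = a == b[:, None]

# An element of a is kept if any row marks it.
mask = table.any(axis=0)

out = a[mask]
print(table.shape)  # (2, 6)
print(out)          # [1 1 4]
```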
j***@gmail.com
2009-06-04 12:35:05 UTC
Post by Alan G Isaac
a[(a==b[:,None]).sum(axis=0,dtype=bool)]
This is my preferred way when b is small and has unique elements.
If the elements in b are not unique, then b can be replaced by np.unique(b).
If b is large, this creates a huge intermediate array.

The advantage of the new setmember1d_nu is that it handles large b
very efficiently. My try on it was more than 10 times slower than the
proposed solution for larger arrays.

Josef
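To illustrate why a sort-based approach scales better for large b, here is a minimal sketch that avoids the len(a)-by-len(b) intermediate by sorting b once and looking up each element of a with a binary search. This is not the code attached to the ticket; member_mask is a hypothetical name, and 1-D inputs are assumed.

```python
import numpy as np

def member_mask(a, b):
    """Boolean mask over a: True where the element also occurs in b."""
    a = np.asarray(a)
    b_sorted = np.unique(b)                # sorted, repeats removed
    if b_sorted.size == 0:
        return np.zeros(a.shape, dtype=bool)
    idx = np.searchsorted(b_sorted, a)     # insertion position of each a[i]
    idx[idx == b_sorted.size] = 0          # clamp positions past the end
    # a[i] is a member exactly when the value at its position equals it.
    return b_sorted[idx] == a

a = np.array([1, 1, 2, 3, 3, 4])
print(a[member_mask(a, [4, 1, 4])])  # [1 1 4]
```

Memory use grows with len(a) + len(b) rather than with their product.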
Alan G Isaac
2009-06-04 14:13:19 UTC
Post by j***@gmail.com
Post by Alan G Isaac
a[(a==b[:,None]).sum(axis=0,dtype=bool)]
If b is large this creates a huge intermediate array
True enough, but one could then use fromiter:
setb = set(b)
itr = (ai for ai in a if ai in setb)
out = np.fromiter(itr, dtype=a.dtype)

I suspect (?) that b would have to be pretty
big relative to a for the repeated testing
to be more costly than sorting a.

Or if a stable order is not important (I don't
recall if the OP specified), one could just
np.intersect1d(a, np.unique(b))

On a different note, I think a name change
is needed for your function. (Compare
intersect1d_nu to see the potential
confusion. And btw, what is the use case
for intersect1d, which gives neither a
set intersection nor a multiset intersection?)

Cheers,
Alan Isaac
j***@gmail.com
2009-06-04 14:50:18 UTC
Post by Alan G Isaac
Post by j***@gmail.com
Post by Alan G Isaac
a[(a==b[:,None]).sum(axis=0,dtype=bool)]
If b is large this creates a huge intermediate array
setb = set(b)
itr = (ai for ai in a if ai in setb)
out = np.fromiter(itr, dtype=a.dtype)
I suspect (?) that b would have to be pretty
big relative to a for the repeated testing
to be more costly than sorting a.
I didn't look at this case very closely for speed, setmember1d and
setmember1d_nu return a boolean array, that can be used for indexing,
not the actual elements.

Your iterator is in python and could be pretty slow, but I only ran
the performance script attached to the ticket and the speed
differences for different ways of doing it were pretty big for large
arrays.
Post by Alan G Isaac
Or if a stable order is not important (I don't
recall if the OP specified), one could just
np.intersect1d(a, np.unique(b))
This requires that also `a` has only unique elements.
intersect1d_nu doesn't require unique elements.
Post by Alan G Isaac
On a different note, I think a name change
is needed for your function. (Compare
intersect1d_nu to see the potential
confusion. And btw, what is the use case
for intersect1d, which gives neither a
set intersection nor a multiset intersection?)
intersect1d gives set intersection if both arrays have only unique
elements (i.e. are sets).
I thought the naming is pretty clear:

intersect1d(a,b): set intersection, if a and b have unique elements
intersect1d_nu(a,b): set intersection, if a and b have non-unique elements
setmember1d(a,b): boolean index array for a of the set intersection, if a
and b have unique elements
setmember1d_nu(a,b): boolean index array for a of the set intersection, if
a and b have non-unique elements

The new docs http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/
are a bit clearer.

However, I haven't used either of these functions much, and none of
them are *my* functions.
Of the arraysetops functions, I use unique1d most (because of the
return index).
I just keep track of these functions because of the use for
categorical and dummy variables.

Josef
Alan G Isaac
2009-06-04 15:12:17 UTC
Post by j***@gmail.com
Post by Alan G Isaac
Or if a stable order is not important (I don't
recall if the OP specified), one could just
np.intersect1d(a, np.unique(b))
This requires that also `a` has only unique elements.
intersect1d_nu doesn't require unique elements.
>>> a
array([1, 1, 2, 3, 3, 4])
>>> b
array([1, 4])
>>> np.intersect1d(a, np.unique(b))
array([1, 1, 3, 4])

(And thus my question about intersect1d...)

Cheers,
Alan
j***@gmail.com
2009-06-04 15:19:19 UTC
Post by Alan G Isaac
Post by j***@gmail.com
Post by Alan G Isaac
Or if a stable order is not important (I don't
recall if the OP specified), one could just
np.intersect1d(a, np.unique(b))
This requires that also `a` has only unique elements.
intersect1d_nu doesn't require unique elements.
>>> a
array([1, 1, 2, 3, 3, 4])
>>> b
array([1, 4])
>>> np.intersect1d(a, np.unique(b))
array([1, 1, 3, 4])
(And thus my question about intersect1d...)
Yes, I know, and in my current numpy help file this is the only
example there is, which is very misleading for its intended use.
>>> a = np.array([1, 1, 2, 3, 3, 4])
>>> b = np.array([1, 4, 5])
>>> np.intersect1d(np.unique(a), np.unique(b))
array([1, 4])
>>> np.intersect1d_nu(a,b)
array([1, 4])

Josef
Alan G Isaac
2009-06-04 15:19:49 UTC
Post by j***@gmail.com
intersect1d gives set intersection if both arrays have
only unique elements (i.e. are sets). I thought the
intersect1d(a,b) set intersection if a and b with unique elements
intersect1d_nu(a,b) set intersection if a and b with non-unique elements
setmember1d(a,b) boolean index array for a of set intersection if a
and b with unique elements
setmember1d_nu(a,b) boolean index array for a of set intersection if
a and b with non-unique elements
>>> a
array([1, 1, 2, 3, 3, 4])
>>> b
array([1, 4, 4, 4])
>>> np.intersect1d_nu(a,b)
array([1, 4])

That is, intersect1d_nu is the actual set intersection
function. (I.e., intersect1d and intersect1d_nu would most
naturally have swapped names.) That is why the appended _nu
will not communicate what was intended. (I.e.,
setmember1d_nu will not be a match for intersect1d_nu.)

Cheers,
Alan Isaac
Robert Cimrman
2009-06-04 15:27:11 UTC
Post by Alan G Isaac
Post by j***@gmail.com
intersect1d gives set intersection if both arrays have
only unique elements (i.e. are sets). I thought the
intersect1d(a,b) set intersection if a and b with unique elements
intersect1d_nu(a,b) set intersection if a and b with non-unique elements
setmember1d(a,b) boolean index array for a of set intersection if a
and b with unique elements
setmember1d_nu(a,b) boolean index array for a of set intersection if
a and b with non-unique elements
>>> a
array([1, 1, 2, 3, 3, 4])
>>> b
array([1, 4, 4, 4])
>>> np.intersect1d_nu(a,b)
array([1, 4])
That is, intersect1d_nu is the actual set intersection
function. (I.e., intersect1d and intersect1d_nu would most
naturally have swapped names.) That is why the appended _nu
will not communicate what was intended. (I.e.,
setmember1d_nu will not be a match for intersect1d_nu.)
The naming should express this: intersect1d expects its arguments are
sets, intersect1d_nu does not. A set has unique elements by definition.

cheers,
r.
j***@gmail.com
2009-06-04 15:29:56 UTC
Post by Alan G Isaac
Post by j***@gmail.com
intersect1d gives set intersection if both arrays have
only unique elements (i.e. are sets).  I thought the
intersect1d(a,b)   set intersection if a and b with unique elements
intersect1d_nu(a,b)   set intersection if a and b with non-unique elements
setmember1d(a,b)  boolean index array for a of set intersection if a
and b with unique elements
setmember1d_nu(a,b)  boolean index array for a of set intersection if
a and b with non-unique elements
>>> a
array([1, 1, 2, 3, 3, 4])
>>> b
array([1, 4, 4, 4])
>>> np.intersect1d_nu(a,b)
array([1, 4])
That is, intersect1d_nu is the actual set intersection
function.  (I.e., intersect1d and intersect1d_nu would most
naturally have swapped names.)  That is why the appended _nu
will not communicate what was intended.  (I.e.,
setmember1d_nu will not be a match for intersect1d_nu.)
intersect1d is the intersection between sets (which are stored as
arrays), just like in the mathematical definition the two sets only
have unique elements

intersect1d_nu is the intersection between two arrays which can have
repeated elements. The result is a set, i.e. unique elements, stored
as an array

same for setmember1d, setmember1d_nu

so postfix `_nu` only means that this function also works if the two
arrays are not really sets, i.e. are not required to have unique
elements to make sense.


intersect1d should throw a domain error if you give it arrays with
non-unique elements, which is not done for speed reasons
Alan G Isaac
2009-06-04 16:32:33 UTC
Post by j***@gmail.com
intersect1d is the intersection between sets (which are stored as
arrays), just like in the mathematical definition the two sets only
have unique elements
Hmmm. OK, I see you and Robert believe this.
But it does not match the documentation.
But indeed, I see that the documentation is incorrect.
E.g.,
>>> np.intersect1d([1,1,2,3,3,4],[1,4])
array([1, 1, 3, 4])

Is this a bug or a documentation bug?
Post by j***@gmail.com
intersect1d_nu is the intersection between two arrays which can have
repeated elements. The result is a set, i.e. unique elements, stored
as an array
same for setmember1d, setmember1d_nu
I cannot understand this.
Following your proposed reasoning,
I expect a[setmember1d_nu(a,b)]
to return the same as
intersect1d_nu(a, b).
It does not.
Post by j***@gmail.com
so postfix `_nu` only means that this function also works
if the two arrays are not really sets
But that just begs the question: what does 'works' mean?
See my previous comment (above).
Post by j***@gmail.com
intersect1d should throw a domain error if you give it arrays with
non-unique elements, which is not done for speed reasons
*If* intersect1d behaved *exactly* as documented,
the example
intersect1d(a, np.unique(b))
shows that the documented behavior can be useful.
And indeed, this would be the match to
a[setmember1d_nu(a,b)]

Cheers,
Alan Isaac
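The relationship in question can be checked with functions that exist in current NumPy, assuming np.isin plays the role of the proposed setmember1d_nu and that np.intersect1d returns the unique common values by default (both are assumptions relative to the 2009 versions discussed here):

```python
import numpy as np

a = np.array([1, 1, 2, 3, 3, 4])
b = np.array([1, 4, 4, 4])

members = a[np.isin(a, b)]       # keeps repeats from a
common = np.intersect1d(a, b)    # sorted, unique common values

print(members)             # [1 1 4]
print(common)              # [1 4]
print(np.unique(members))  # [1 4]
```

The two results agree only after the repeats in the indexed array are removed.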
j***@gmail.com
2009-06-04 17:27:25 UTC
Post by Alan G Isaac
intersect1d  is the intersection between sets (which are stored as
arrays), just like in the mathematical definition the two sets only
have unique elements
Hmmm. OK, I see you and Robert believe this.
But it does not match the documentation.
But indeed, I see that the documentation is incorrect.
E.g.,
np.intersect1d([1,1,2,3,3,4],[1,4])
array([1, 1, 3, 4])
Is this a bug or a documentation bug?
intersect1d_nu is the intersection between two arrays which can have
repeated elements. The result is a set, i.e. unique elements, stored
as an array
same for setmember1d, setmember1d_nu
I cannot understand this.
Following your proposed reasoning,
I expect a[setmember1d_nu(a,b)]
to return the same as
intersect1d_nu(a, b).
It does not.
I don't have setmember1d_nu available right now, but from my reading
we should have

intersect1d_nu(a, b) == np.unique(a[setmember1d_nu(a,b)])
Post by Alan G Isaac
so  postfix `_nu` only means that this function also works
if the two arrays are not really sets
But that just begs the question: what does 'works' mean?
See my previous comment (above).
intersect1d should throw a domain error if you give it arrays with
non-unique elements, which is not done for speed reasons
*If* intersect1d behaved *exactly* as documented,
the example
intersect1d(a, np.unique(b))
shows that the documented behavior can be useful.
And indeed, this would be the match to
a[setmember1d_nu(a,b)]
I don't know if anyone looked at the behavior for "unintended" usage

intersect1d rearranges, sorts
>>> np.intersect1d([4,1,3,3],[3,4])
array([3, 3, 4])

but it gives you the correct multiplicity
>>> np.intersect1d([4,4,4,1,3,3],np.unique([3,4,3,0]))
array([3, 3, 4, 4, 4])

so I guess, we have
np.intersect1d([4,4,4,1,3,3], np.unique([3,4,3,0])) ==
np.sort(a[setmember1d_nu(a,b)])

for the example from the help file I don't find any meaningful interpretation
>>> np.intersect1d([1,3,3],[3,1,1])
array([1, 1, 3, 3])


wrong answer
>>> np.setmember1d([4,1,1,3,3],[3,4])
array([ True, True, False, True, True], dtype=bool)

Note: there are two versions of the docs for np.intersect1d, the
currently published docs which describe the actual behavior (for the
non-unique case), and the new docs on the doc editor
http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/
that describe the "intended" usage of the functions, which also
corresponds closer to the original source docstring
(http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/?revision=-227
). that's my interpretation

If you think that functions make sense also for the "unintended"
usage, then you could add an example to the new docs.

Josef
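The intersect1d outputs above are what one would get from a concatenate, sort, and compare-neighbours strategy, which is a correct intersection only when neither input has repeats. The following sketch is an assumption about the approach, not a copy of the NumPy source; neighbour_intersect is a hypothetical name.

```python
import numpy as np

def neighbour_intersect(ar1, ar2):
    """Return values that sit next to an equal value after a joint sort."""
    aux = np.concatenate((ar1, ar2))
    aux.sort()
    # A value is reported once for every adjacent equal pair, so repeats
    # within a single input are also counted as "common".
    return aux[:-1][aux[1:] == aux[:-1]]

print(neighbour_intersect([4, 1, 3, 3], [3, 4]))        # [3 3 4]
print(neighbour_intersect([1, 3, 3], [3, 1, 1]))        # [1 1 3 3]
print(neighbour_intersect([1, 1, 2, 3, 3, 4], [1, 4]))  # [1 1 3 4]
```

In the last call the value 3 appears in the result only because it is repeated in the first array.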
Alan G Isaac
2009-06-04 18:58:32 UTC
Post by j***@gmail.com
Note: there are two versions of the docs for np.intersect1d, the
currently published docs which describe the actual behavior (for the
non-unique case), and the new docs on the doc editor
http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/
that describe the "intended" usage of the functions, which also
corresponds closer to the original source docstring
(http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/?revision=-227
). that's my interpretation
Again, the distributed docs do *not* describe the actual
behavior for the non-unique case. E.g.,
>>> np.intersect1d([1,1,2,3,3,4], [1,4])
array([1, 1, 3, 4])

Might this be a better example of
failure than the one in the doc editor?

However the doc editor version states that the function
fails for the non-unique case, so it seems there was a
documentation bug that is in the process of being fixed.

Thanks,
Alan
j***@gmail.com
2009-06-04 20:14:59 UTC
Post by Alan G Isaac
Post by j***@gmail.com
Note: there are two versions of the docs for np.intersect1d, the
currently published docs which describe the actual behavior (for the
non-unique case), and the new docs on the doc editor
http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/
that describe the "intended" usage of the functions, which also
corresponds closer to the original source docstring
(http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/?revision=-227
). that's my interpretation
Again, the distributed docs do *not* describe the actual
behavior for the non-unique case.  E.g.,
>>> np.intersect1d([1,1,2,3,3,4], [1,4])
array([1, 1, 3, 4])
Might this is a better example of
failure than the one in the doc editor?
Thanks, that's a very clear example of a wrong answer,
and it removes the question whether the function makes any sense for
the non-unique case.
I changed the example in the doc editor to this one.

It will hopefully be merged with the source at the next update.

Josef
Post by Alan G Isaac
However the doc editor version states that the function
fails for the non-unique case, so it seems there was a
documentation bug that is in the process of being fixed.
Yes
Robert Cimrman
2009-06-05 05:22:35 UTC
Post by j***@gmail.com
Post by Alan G Isaac
Post by j***@gmail.com
Note: there are two versions of the docs for np.intersect1d, the
currently published docs which describe the actual behavior (for the
non-unique case), and the new docs on the doc editor
http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/
that describe the "intended" usage of the functions, which also
corresponds closer to the original source docstring
(http://docs.scipy.org/numpy/docs/numpy.lib.arraysetops.intersect1d/?revision=-227
). that's my interpretation
Again, the distributed docs do *not* describe the actual
behavior for the non-unique case. E.g.,
>>> np.intersect1d([1,1,2,3,3,4], [1,4])
array([1, 1, 3, 4])
Might this is a better example of
failure than the one in the doc editor?
Thanks, that's a very clear example of a wrong answer,
and it removes the question whether the function makes any sense for
the non-unique case.
I changed the example in the doc editor to this one.
It will hopefully merged with the source at the next update.
Thank you Josef!

r.
Kim Hansen
2009-06-04 20:27:11 UTC
Concerning the name setmember1d_nu, I personally find it quite verbose
and not the name I would expect as a non-insider coming to numpy and
not knowing all the names of the more special hidden-away functions
and not being a python-wiz either.

I think ain(a,b) would be the name I had expected as an array
equivalent of "a in b" (just as arange is the array version of range)
or I would have anticipated that an ndarray object would have an
"in(b)" or "in_iterable(b)" method, such that you could do a.in(b)
which would return a boolean array of the same shape as a with
elements true if the equivalent a members were members in the iterable
b.

When I had a problem where I needed this function, I could not find
anything near that, and after looking around and also asking here I
got some hints to use the ....1d functions, which gave me the idea to
implement the few-line, very simple proposal for "a in b", which is
now the proposal under review as the new function setmember1d_nu(a,b).
Whereas I see this function name is in line with the existing
functions, I really think the names are non-intuitive. I would
therefore propose that it was also aliased to a more intuitive name
such as ain(a,b) or perhaps better a.in(b)

Again, I am probably missing some important points here as a
non-experienced Python programmer and numpy user, I am just trying to
give some input from the beginner's point of view, if that can be of
any help.

Thank you,

Kim
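For what it is worth, current NumPy offers something close to this request in np.isin (an addition that postdates this thread): it takes an array of any shape plus an array-like of test values and returns a boolean array shaped like the first argument. A short sketch:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# The mask has the same shape as a, whatever its number of dimensions.
mask = np.isin(a, [1, 4])

print(mask.shape == a.shape)  # True
print(a[mask])                # [1 4]
```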
Gael Varoquaux
2009-06-04 20:30:19 UTC
Post by Kim Hansen
"in(b)" or "in_iterable(b)" method, such that you could do a.in(b)
which would return a boolean array of the same shape as a with
elements true if the equivalent a members were members in the iterable
b.
That would really be what I would be looking for.

Gaël
j***@gmail.com
2009-06-04 20:43:39 UTC
On Thu, Jun 4, 2009 at 4:30 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by Kim Hansen
"in(b)" or "in_iterable(b)" method, such that you could do a.in(b)
which would return a boolean array of the same shape as a with
elements true if the equivalent a members were members in the iterable
b.
That would really by what I would be looking for.
Just using "in" might promise more than it does, eg. it works only for
one dimensional arrays, maybe "in1d". With "in", I would expect a
generic function as in python that works with many array types and
dimensions. (But I haven't checked whether it would work with a 1d
structured array or object array.)

I found arraysetops because of unique1d, but I didn't figure out what
the subpackage really does, because I was reading "arrayse-tops"
instead of "array-set-ops".

BTW, for the docs, I haven't found a counter example where
np.setdiff1d gives the wrong answer for non-unique arrays.

Josef
Gael Varoquaux
2009-06-04 20:52:00 UTC
Post by j***@gmail.com
Just using "in" might promise more than it does, eg. it works only for
one dimensional arrays, maybe "in1d". With "in",
Then 'in_1d'
Post by j***@gmail.com
I found arraysetops because of unique1d, but I didn't figure out what
the subpackage really does, because I was reading "arrayse-tops"
instead of array-set-ops"
That's why I push people to use more underscores. IMHO PEP8 lacks a push
for underscores.

Gaël
j***@gmail.com
2009-06-05 00:49:13 UTC
On Thu, Jun 4, 2009 at 4:52 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by j***@gmail.com
Just using "in" might promise more than it does, eg. it works only for
one dimensional arrays, maybe "in1d". With "in",
Then 'in_1d'
No, if the breaks in a name are obvious, I still prefer names without
underscores. I don't think `1d` or `2d` needs to be separated from the
word, "in1d"
I always remember how to spell unique1d, but I usually have to check
how to spell at_least_2d, or maybe atleast_2d or even atleast2d.

how about

def setmember1d_nu(a, b):
    ...

#aliases
set_member_1d_but_it_does_not_really_have_to_be_a_set = setmember1d_nu
in1d = setmember1d_nu

Josef
>>> [f for f in dir(np) if f[-2:]=='1d' or f[-2:]=='2d']
['atleast_1d', 'atleast_2d', 'ediff1d', 'histogram2d', 'intersect1d',
'poly1d', 'setdiff1d', 'setmember1d', 'setxor1d', 'union1d',
'unique1d']
>>> [f for f in dir(scipy.signal) if f[-2:]=='1d' or f[-2:]=='2d']
['atleast_1d', 'atleast_2d', 'convolve2d', 'correlate2d', 'cspline1d',
'cspline2d', 'medfilt2d', 'qspline1d', 'qspline2d', 'sepfir2d']
>>> [f for f in dir(scipy.stats) if f[-2:]=='1d' or f[-2:]=='2d']
[]
>>> [f for f in dir(scipy.ndimage) if f[-2:]=='1d' or f[-2:]=='2d']
['convolve1d', 'correlate1d', 'gaussian_filter1d', 'generic_filter1d',
'maximum_filter1d', 'minimum_filter1d', 'spline_filter1d',
'uniform_filter1d']
Robert Cimrman
2009-06-05 05:48:37 UTC
Post by j***@gmail.com
On Thu, Jun 4, 2009 at 4:30 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by Kim Hansen
"in(b)" or "in_iterable(b)" method, such that you could do a.in(b)
which would return a boolean array of the same shape as a with
elements true if the equivalent a members were members in the iterable
b.
That would really by what I would be looking for.
Just using "in" might promise more than it does, eg. it works only for
one dimensional arrays, maybe "in1d". With "in", I would expect a
generic function as in python that works with many array types and
dimensions. (But I haven't checked whether it would work with a 1d
structured array or object array.)
I found arraysetops because of unique1d, but I didn't figure out what
the subpackage really does, because I was reading "arrayse-tops"
instead of array-set-ops"
I am bad at choosing names, but note that numpy sub-modules usually do
not use underscores, so array_set_ops would not fit well.
Post by j***@gmail.com
BTW, for the docs, I haven't found a counter example where
np.setdiff1d gives the wrong answer for non-unique arrays.
In [4]: np.setmember1d( [1, 1, 2, 4, 2], [3, 2, 4] )
Out[4]: array([ True, False, True, True, True], dtype=bool)

r.
j***@gmail.com
2009-06-05 05:56:14 UTC
Post by Robert Cimrman
Post by j***@gmail.com
On Thu, Jun 4, 2009 at 4:30 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by Kim Hansen
"in(b)" or "in_iterable(b)" method, such that you could do a.in(b)
which would return a boolean array of the same shape as a with
elements true if the equivalent a members were members in the iterable
b.
That would really by what I would be looking for.
Just using "in" might promise more than it does, eg. it works only for
one dimensional arrays, maybe "in1d". With "in", I would expect a
generic function as in python that works with many array types and
dimensions. (But I haven't checked whether it would work with a 1d
structured array or object array.)
I found arraysetops because of unique1d, but I didn't figure out what
the subpackage really does, because I was reading "arrayse-tops"
instead of array-set-ops"
I am bad in choosing names, but note that numpy sub-modules usually do
not use underscores, so array_set_ops would not fit well.
I would have chosen something like setfun. Since this is in numpy,
it should be implied that the sets are arrays.
Post by Robert Cimrman
Post by j***@gmail.com
BTW, for the docs, I haven't found a counter example where
np.setdiff1d gives the wrong answer for non-unique arrays.
In [4]: np.setmember1d( [1, 1, 2, 4, 2], [3, 2, 4] )
Out[4]: array([ True, False,  True,  True,  True], dtype=bool)
setdiff1d is diff, not member.
Looking at the source, I think setdiff1d always works, even for
non-unique arrays.

Josef
Robert Cimrman
2009-06-05 06:04:26 UTC
Post by j***@gmail.com
Post by Robert Cimrman
Post by j***@gmail.com
On Thu, Jun 4, 2009 at 4:30 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by Kim Hansen
"in(b)" or "in_iterable(b)" method, such that you could do a.in(b)
which would return a boolean array of the same shape as a with
elements true if the equivalent a members were members in the iterable
b.
That would really by what I would be looking for.
Just using "in" might promise more than it does, eg. it works only for
one dimensional arrays, maybe "in1d". With "in", I would expect a
generic function as in python that works with many array types and
dimensions. (But I haven't checked whether it would work with a 1d
structured array or object array.)
I found arraysetops because of unique1d, but I didn't figure out what
the subpackage really does, because I was reading "arrayse-tops"
instead of array-set-ops"
I am bad in choosing names, but note that numpy sub-modules usually do
not use underscores, so array_set_ops would not fit well.
I would have chosen something like setfun. Since this is in numpy
that sets refers to arrays should be implied.
Yes, good idea. I am not sure how to proceed, if people agree (name
contest is open!) What about making an alias name setfun, and deprecate
the name arraysetops?
Post by j***@gmail.com
Post by Robert Cimrman
Post by j***@gmail.com
BTW, for the docs, I haven't found a counter example where
np.setdiff1d gives the wrong answer for non-unique arrays.
In [4]: np.setmember1d( [1, 1, 2, 4, 2], [3, 2, 4] )
Out[4]: array([ True, False, True, True, True], dtype=bool)
setdiff1d diff not member
Looking at the source, I think setdiff always works even if for
non-unique arrays.
Whoops, sorry. setdiff1d seems really to work for non-unique arrays - it
relies on the behaviour above though :) - there is always one correct
False even for repeated entries in the first array.

r.
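A quick check of the non-unique case on a current NumPy, where setdiff1d removes repeats from its inputs unless assume_unique=True is passed (treat that detail as an assumption for the 2009 release discussed here):

```python
import numpy as np

a = np.array([1, 1, 2, 3, 3, 4])
b = np.array([1, 4, 4])

# Values of a that do not occur in b, sorted and without repeats.
diff = np.setdiff1d(a, b)
print(diff)  # [2 3]
```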
Robert Cimrman
2009-06-05 05:27:16 UTC
Post by Kim Hansen
Concerning the name setmember1d_nu, I personally find it quite verbose
and not the name I would expect as a non-insider coming to numpy and
not knowing all the names of the more special hidden-away functions
and not being a python-wiz either.
To explain the naming: those names are used in matlab for functions of
similar functionality. If better names are found, I am not against.

What I particularly do not like is the _nu suffix (yes, blame me).

r.
Anne Archibald
2009-06-04 20:38:40 UTC
Post by j***@gmail.com
intersect1d should throw a domain error if you give it arrays with
non-unique elements, which is not done for speed reasons
It seems to me that this is the basic source of the problem. Perhaps
this can be addressed? I realize maintaining compatibility with the
current behaviour is necessary, so how about a multistage deprecation:

1. add a keyword argument to intersect1d "assume_unique"; if it is not
present, check for uniqueness and emit a warning if not unique
2. change the warning to an exception
Optionally:
3. change the meaning of the function to that of intersect1d_nu if the
keyword argument is not present

One could do something similar with setmember1d.

This would remove the pitfall of the 1d assumption and the wart of the
_nu names without hampering performance for people who know they have
unique arrays and are in a hurry.

Anne
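Stage 1 of the plan above could look roughly like the following wrapper. This is only a sketch under stated assumptions: intersect1d_stage1 is a hypothetical name, the wrapper delegates to the np.intersect1d of a current NumPy (which accepts assume_unique), and None stands in for "keyword not present".

```python
import warnings
import numpy as np

def intersect1d_stage1(ar1, ar2, assume_unique=None):
    """Hypothetical stage-1 wrapper: warn when uniqueness was not promised
    and the inputs turn out to contain repeats."""
    ar1 = np.asarray(ar1)
    ar2 = np.asarray(ar2)
    if assume_unique is None:
        # Keyword not given: pay for a uniqueness check and warn on repeats.
        has_repeats = any(np.unique(ar).size != ar.size for ar in (ar1, ar2))
        if has_repeats:
            warnings.warn(
                "intersect1d received non-unique input; "
                "pass assume_unique explicitly",
                UserWarning,
                stacklevel=2,
            )
        assume_unique = not has_repeats
    return np.intersect1d(ar1, ar2, assume_unique=assume_unique)

print(intersect1d_stage1([1, 2, 3, 4], [1, 4]))  # [1 4]
```

Stage 2 would replace the warnings.warn call with a raised exception; callers who pass assume_unique=True skip the check entirely.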
Robert Cimrman
2009-06-05 05:35:46 UTC
Post by Anne Archibald
Post by j***@gmail.com
intersect1d should throw a domain error if you give it arrays with
non-unique elements, which is not done for speed reasons
It seems to me that this is the basic source of the problem. Perhaps
this can be addressed? I realize maintaining compatibility with the
1. add a keyword argument to intersect1d "assume_unique"; if it is not
present, check for uniqueness and emit a warning if not unique
2. change the warning to an exception
3. change the meaning of the function to that of intersect1d_nu if the
keyword argument is not present
One could do something similar with setmember1d.
This would remove the pitfall of the 1d assumption and the wart of the
_nu names without hampering performance for people who know they have
unique arrays and are in a hurry.
You mean something like:

def intersect1d(ar1, ar2, assume_unique=False):
    if not assume_unique:
        return intersect1d_nu(ar1, ar2)
    else:
        ... # the current code

intersect1d_nu could be still exported to numpy namespace, or not.

I like this. I do not understand, however, what you mean by "remove the
pitfall of the 1d assumption"?

cheers,
r.
Neil Crighton
2009-06-06 08:42:58 UTC
Permalink
Post by Robert Cimrman
Post by Anne Archibald
1. add a keyword argument to intersect1d "assume_unique"; if it is not
present, check for uniqueness and emit a warning if not unique
2. change the warning to an exception
3. change the meaning of the function to that of intersect1d_nu if the
keyword argument is not present
return intersect1d_nu(ar1, ar2)
... # the current code
intersect1d_nu could be still exported to numpy namespace, or not.
+1 - from the user's point of view there should just be intersect1d and
setmember1d (i.e. no '_nu' versions). The assume_unique keyword Robert suggests
can be used if speed is a problem.

I really like in1d (no underscore) as a new name for setmember1d_nu. inarray is
another possibility. I don't like 'ain'; 'a' in front of 'in' detracts from
readability, unlike the extra a in arange.

Can we summarise the discussion in this thread and write up a short proposal
about what we'd like to change in arraysetops, and how to make the changes?
Then it's easy for other people to give their opinion on any changes. I can do
this if no one else has time.


Neil
j***@gmail.com
2009-06-06 11:41:32 UTC
Permalink
Post by Neil Crighton
Post by Anne Archibald
1. add a keyword argument to intersect1d "assume_unique"; if it is not
present, check for uniqueness and emit a warning if not unique
2. change the warning to an exception
3. change the meaning of the function to that of intersect1d_nu if the
keyword argument is not present
1. merge _nu version into one function
-------------------------------------------------------
Post by Neil Crighton
         return intersect1d_nu(ar1, ar2)
         ... # the current code
intersect1d_nu could be still exported to numpy namespace, or not.
+1 - from the user's point of view there should just be intersect1d and
setmember1d (i.e. no '_nu' versions). The assume_unique keyword Robert suggests
can be used if speed is a problem.
+1 on rolling the _nu versions into the plain versions this way; this
would avoid a lot of the confusion.
It would not be a code-breaking API change for existing correct usage
(but there would be some speed regression unless the keyword is added)

depreciate intersect1d_nu
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Post by Neil Crighton
intersect1d_nu could be still exported to numpy namespace, or not.
I would say not, if they are the default branch of the non _nu version

+1 on depreciation


2. alias as "in"
---------------------
Post by Neil Crighton
I really like in1d (no underscore) as a new name for setmember1d_nu. inarray is
another possibility. I don't like 'ain'; 'a' in front of 'in' detracts from
readability, unlike the extra a in arange.
I don't like the extra "a"s either, once namespaces are commonly used

alias setmember1d_nu as `in1d` or `isin1d`, because the function is a
"in" and not a set operation
+1
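
To make the "in" semantics concrete, here is a minimal sketch of such a membership test applied to the arrays from the original question. The helper name in1d_sketch and the searchsorted-based approach are assumptions for illustration, not the code attached to the ticket.

```python
import numpy as np

def in1d_sketch(ar1, ar2):
    """Hypothetical helper: mask that is True where an element of ar1 is in ar2."""
    ar1 = np.asarray(ar1).ravel()
    ar2 = np.unique(ar2)  # sorted and deduplicated
    if ar2.size == 0:
        return np.zeros(ar1.shape, dtype=bool)
    # Binary-search every element of ar1 in the sorted ar2.
    idx = np.searchsorted(ar2, ar1)
    # Elements larger than max(ar2) land past the end; point them at slot 0,
    # which they cannot equal, so they compare False below.
    idx[idx == ar2.size] = 0
    return ar2[idx] == ar1

a = np.array([1, 1, 2, 3, 3, 4])
b = np.array([1, 4])
mask = in1d_sketch(a, b)
selected = a[mask]  # elements of a that occur in b, duplicates kept
```

The mask has the same length as a, so repeated elements of a are kept or dropped individually, which a pure set operation would not do.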
Post by Neil Crighton
Can we summarise the discussion in this thread and write up a short proposal
about what we'd like to change in arraysetops, and how to make the changes?
Then it's easy for other people to give their opinion on any changes. I can do
this if no one else has time.
other points

3. behavior of other set functions
-----------------------------------------------

guarantee that setdiff1d works for non-unique arrays (even when
implementation changes), and change documentation
+1
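
As a quick check of what such a guarantee means in practice, the snippet below calls np.setdiff1d on the non-unique array from the original question. The result in the comment assumes a current NumPy release, where setdiff1d deduplicates its inputs by default; older versions may behave differently.

```python
import numpy as np

a = np.array([1, 1, 2, 3, 3, 4])  # non-unique input
b = np.array([1, 4])

# Values of a that do not occur in b.
diff = np.setdiff1d(a, b)  # array([2, 3]) on current NumPy releases
```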

need to check other functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
union1d: works for non-unique arrays, obvious from source

setxor1d: requires unique arrays
>>> np.setxor1d([1,2,3,3,4,5], [0,0,1,2,2,6])
array([2, 4, 5, 6])
>>> np.setxor1d(np.unique([1,2,3,3,4,5]), np.unique([0,0,1,2,2,6]))
array([0, 3, 4, 5, 6])

setxor: add keyword option and call unique by default
+1 for symmetry
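
A sketch of what "add keyword option and call unique by default" could mean for setxor1d is given below. The name setxor1d_sketch is hypothetical, and the sort-based core is an assumption about how such a function might be written.

```python
import numpy as np

def setxor1d_sketch(ar1, ar2, assume_unique=False):
    """Hypothetical set-xor that deduplicates unless told the inputs are unique."""
    if not assume_unique:
        ar1 = np.unique(ar1)
        ar2 = np.unique(ar2)
    aux = np.concatenate((ar1, ar2))
    if aux.size == 0:
        return aux
    aux.sort()
    # With unique inputs each value occurs once or twice in aux.
    # flag marks the boundaries between runs of equal values; an element
    # that has a boundary on both sides occurs exactly once.
    flag = np.concatenate(([True], aux[1:] != aux[:-1], [True]))
    return aux[flag[1:] & flag[:-1]]

result = setxor1d_sketch([1, 2, 3, 3, 4, 5], [0, 0, 1, 2, 2, 6])
# result matches the np.unique-based call shown above: array([0, 3, 4, 5, 6])
```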

ediff1d and unique1d are defined for non-unique arrays


4. name of keyword
----------------------------

intersect1d(ar1, ar2, assume_unique=False)

alternative isunique=False or just unique=False
+1 less to write


5. module name
-----------------------

rename arraysetops to something easier to read like setfun. I think it
would only affect internal changes since all functions are exported to
the main numpy name space
+1e-4 (I got used to arrayse_tops)


6. keep docs in sync with correct usage
---------------------------------------------------------

obvious


That's my summary and opinions

Josef
Post by Neil Crighton
Neil
Neil Crighton
2009-06-06 20:37:35 UTC
Permalink
Thanks for the summary! I'm +1 on points 1, 2 and 3.

+0 for points 4 and 5 (assume_unique keyword and renaming arraysetops).

Neil

PS. I think you mean deprecate, not depreciate :)
Robert Cimrman
2009-06-08 11:51:26 UTC
Permalink
Hi Josef,

thanks for the summary! I am responding below, later I will make an
enhancement ticket.
Post by j***@gmail.com
Post by Neil Crighton
Post by Robert Cimrman
Post by Anne Archibald
1. add a keyword argument to intersect1d "assume_unique"; if it is not
present, check for uniqueness and emit a warning if not unique
2. change the warning to an exception
3. change the meaning of the function to that of intersect1d_nu if the
keyword argument is not present
1. merge _nu version into one function
-------------------------------------------------------
Post by Neil Crighton
Post by Robert Cimrman
return intersect1d_nu(ar1, ar2)
... # the current code
intersect1d_nu could be still exported to numpy namespace, or not.
+1 - from the user's point of view there should just be intersect1d and
setmember1d (i.e. no '_nu' versions). The assume_unique keyword Robert suggests
can be used if speed is a problem.
+ 1 on rolling the _nu versions this way into the plain version, this
would avoid a lot of the confusion.
It would not be a code breaking API change for existing correct usage
(but some speed regression without adding keyword)
+1
Post by j***@gmail.com
depreciate intersect1d_nu
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Post by Neil Crighton
intersect1d_nu could be still exported to numpy namespace, or not.
I would say not, if they are the default branch of the non _nu version
+1 on depreciation
+0
Post by j***@gmail.com
2. alias as "in"
---------------------
Post by Neil Crighton
I really like in1d (no underscore) as a new name for setmember1d_nu. inarray is
another possibility. I don't like 'ain'; 'a' in front of 'in' detracts from
readability, unlike the extra a in arange.
I don't like the extra "a"s either, ones name spaces are commonly used
alias setmember1d_nu as `in1d` or `isin1d`, because the function is a
"in" and not a set operation
+1
+1
Post by j***@gmail.com
3. behavior of other set functions
-----------------------------------------------
guarantee that setdiff1d works for non-unique arrays (even when
implementation changes), and change documentation
+1
+1, it is useful for non-unique arrays.
Post by j***@gmail.com
need to check other functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
union1d: works for non-unique arrays, obvious from source
Yes.
Post by j***@gmail.com
setxor1d: requires unique arrays
>>> np.setxor1d([1,2,3,3,4,5], [0,0,1,2,2,6])
array([2, 4, 5, 6])
>>> np.setxor1d(np.unique([1,2,3,3,4,5]), np.unique([0,0,1,2,2,6]))
array([0, 3, 4, 5, 6])
setxor: add keyword option and call unique by default
+1 for symmetry
+1 - you mean np.setxor1d(np.unique(a), np.unique(b)) to become
np.setxor1d(a, b, assume_unique=False), right?
Post by j***@gmail.com
ediff1d and unique1d are defined for non-unique arrays
yes
Post by j***@gmail.com
4. name of keyword
----------------------------
intersect1d(ar1, ar2, assume_unique=False)
alternative isunique=False or just unique=False
+1 less to write
We should look at other functions in numpy (and/or scipy) to see what the
common scheme is. -1e-1 to the proposed names, as isunique is singular
only, and unique=False does not show clearly the intent for me. What
about ar1_unique=False, ar2_unique=False - to address each argument
specifically?
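
For illustration only, per-argument flags might look like the sketch below. The function name intersect1d_flags is hypothetical, and the keyword names are taken from the suggestion above; none of this is an existing numpy signature.

```python
import numpy as np

def intersect1d_flags(ar1, ar2, ar1_unique=False, ar2_unique=False):
    """Hypothetical variant that deduplicates only the inputs not known to be unique."""
    if not ar1_unique:
        ar1 = np.unique(ar1)
    if not ar2_unique:
        ar2 = np.unique(ar2)
    aux = np.concatenate((ar1, ar2))
    aux.sort()
    # Each common value forms exactly one adjacent equal pair.
    return aux[:-1][aux[1:] == aux[:-1]]

# Only the second array is known to be unique, so only the first is deduplicated.
common = intersect1d_flags([1, 1, 2, 3, 3, 4], [1, 4], ar2_unique=True)
```

The gain over a single keyword is that a caller with one large unique array and one small unchecked array pays for only one np.unique call.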
Post by j***@gmail.com
5. module name
-----------------------
rename arraysetops to something easier to read like setfun. I think it
would only affect internal changes since all functions are exported to
the main numpy name space
+1e-4 (I got used to arrayse_tops)
+0 (internal change only). Other numpy/scipy submodules containing a
bunch of functions are called *pack (fftpack, arpack, lapack), *alg
(linalg), *utils. *fun is used commonly in the matlab world.
Post by j***@gmail.com
6. keep docs in sync with correct usage
---------------------------------------------------------
obvious
+1

thanks,
r.
Robert Cimrman
2009-06-08 13:38:20 UTC
Permalink
Post by Robert Cimrman
Hi Josef,
thanks for the summary! I am responding below, later I will make an
enhancement ticket.
Done, see http://projects.scipy.org/numpy/ticket/1133
r.

David Warde-Farley
2009-06-05 07:10:59 UTC
Permalink
Post by Anne Archibald
It seems to me that this is the basic source of the problem. Perhaps
this can be addressed? I realize maintaining compatibility with the
current behaviour is necessary, so how about a multistage deprecation:
1. add a keyword argument to intersect1d "assume_unique"; if it is not
present, check for uniqueness and emit a warning if not unique
2. change the warning to an exception
3. change the meaning of the function to that of intersect1d_nu if the
keyword argument is not present
One could do something similar with setmember1d.
+1 on this idea. I've been bitten by the non-unique stuff in the past,
especially with setmember1d, not realizing that both need to be unique.

David