Discussion:
[Numpy-discussion] [Suggestion] Labelled Array
Sérgio
2016-02-12 14:40:37 UTC
Permalink
Hello,

This is my first e-mail; I will try to keep the idea simple.

Similar to a masked array, it would be interesting to use a label array to
guide operations.
>>> x
labelled_array(data =
 [[0 1 2]
  [3 4 5]
  [6 7 8]],
label =
 [[0 1 2]
  [0 1 2]
  [0 1 2]])
>>> sum(x)
array([9, 12, 15])

The operations would create a new axis for label indexing.

You could think of it as a collection of masks, one for each label.

I don't know a way to make something like this efficiently without a loop.
Just wondering...
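
(One loop-free possibility, as a minimal sketch -- labelled_sum is an
illustrative name, not an existing numpy function, and it assumes
non-negative integer labels:

import numpy as np

def labelled_sum(data, label):
    # np.bincount sums data.ravel() grouped by label.ravel() in one pass
    return np.bincount(label.ravel(), weights=data.ravel())

data = np.arange(9).reshape(3, 3)
label = np.tile(np.arange(3), (3, 1))
labelled_sum(data, label)            # -> array([ 9., 12., 15.])
)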

Sérgio.
Benjamin Root
2016-02-12 14:49:51 UTC
Permalink
Seems like you are talking about xarray: https://github.com/pydata/xarray

Cheers!
Ben Root
Benjamin Root
2016-02-12 14:52:54 UTC
Permalink
Re-reading your post, I see you are talking about something different. Not
exactly sure what your use-case is.

Ben Root
Lluís Vilanova
2016-02-15 21:28:12 UTC
Permalink
Post by Benjamin Root
Seems like you are talking about xarray: https://github.com/pydata/xarray
Oh, I wasn't aware of xarray, but there's also this:

https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#basic-indexing
https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#dimension-oblivious-indexing


Cheers,
Lluis
Paul Hobson
2016-02-15 22:31:12 UTC
Permalink
Just for posterity -- any future readers of this thread who need to do
pandas-like operations on record arrays should look at matplotlib's mlab
submodule.

I've been in situations (::cough:: Esri production ::cough::) where I've
had one hand tied behind my back and unable to install pandas. mlab was a
big help there.

https://goo.gl/M7Mi8B
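
(A minimal sketch of the sort of thing the mlab rec functions could do,
from memory of the old API -- rec_groupby was later deprecated and removed
from matplotlib, so this assumes an old version, roughly matplotlib <= 2.x:

import numpy as np
from matplotlib import mlab

r = np.rec.fromarrays(
    [[0, 1, 2, 0, 1, 2], [0., 1., 2., 3., 4., 5.]], names='c,v')

# group rows by 'c' and sum 'v' within each group;
# stats entries are (column, reducer, output_name) tuples
out = mlab.rec_groupby(r, ('c',), (('v', np.sum, 'v_sum'),))
)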

-paul
Benjamin Root
2016-02-19 18:44:16 UTC
Permalink
matplotlib would be more than happy if numpy could take those functions off
our hands! They don't get nearly the visibility they should in matplotlib
because no one expects them to be in a plotting library, and they
don't have any useful unit tests. None of us wrote them, so we are very
hesitant to update them because of that.

Cheers!
Ben Root
Post by j***@gmail.com
I also want to add a historical note here: 'groupby' has been
discussed a couple of times before.
Travis Oliphant even made a NEP for it, and Wes McKinney lightly hinted
at adding it to numpy.
http://thread.gmane.org/gmane.comp.python.numeric.general/37480/focus=37480
http://thread.gmane.org/gmane.comp.python.numeric.general/38272/focus=38299
http://docs.scipy.org/doc/numpy-1.10.1/neps/groupby_additions.html
Travis's idea for a ufunc method 'reduceby' is more along the lines of
what I was originally thinking. Just musing about it, it might cover a few
small cases pandas groupby might not: it could work on arbitrary ufuncs,
and over particular axes of multidimensional data, e.g. to sum over
pixels from NxNx3 image data. But maybe pandas can cover the
multidimensional case through additional index columns or with Panel.
xarray is now covering that area.
There are also the recfunctions in numpy.lib that never got a lot of
attention and expansion.
There were plans to cover more of the matplotlib versions in numpy, but I
have no idea and didn't check what happened to that...
Josef
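
(A minimal sketch of that NxNx3 case in plain numpy, using np.add.at;
the names and shapes here are illustrative, not from the NEP:

import numpy as np

img = np.arange(4 * 4 * 3).reshape(4, 4, 3)      # toy (N, N, 3) image
labels = np.random.randint(0, 5, size=(4, 4))    # one label per pixel
flat = img.reshape(-1, 3)                        # (N*N, 3)
sums = np.zeros((labels.max() + 1, 3))
np.add.at(sums, labels.ravel(), flat)            # per-label, per-channel sums
)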
Allan Haldane
2016-02-13 17:11:07 UTC
Permalink
I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, which essentially does

import numpy as np

def split_classes(c, v):
    # one boolean mask per unique label
    return [v[c == u] for u in np.unique(c)]

Your example could be coded as

>>> [sum(c) for c in split_classes(label, data)]
[9, 12, 15]

I feel I've come across the need for such a function often enough that
it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor performance
because it creates many temporary boolean arrays, so my plan for a PR
was to have a speedy version of it that uses a single pass through v.
(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea? bad idea?) I'd love
to hear them.

I have some further notes and examples here:
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

Allan
Allan Haldane
2016-02-13 18:01:53 UTC
Permalink
Sorry to reply to myself here, but looking at it with fresh eyes, maybe
the performance of the naive version isn't too bad. Here's a comparison
of the naive version vs. a better implementation:

import numpy as np

def split_classes_naive(c, v):
    # one boolean mask per unique label
    return [v[c == u] for u in np.unique(c)]

def split_classes(c, v):
    # sort once, then slice out each label's contiguous run
    perm = c.argsort()
    csrt = c[perm]
    div = np.where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in np.split(perm, div)]
>>> c = np.random.randint(0, 32, size=100000)
>>> v = np.arange(100000)
>>> %timeit split_classes_naive(c, v)
100 loops, best of 3: 8.4 ms per loop
>>> %timeit split_classes(c, v)
100 loops, best of 3: 4.79 ms per loop

In any case, maybe it is useful to Sergio or others.

Allan
j***@gmail.com
2016-02-13 18:29:44 UTC
Permalink
The use cases I recently started to target for similar things are 1 million
or more rows and 10000 uniques in the labels.
The second version should be faster for a large number of uniques, I guess.

Overall numpy is falling far behind pandas in terms of simple groupby
operations. bincount and histogram (IIRC) worked for some cases but are
rather limited.

reduceat looks nice for cases where it applies.
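
(A minimal sketch of where it applies -- a grouped sum, assuming the rows
are first sorted by label:

import numpy as np

c = np.random.randint(0, 32, size=100000)   # labels
v = np.arange(100000)                       # values

perm = c.argsort()                          # reduceat needs contiguous groups
csrt = c[perm]
starts = np.r_[0, np.flatnonzero(csrt[1:] != csrt[:-1]) + 1]
group_sums = np.add.reduceat(v[perm], starts)   # one sum per unique label
)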

In contrast to the full-sized labels in the original post, I only know of
applications where the labels are 1-D, corresponding to rows or columns.

Josef
Jeff Reback
2016-02-13 18:39:34 UTC
Permalink
In [10]: pd.options.display.max_rows=10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0,32,size=100000)

In [15]: v = np.arange(100000)

In [16]: df = DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
c v
0 15 0
1 19 1
2 6 2
3 21 3
4 12 4
... .. ...
99995 7 99995
99996 2 99996
99997 27 99997
99998 28 99998
99999 7 99999

[100000 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
v
c
0 3136
1 3229
2 3093
3 3121
4 3041
.. ...
27 3128
28 3063
29 3147
30 3073
31 3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
v
c
0 49883.384885
1 50233.692165
2 48634.116069
3 50811.743992
4 50505.368629
.. ...
27 49715.349425
28 50363.501469
29 50485.395933
30 50190.155223
31 50691.041748

[32 rows x 1 columns]
Jeff Reback
2016-02-13 18:42:20 UTC
Permalink
These operations get slower as the number of groups increases, but with a
faster function (e.g. the standard ones, which are cythonized), the
constant on the increase is pretty low.

In [23]: c = np.random.randint(0,10000,size=100000)

In [24]: df = DataFrame({'v' : v, 'c' : c})

In [25]: %timeit df.groupby('c').count()
100 loops, best of 3: 3.18 ms per loop

In [26]: len(df.groupby('c').count())
Out[26]: 10000

In [27]: df.groupby('c').count()
Out[27]:
v
c
0 9
1 11
2 7
3 8
4 16
... ..
9995 11
9996 13
9997 13
9998 7
9999 10

[10000 rows x 1 columns]
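
(For comparison, a plain-numpy sketch of the same count and mean for
integer labels, using bincount -- no timing claims, and it assumes every
label occurs at least once:

import numpy as np

c = np.random.randint(0, 10000, size=100000)
v = np.arange(100000)

counts = np.bincount(c, minlength=10000)                      # groupby-count
means = np.bincount(c, weights=v, minlength=10000) / counts   # groupby-mean
)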
j***@gmail.com
2016-02-13 18:51:44 UTC
Permalink
One other difference across use cases is whether this is a single operation,
or whether we want to optimize the data format for a large number of
different calculations. (We have both cases in statsmodels.)

In the latter case it's worth spending some extra computational effort on
rearranging the data to be either sorted or in lists of arrays (I guess,
without having done any timings).
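
(A sketch of that latter case -- pay the argsort once, then reuse the
sorted order for many different reductions; the names are illustrative:

import numpy as np

rng = np.random.RandomState(0)
c = rng.randint(0, 10000, size=1000000)     # labels
v1 = rng.rand(1000000)
v2 = rng.rand(1000000)

perm = c.argsort()                          # the one-time rearrangement cost
csrt = c[perm]
starts = np.r_[0, np.flatnonzero(csrt[1:] != csrt[:-1]) + 1]

sums = np.add.reduceat(v1[perm], starts)    # each statistic is then cheap
maxs = np.maximum.reduceat(v2[perm], starts)
)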

Josef
Allan Haldane
2016-02-14 03:41:13 UTC
Permalink
Impressive!

Possibly there's still a case for including a 'groupby' function in
numpy itself since it's a generally useful operation, but I do see less
of a need given the nice pandas functionality.

At least, next time someone asks a stackoverflow question like the ones
below, someone should tell them to use pandas!

(copied from my gist for future list reference).

http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy

Allan
Nathaniel Smith
2016-02-13 18:16:19 UTC
Permalink
I believe this is basically a groupby, which is one of pandas's core
competencies... even if numpy were to add some utilities for this kind of
thing, I doubt we'd do as well as they do, so you might check whether
pandas works for you first :-)
Sérgio
2016-02-16 14:05:51 UTC
Permalink
>>> image
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]],

       [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])
>>> label
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7]])
>>> dt = pd.DataFrame(np.vstack((label.ravel(), image.reshape(3, 20))).T)
>>> labelled_image = dt.groupby(0)
>>> labelled_image.mean().values
array([[ 0, 20, 40],
       [ 3, 23, 43],
       [ 6, 26, 46],
       [ 9, 29, 49],
       [10, 30, 50],
       [13, 33, 53],
       [16, 36, 56],
       [19, 39, 59]])

Sergio