of a need given the nice pandas functionality.
(copied from my gist for future list reference).
Post by Jeff Reback
In [10]: pd.options.display.max_rows=10
In [13]: np.random.seed(1234)
In [14]: c = np.random.randint(0,32,size=100000)
In [15]: v = np.arange(100000)
In [16]: df = DataFrame({'v' : v, 'c' : c})
In [17]: df
c v
0 15 0
1 19 1
2 6 2
3 21 3
4 12 4
... .. ...
99995 7 99995
99996 2 99996
99997 27 99997
99998 28 99998
99999 7 99999
[100000 rows x 2 columns]
In [19]: df.groupby('c').count()
0 3136
1 3229
2 3093
3 3121
4 3041
.. ...
27 3128
28 3063
29 3147
30 3073
31 3090
[32 rows x 1 columns]
In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop
In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop
In [22]: df.groupby('c').mean()
0 49883.384885
1 50233.692165
2 48634.116069
3 50811.743992
4 50505.368629
.. ...
27 49715.349425
28 50363.501469
29 50485.395933
30 50190.155223
31 50691.041748
[32 rows x 1 columns]
On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane
Sorry, to reply to myself here, but looking at it with fresh
eyes maybe the performance of the naive version isn't too bad.
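from numpy import arange, split, unique, where
from numpy.random import randint

def split_classes_naive(c, v):
    return [v[c == u] for u in unique(c)]

def split_classes(c, v):
    perm = c.argsort()
    csrt = c[perm]
    div = where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in split(perm, div)]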
c = randint(0,32,size=100000)
v = arange(100000)
%timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
%timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop
The use cases I recently started to target for similar things have 1
million or more rows and 10000 unique values in the labels.
The second version should be faster for large number of uniques, I guess.
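To check that guess on a given machine, the two versions can be timed side by side. This is only a sketch: the `compare` helper, its parameters, and the use of `kind="stable"` (so that both versions return each group in the same order) are assumptions added here, not part of the original messages.

```python
import timeit
import numpy as np

def split_classes_naive(c, v):
    # One boolean mask per unique label, so work grows with n_rows * n_unique.
    return [v[c == u] for u in np.unique(c)]

def split_classes(c, v):
    # Sort the indices by label once, then cut at the label boundaries.
    # kind="stable" (an addition) keeps the original order inside each group.
    perm = c.argsort(kind="stable")
    csrt = c[perm]
    div = np.where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in np.split(perm, div)]

def compare(n_rows, n_unique, repeats=3, seed=0):
    # Hypothetical helper: best-of-N wall time in seconds for each version.
    rng = np.random.default_rng(seed)
    c = rng.integers(0, n_unique, size=n_rows)
    v = np.arange(n_rows)
    t_naive = min(timeit.repeat(lambda: split_classes_naive(c, v),
                                number=1, repeat=repeats))
    t_sort = min(timeit.repeat(lambda: split_classes(c, v),
                               number=1, repeat=repeats))
    return t_naive, t_sort
```

Calling `compare(100000, 32)` reproduces the setup timed above. Going straight to 1 million rows with 10000 unique labels makes the naive version do on the order of 10^10 comparisons, so it is probably wise to scale up gradually.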
Overall numpy is falling far behind pandas in terms of simple
groupby operations. bincount and histogram (IIRC) worked for some
cases but are rather limited.
reduceat looks nice for cases where it applies.
In contrast to the full sized labels in the original post, I only
know of applications where the labels are 1-D corresponding to rows
or columns.
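For the simple case of a grouped sum over 1-D labels, both tools mentioned above can be sketched as follows. The function names are hypothetical, and the sketch assumes non-empty input and, for the bincount version, non-negative integer labels.

```python
import numpy as np

def group_sum_bincount(c, v):
    # Works only for non-negative integer labels; the result is indexed
    # by label value and is always floating point.
    return np.bincount(c, weights=v)

def group_sum_reduceat(c, v):
    # Sort by label, find where each run of equal labels starts, then
    # reduce each run with np.add.reduceat. Assumes c is non-empty.
    perm = c.argsort(kind="stable")
    csrt = c[perm]
    starts = np.concatenate(([0], np.where(csrt[1:] != csrt[:-1])[0] + 1))
    return csrt[starts], np.add.reduceat(v[perm], starts)
```

The reduceat version also works for labels that are not small integers, and other ufuncs (for example `np.maximum`) can be swapped in for `np.add`; bincount is limited to sums and counts.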
In any case, maybe it is useful to Sergio or others.
I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, which
essentially does
def split_classes(c, v):
    return [v[c == u] for u in unique(c)]
Your example could be coded as
[sum(c) for c in split_classes(label, data)]
[9, 12, 15]
I feel I've come across the need for such a function often enough that
it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor performance
because it creates many temporary boolean arrays, so my plan for a PR
was to have a speedy version of it that uses a single pass through v.
(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea? bad idea?) I'd love
to hear.
This is my first e-mail, so I will try to keep the idea simple.
Similar to a masked array, it would be interesting to use a label
array to guide operations.
labelled_array(data =
[[0 1 2]
[3 4 5]
[6 7 8]],
label =
[[0 1 2]
[0 1 2]
[0 1 2]])
array([9, 12, 15])
The operations would create a new axis for label indexing.
You could think of it as a collection of masks, one for each label.
I don't know a way to make something like this efficiently without a
loop. Just wondering...
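For the sum in the example above, one loop-free possibility is sketched below. The names `labelled_sum` and `labelled_sum_masks` are hypothetical, and the sketch assumes the labels are small non-negative integers with the same shape as the data.

```python
import numpy as np

def labelled_sum(data, label):
    # Flatten both arrays and let bincount add up the data per label.
    # With weights, bincount always returns a float array.
    return np.bincount(label.ravel(), weights=data.ravel())

def labelled_sum_masks(data, label):
    # The "collection of masks" view: one boolean mask per label, stacked
    # along a new leading axis. Memory grows with the number of labels.
    labels = np.unique(label)
    masks = label == labels[:, None, None]   # shape (n_labels, rows, cols)
    return (data * masks).sum(axis=(1, 2))

data = np.arange(9).reshape(3, 3)            # [[0 1 2] [3 4 5] [6 7 8]]
label = np.tile(np.arange(3), (3, 1))        # [[0 1 2] [0 1 2] [0 1 2]]
print(labelled_sum(data, label))             # [ 9. 12. 15.]
print(labelled_sum_masks(data, label))       # [ 9 12 15]
```

The mask version is written for 2-D input only and is mainly there to show the new label axis; the bincount version avoids building the masks at all.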
NumPy-Discussion mailing list