Skip Montanaro
2016-07-03 02:08:30 UTC
(I'm probably going to botch the description...)
Suppose I have a 2D array of Python objects where the first n elements of
each row form a key and the rest of the elements form the value. Each key
can (and generally does) occur multiple times. I'd like to generate a new
array consisting of just the first (or last) row for each distinct key,
with rows retaining their relative order on output.
For example, suppose I have this array with key length 2:
[ 'a', 27, 14.5 ]
[ 'b', 12, 99.0 ]
[ 'a', 27, 15.7 ]
[ 'a', 17, 100.3 ]
[ 'b', 12, -329.0 ]
Selecting the first occurrence of each key would return this array:
[ 'a', 27, 14.5 ]
[ 'b', 12, 99.0 ]
[ 'a', 17, 100.3 ]
while selecting the last occurrence would return this array:
[ 'a', 27, 15.7 ]
[ 'a', 17, 100.3 ]
[ 'b', 12, -329.0 ]
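For concreteness, a plain-Python sketch of the filter I mean (key_len is
the number of key columns; this is the Python-level loop I'd like to
avoid on a million rows):

```python
def first_rows(rows, key_len):
    """Keep the first row seen for each key (the first key_len columns),
    preserving the input order of the surviving rows."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[:key_len])
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def last_rows(rows, key_len):
    """Keep the last row seen for each key. Scanning the reversed input
    and reversing the result preserves the expected relative order."""
    return first_rows(rows[::-1], key_len)[::-1]

rows = [
    ['a', 27, 14.5],
    ['b', 12, 99.0],
    ['a', 27, 15.7],
    ['a', 17, 100.3],
    ['b', 12, -329.0],
]
```

With the example data above, first_rows(rows, 2) and last_rows(rows, 2)
produce the two result arrays shown.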
In real life, my array is a bit larger than this example, with the input
being on the order of a million rows, and the output being around 5000
rows. Avoiding processing all those extra rows at the Python level would
speed things up.
I don't know what this filter might be called (though I'm sure I haven't
thought of something new), so searching Google or Bing for it seems
fruitless. It strikes me as something numpy or pandas might already have
in their bag of tricks.
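If pandas does have this trick, DataFrame.drop_duplicates with
keep='first' or keep='last' looks like a candidate; a sketch using the
example data, treating the first two (positionally named) columns as the
key:

```python
import pandas as pd

df = pd.DataFrame([
    ['a', 27, 14.5],
    ['b', 12, 99.0],
    ['a', 27, 15.7],
    ['a', 17, 100.3],
    ['b', 12, -329.0],
])

# Key is the first two columns; surviving rows keep their original
# relative order in both cases.
first = df.drop_duplicates(subset=[0, 1], keep='first')
last = df.drop_duplicates(subset=[0, 1], keep='last')
```

Since drop_duplicates hashes the key columns internally, the million-row
scan should stay out of the Python-level loop.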
Pointers appreciated,
Skip