Discussion:
[Numpy-discussion] Dynamic array list implementation
Nicolas P. Rougier
2015-12-22 11:47:12 UTC
Permalink
I've coded a typed dynamic list based on numpy array (needed for the glumpy project).
Code is available from https://github.com/rougier/numpy-list


A Numpy array list is a strongly typed list whose type can be anything that can be interpreted as a numpy data type.
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
You can add several items at once by specifying common or individual size: a single scalar means all items are the same size while a list of sizes is used to specify individual item sizes.
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']


Nicolas
Chris Barker
2015-12-22 19:19:39 UTC
Permalink
sorry for being so lazy as to not go look at the project pages, but....

This sounds like it could be really useful, and maybe supercise a coupl eof
half-baked projects of mine. But -- what does "dynamic" mean?

- can you append to these arrays?
- can it support "ragged arrrays" -- it looks like it does.
Post by Nicolas P. Rougier
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
for row in L:
print row
Post by Nicolas P. Rougier
print(L.data)
[0 1 2 3 4 5 6 7 8
is .data a regular old 1-d numpy array?
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L * 5

L* some_array

in which case, how does it do broadcasting???

Thanks,

-CHB
Post by Nicolas P. Rougier
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']
Nicolas
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Nicolas P. Rougier
2015-12-22 20:21:42 UTC
Permalink
Yes, you can append/insert/remove items.
It works pretty much like a python list in fact (but with a single data type for all elements).

Nicolas
Post by Chris Barker
sorry for being so lazy as to not go look at the project pages, but....
This sounds like it could be really useful, and maybe supercise a coupl eof half-baked projects of mine. But -- what does "dynamic" mean?
- can you append to these arrays?
- can it support "ragged arrrays" -- it looks like it does.
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
print row
print(L.data)
[0 1 2 3 4 5 6 7 8
is .data a regular old 1-d numpy array?
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L * 5
L* some_array
in which case, how does it do broadcasting???
Thanks,
-CHB
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']
Nicolas
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Stephan Hoyer
2015-12-23 08:34:47 UTC
Permalink
We have a type similar to this (a typed list) internally in pandas, although it is restricted to a single dimension and far from feature complete -- it only has .append and a .to_array() method for converting to a 1d numpy array. Our version is written in Cython, and we use it for performance reasons when we would otherwise need to create a list of unknown length:



https://github.com/pydata/pandas/blob/v0.17.1/pandas/hashtable.pyx#L99



In my experience, it's several times faster than using a builtin list from Cython, which makes sense given that it needs to copy about 1/3 the data (no type or reference count for individual elements). Obviously, it uses 1/3 the space to store the data, too. We currently don't expose this object externally, but it could be an interesting project to adapt this code into a standalone project that could be more broadly useful.






Cheers,

Stephan
Post by Chris Barker
sorry for being so lazy as to not go look at the project pages, but....
This sounds like it could be really useful, and maybe supercise a coupl eof
half-baked projects of mine. But -- what does "dynamic" mean?
- can you append to these arrays?
- can it support "ragged arrrays" -- it looks like it does.
Post by Nicolas P. Rougier
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
print row
Post by Nicolas P. Rougier
print(L.data)
[0 1 2 3 4 5 6 7 8
is .data a regular old 1-d numpy array?
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L * 5
L* some_array
in which case, how does it do broadcasting???
Thanks,
-CHB
Post by Nicolas P. Rougier
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']
Nicolas
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Nicolas P. Rougier
2015-12-23 12:01:25 UTC
Permalink
Typed list in numpy would be a nice addition indeed and your cython implementation is nice (and small).

In my case I need to ensure a contiguous storage to allow easy upload onto the GPU.
Post by Stephan Hoyer
python benchmark.py
Python list, append 100000 items: 0.01161
Array list, append 100000 items: 0.46854
Array list, append 100000 items at once: 0.05801
Python list, prepend 100000 items: 1.96168
Array list, prepend 100000 items: 12.83371
Array list, append 100000 items at once: 0.06002
Post by Stephan Hoyer
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
for item in L: print(item)
[0]
[1 2]
[3 4 5]
[6 7 8 9]
Post by Stephan Hoyer
print (type(L.data))
<class 'numpy.ndarray'>
Post by Stephan Hoyer
print(L.data.dtype)
int64
Post by Stephan Hoyer
print(L.data.shape)
(10,)


I did not implement operations yet, but it would be a matter for transferring call to the underlying numpy data array.
Post by Stephan Hoyer
L._data *= 2
print(L)
[[0], [4 8], [12 16 20], [24 28 32 36]]
Post by Stephan Hoyer
https://github.com/pydata/pandas/blob/v0.17.1/pandas/hashtable.pyx#L99
In my experience, it's several times faster than using a builtin list from Cython, which makes sense given that it needs to copy about 1/3 the data (no type or reference count for individual elements). Obviously, it uses 1/3 the space to store the data, too. We currently don't expose this object externally, but it could be an interesting project to adapt this code into a standalone project that could be more broadly useful.
Cheers,
Stephan
sorry for being so lazy as to not go look at the project pages, but....
This sounds like it could be really useful, and maybe supercise a coupl eof half-baked projects of mine. But -- what does "dynamic" mean?
- can you append to these arrays?
- can it support "ragged arrrays" -- it looks like it does.
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
print row
print(L.data)
[0 1 2 3 4 5 6 7 8
is .data a regular old 1-d numpy array?
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L * 5
L* some_array
in which case, how does it do broadcasting???
Thanks,
-CHB
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']
Nicolas
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Chris Barker
2015-12-24 18:19:32 UTC
Permalink
On Wed, Dec 23, 2015 at 4:01 AM, Nicolas P. Rougier <
Post by Nicolas P. Rougier
Typed list in numpy would be a nice addition indeed and your cython
implementation is nice (and small).
It seems we have a lot of duplicated effort here. Pernonally, I have two
needs:

1) ragged arrays
2) "growable" arrays.

I have semi-complete version of both of these, which are completely
independent -- not sure if it makes sense to combine them, I suppose not.

But we've talked a bit about "typed list", and I'm not sure what that
means -- is it something that is entirely like a python list, except that
all the elements have the same type?

Anyway: I've been thinking about this fromt eh opposite direction: I want a
numpy array that you can append/extend. This comes from the fact that it's
not uncommon to need to build up an array where you don't know how large it
will be when you start. The common recommendation for doing that now is to
built it up in a list, and then, when you are done, turn it into an ndarray.

But that means you are limited to python types (or putting numpy scalars in
a list...), and it's not very memory efficient.

My version used a ndarray internally, and over allocates it a bit, using
ndarray.resize() to resize. this means that you can get the data pointer if
you want for Cython, etc... but also that it's getting re-allocated, so
that pointer is fragile, and you don't want other arrays to have views on
it.

Interestingly, if you are adding one float, for example, at a time to the
array, it's actually a bit faster to build it up in a list, and then make
an array out of it.

But it is more memory efficient and faster if you are using numpy dtypes
and especially if you are extend()ing it with chunks from other arrays.

I also have a not-quite finished version in Cython that statically handles
the core C data types -- that should be faster, but I haven't really
profiled it.

I'll try to get the code up on gitHub.

It would be nice to combine efforts.

-CHB
Post by Nicolas P. Rougier
In my case I need to ensure a contiguous storage to allow easy upload onto the GPU.
Post by Stephan Hoyer
python benchmark.py
Python list, append 100000 items: 0.01161
Array list, append 100000 items: 0.46854
Array list, append 100000 items at once: 0.05801
Python list, prepend 100000 items: 1.96168
Array list, prepend 100000 items: 12.83371
Array list, append 100000 items at once: 0.06002
Post by Stephan Hoyer
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
for item in L: print(item)
[0]
[1 2]
[3 4 5]
[6 7 8 9]
Post by Stephan Hoyer
print (type(L.data))
<class 'numpy.ndarray'>
Post by Stephan Hoyer
print(L.data.dtype)
int64
Post by Stephan Hoyer
print(L.data.shape)
(10,)
I did not implement operations yet, but it would be a matter for
transferring call to the underlying numpy data array.
Post by Stephan Hoyer
L._data *= 2
print(L)
[[0], [4 8], [12 16 20], [24 28 32 36]]
Post by Stephan Hoyer
We have a type similar to this (a typed list) internally in pandas,
although it is restricted to a single dimension and far from feature
complete -- it only has .append and a .to_array() method for converting to
a 1d numpy array. Our version is written in Cython, and we use it for
performance reasons when we would otherwise need to create a list of
Post by Stephan Hoyer
https://github.com/pydata/pandas/blob/v0.17.1/pandas/hashtable.pyx#L99
In my experience, it's several times faster than using a builtin list
from Cython, which makes sense given that it needs to copy about 1/3 the
data (no type or reference count for individual elements). Obviously, it
uses 1/3 the space to store the data, too. We currently don't expose this
object externally, but it could be an interesting project to adapt this
code into a standalone project that could be more broadly useful.
Post by Stephan Hoyer
Cheers,
Stephan
sorry for being so lazy as to not go look at the project pages, but....
This sounds like it could be really useful, and maybe supercise a coupl
eof half-baked projects of mine. But -- what does "dynamic" mean?
Post by Stephan Hoyer
- can you append to these arrays?
- can it support "ragged arrrays" -- it looks like it does.
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
print row
print(L.data)
[0 1 2 3 4 5 6 7 8
is .data a regular old 1-d numpy array?
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L * 5
L* some_array
in which case, how does it do broadcasting???
Thanks,
-CHB
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']
Nicolas
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Chris Barker
2015-12-24 18:23:24 UTC
Permalink
Post by Chris Barker
I'll try to get the code up on gitHub.
Hey look -- it's already there:

https://github.com/PythonCHB/NumpyExtras

too many gitHub accounts.....

Here is the list/growable array/ accumulator:

https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/accumulator.py

And here is the ragged array:

https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/ragged_array.py

I haven't touched either of these for a while -- not really sure what state
they are in.

-CHB
Post by Chris Barker
It would be nice to combine efforts.
-CHB
Post by Nicolas P. Rougier
In my case I need to ensure a contiguous storage to allow easy upload onto the GPU.
Post by Stephan Hoyer
python benchmark.py
Python list, append 100000 items: 0.01161
Array list, append 100000 items: 0.46854
Array list, append 100000 items at once: 0.05801
Python list, prepend 100000 items: 1.96168
Array list, prepend 100000 items: 12.83371
Array list, append 100000 items at once: 0.06002
Post by Stephan Hoyer
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
for item in L: print(item)
[0]
[1 2]
[3 4 5]
[6 7 8 9]
Post by Stephan Hoyer
print (type(L.data))
<class 'numpy.ndarray'>
Post by Stephan Hoyer
print(L.data.dtype)
int64
Post by Stephan Hoyer
print(L.data.shape)
(10,)
I did not implement operations yet, but it would be a matter for
transferring call to the underlying numpy data array.
Post by Stephan Hoyer
L._data *= 2
print(L)
[[0], [4 8], [12 16 20], [24 28 32 36]]
Post by Stephan Hoyer
We have a type similar to this (a typed list) internally in pandas,
although it is restricted to a single dimension and far from feature
complete -- it only has .append and a .to_array() method for converting to
a 1d numpy array. Our version is written in Cython, and we use it for
performance reasons when we would otherwise need to create a list of
Post by Stephan Hoyer
https://github.com/pydata/pandas/blob/v0.17.1/pandas/hashtable.pyx#L99
In my experience, it's several times faster than using a builtin list
from Cython, which makes sense given that it needs to copy about 1/3 the
data (no type or reference count for individual elements). Obviously, it
uses 1/3 the space to store the data, too. We currently don't expose this
object externally, but it could be an interesting project to adapt this
code into a standalone project that could be more broadly useful.
Post by Stephan Hoyer
Cheers,
Stephan
sorry for being so lazy as to not go look at the project pages, but....
This sounds like it could be really useful, and maybe supercise a coupl
eof half-baked projects of mine. But -- what does "dynamic" mean?
Post by Stephan Hoyer
- can you append to these arrays?
- can it support "ragged arrrays" -- it looks like it does.
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
print row
print(L.data)
[0 1 2 3 4 5 6 7 8
is .data a regular old 1-d numpy array?
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L * 5
L* some_array
in which case, how does it do broadcasting???
Thanks,
-CHB
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']
Nicolas
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Chris Barker
2015-12-28 18:58:18 UTC
Permalink
On Wed, Dec 23, 2015 at 4:01 AM, Nicolas P. Rougier <
Post by Nicolas P. Rougier
python benchmark.py
Python list, append 100000 items: 0.01161
Array list, append 100000 items: 0.46854
are you pre-allocating any extra space? if not -- it's going to be really,
really pokey when adding a little bit at a time.

With my Accumulator class:

https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/accumulator.py

I pre-allocate a larger numpy array to start, and it gets re-allocated,
with some extra, when filled, using ndarray.resize()

this is quite fast.

These are settable parameters in the class:

DEFAULT_BUFFER_SIZE = 128 # original buffer created.
BUFFER_EXTEND_SIZE = 1.25 # array.array uses 1+1/16 -- that seems small to
me.


I looked at the code in array.array (and list, I think), and it does stuff
to optimize very small arrays, which I figured wasn't the use-case here :-)

But I did a bunch of experimentation, and as long as you pre-allocate
_some_ it doesn't make much difference how much :-)

BTW,

I just went in an updated and tested the Accumulator class code -- it
needed some tweaks, but it's working now.

The cython version is in an unknown state...

some profiling:

In [11]: run profile_accumulator.py


In [12]: timeit accum1(10000)

100 loops, best of 3: 3.91 ms per loop

In [13]: timeit list1(10000)

1000 loops, best of 3: 1.15 ms per loop

These are simply appending 10,000 integers in a loop -- with teh list, the
list is turned into a numpy array at the end. So it's still faster to
accumulate in a list, then make an array, but only a about a factor of 3 --
I think this is because you are staring with a python integer -- with the
accumulator function, you need to be checking type and pulling a native
integer out with each append. but a list can append a python object with no
type checking or anything.

Then the conversion from list to array is all in C.

Note that the accumulator version is still more memory efficient...

In [14]: timeit accum2(10000)

100 loops, best of 3: 3.84 ms per loop

this version pre-allocated the whole internal buffer -- not much faster the
buffer re-allocation isn't a big deal (thanks to ndarray.resize using
realloc(), and not creating a new numpy array)

In [24]: timeit list_extend1(100000)

100 loops, best of 3: 4.15 ms per loop

In [25]: timeit accum_extend1(100000)

1000 loops, best of 3: 1.37 ms per loop

This time, the stuff is added in chunks 100 elements at a time -- the
chunks being created ahead of time -- a list with range() the first time,
and an array with arange() the second. much faster to extend with arrays...

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Nicolas P. Rougier
2015-12-30 14:34:28 UTC
Permalink
Post by Nicolas P. Rougier
python benchmark.py
Python list, append 100000 items: 0.01161
Array list, append 100000 items: 0.46854
are you pre-allocating any extra space? if not -- it's going to be really, really pokey when adding a little bit at a time.
Yes, I’m preallocating but it might not be optimal at all given your implementation is much faster.
I’ll try to adapt your code. Thanks.
Post by Nicolas P. Rougier
https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/accumulator.py
I pre-allocate a larger numpy array to start, and it gets re-allocated, with some extra, when filled, using ndarray.resize()
this is quite fast.
DEFAULT_BUFFER_SIZE = 128 # original buffer created.
BUFFER_EXTEND_SIZE = 1.25 # array.array uses 1+1/16 -- that seems small to me.
I looked at the code in array.array (and list, I think), and it does stuff to optimize very small arrays, which I figured wasn't the use-case here :-)
But I did a bunch of experimentation, and as long as you pre-allocate _some_ it doesn't make much difference how much :-)
BTW,
I just went in an updated and tested the Accumulator class code -- it needed some tweaks, but it's working now.
The cython version is in an unknown state...
In [11]: run profile_accumulator.py
In [12]: timeit accum1(10000)
100 loops, best of 3: 3.91 ms per loop
In [13]: timeit list1(10000)
1000 loops, best of 3: 1.15 ms per loop
These are simply appending 10,000 integers in a loop -- with teh list, the list is turned into a numpy array at the end. So it's still faster to accumulate in a list, then make an array, but only a about a factor of 3 -- I think this is because you are staring with a python integer -- with the accumulator function, you need to be checking type and pulling a native integer out with each append. but a list can append a python object with no type checking or anything.
Then the conversion from list to array is all in C.
Note that the accumulator version is still more memory efficient...
In [14]: timeit accum2(10000)
100 loops, best of 3: 3.84 ms per loop
this version pre-allocated the whole internal buffer -- not much faster the buffer re-allocation isn't a big deal (thanks to ndarray.resize using realloc(), and not creating a new numpy array)
In [24]: timeit list_extend1(100000)
100 loops, best of 3: 4.15 ms per loop
In [25]: timeit accum_extend1(100000)
1000 loops, best of 3: 1.37 ms per loop
This time, the stuff is added in chunks 100 elements at a time -- the chunks being created ahead of time -- a list with range() the first time, and an array with arange() the second. much faster to extend with arrays...
-CHB
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Chris Barker
2015-12-31 18:08:34 UTC
Permalink
On Wed, Dec 30, 2015 at 6:34 AM, Nicolas P. Rougier <
Post by Nicolas P. Rougier
Post by Nicolas P. Rougier
python benchmark.py
Python list, append 100000 items: 0.01161
Array list, append 100000 items: 0.46854
are you pre-allocating any extra space? if not -- it's going to be
really, really pokey when adding a little bit at a time.
Yes, I’m preallocating but it might not be optimal at all given your
implementation is much faster.
I’ll try to adapt your code. Thanks.
sounds good -- I'll try to take a look at yours soon - maybe we can merge
the projects. MIne is only operational in one small place, I think.

-CHB
Post by Nicolas P. Rougier
https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/accumulator.py
Post by Nicolas P. Rougier
I pre-allocate a larger numpy array to start, and it gets re-allocated,
with some extra, when filled, using ndarray.resize()
Post by Nicolas P. Rougier
this is quite fast.
DEFAULT_BUFFER_SIZE = 128 # original buffer created.
BUFFER_EXTEND_SIZE = 1.25 # array.array uses 1+1/16 -- that seems small
to me.
Post by Nicolas P. Rougier
I looked at the code in array.array (and list, I think), and it does
stuff to optimize very small arrays, which I figured wasn't the use-case
here :-)
Post by Nicolas P. Rougier
But I did a bunch of experimentation, and as long as you pre-allocate
_some_ it doesn't make much difference how much :-)
Post by Nicolas P. Rougier
BTW,
I just went in an updated and tested the Accumulator class code -- it
needed some tweaks, but it's working now.
Post by Nicolas P. Rougier
The cython version is in an unknown state...
In [11]: run profile_accumulator.py
In [12]: timeit accum1(10000)
100 loops, best of 3: 3.91 ms per loop
In [13]: timeit list1(10000)
1000 loops, best of 3: 1.15 ms per loop
These are simply appending 10,000 integers in a loop -- with teh list,
the list is turned into a numpy array at the end. So it's still faster to
accumulate in a list, then make an array, but only a about a factor of 3 --
I think this is because you are staring with a python integer -- with the
accumulator function, you need to be checking type and pulling a native
integer out with each append. but a list can append a python object with no
type checking or anything.
Post by Nicolas P. Rougier
Then the conversion from list to array is all in C.
Note that the accumulator version is still more memory efficient...
In [14]: timeit accum2(10000)
100 loops, best of 3: 3.84 ms per loop
this version pre-allocated the whole internal buffer -- not much faster
the buffer re-allocation isn't a big deal (thanks to ndarray.resize using
realloc(), and not creating a new numpy array)
Post by Nicolas P. Rougier
In [24]: timeit list_extend1(100000)
100 loops, best of 3: 4.15 ms per loop
In [25]: timeit accum_extend1(100000)
1000 loops, best of 3: 1.37 ms per loop
This time, the stuff is added in chunks 100 elements at a time -- the
chunks being created ahead of time -- a list with range() the first time,
and an array with arange() the second. much faster to extend with arrays...
Post by Nicolas P. Rougier
-CHB
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Sebastian Berg
2015-12-23 12:31:27 UTC
Permalink
Post by Stephan Hoyer
We have a type similar to this (a typed list) internally in pandas,
although it is restricted to a single dimension and far from feature
complete -- it only has .append and a .to_array() method for
converting to a 1d numpy array. Our version is written in Cython, and
we use it for performance reasons when we would otherwise need to
https://github.com/pydata/pandas/blob/v0.17.1/pandas/hashtable.pyx#L99
Probably is a bit orthogonal since I guess you want/need cython, but
pythons buildin array.array should get you there pretty much as well. Of
course it requires the C typecode (though that should not be hard to
get) and does not support strings.

- Sebastian
Post by Stephan Hoyer
In my experience, it's several times faster than using a builtin list
from Cython, which makes sense given that it needs to copy about 1/3
the data (no type or reference count for individual elements).
Obviously, it uses 1/3 the space to store the data, too. We currently
don't expose this object externally, but it could be an interesting
project to adapt this code into a standalone project that could be
more broadly useful.
Cheers,
Stephan
sorry for being so lazy as to not go look at the project pages, but....
This sounds like it could be really useful, and maybe
supercise a coupl eof half-baked projects of mine. But -- what
does "dynamic" mean?
- can you append to these arrays?
- can it support "ragged arrrays" -- it looks like it does.
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
print row
print(L.data)
[0 1 2 3 4 5 6 7 8
is .data a regular old 1-d numpy array?
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L * 5
L* some_array
in which case, how does it do broadcasting???
Thanks,
-CHB
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']
Nicolas
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Chris Barker
2015-12-24 18:08:20 UTC
Permalink
Post by Sebastian Berg
Probably is a bit orthogonal since I guess you want/need cython, but
pythons builtin array.array should get you there pretty much as well.
I don't think it's orthogonal to cython -- you can access an array.array
directly from within cython -- it's actually about the easiest way to get a
array-like object in Cython/C (which you can then access via a memoryview,
etc).

Though I don't know there is a python object (i.e. pointer) option there.
(nor text).

-CHB
Post by Sebastian Berg
Of
course it requires the C typecode (though that should not be hard to
get) and does not support strings.
- Sebastian
Post by Stephan Hoyer
In my experience, it's several times faster than using a builtin list
from Cython, which makes sense given that it needs to copy about 1/3
the data (no type or reference count for individual elements).
Obviously, it uses 1/3 the space to store the data, too. We currently
don't expose this object externally, but it could be an interesting
project to adapt this code into a standalone project that could be
more broadly useful.
Cheers,
Stephan
sorry for being so lazy as to not go look at the project pages, but....
This sounds like it could be really useful, and maybe
supercise a coupl eof half-baked projects of mine. But -- what
does "dynamic" mean?
- can you append to these arrays?
- can it support "ragged arrrays" -- it looks like it does.
L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
print(L)
[[0], [1 2], [3 4 5], [6 7 8 9]]
print row
print(L.data)
[0 1 2 3 4 5 6 7 8
is .data a regular old 1-d numpy array?
L = ArrayList( np.arange(10), [3,3,4])
print(L)
[[0 1 2], [3 4 5], [6 7 8 9]]
print(L.data)
[0 1 2 3 4 5 6 7 8 9]
L * 5
L* some_array
in which case, how does it do broadcasting???
Thanks,
-CHB
L = ArrayList(["Hello", "world", "!"])
print(L[0])
'Hello'
L[1] = "brave new world"
print(L)
['Hello', 'brave new world', '!']
Nicolas
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Loading...