Discussion:
[Numpy-discussion] pickling/unpickling numpy.void and numpy.record for multiprocessing
Martin Spacek
2010-02-26 22:41:25 UTC
I have a 1D structured ndarray with several different fields in the dtype. I'm
using multiprocessing.Pool.map() to iterate over this structured ndarray,
passing one entry (of type numpy.void) at a time to the function to be called by
each process in the pool. After much confusion about why this wasn't working, I
finally realized that unpickling a previously pickled numpy.void results in
garbage data:
>>> import numpy as np
>>> x = np.zeros((2,), dtype=('i4,f4,a10'))
>>> x[:] = [(1,2.,'Hello'), (2,3.,"World")]
>>> x
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])
>>> x[0]
(1, 2.0, 'Hello')
>>> type(x[0])
<type 'numpy.void'>
>>> import pickle
>>> s = pickle.dumps(x[0])
>>> newx0 = pickle.loads(s)
>>> newx0
(30917960, 1.6904535998413144e-38, '\xd0\xef\x1c\x1eZ\x03\x00d')
>>> type(newx0)
<type 'numpy.void'>
>>> newx0.dtype
dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])
>>> x[0].dtype
dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])
>>> np.version.version
'1.4.0'

This also seems to be the case for recarrays with their numpy.record entries.
I've tried using pickle and cPickle, with both the oldest and the newest
pickling protocol. This is in numpy 1.4 on win32 and win64, and numpy 1.3 on
32-bit linux. I'm using Python 2.6.4 in all cases. I also just tried it on
Python 2.5.2 with numpy 1.0.4. All have the same result, although the garbage
data is different each time.

I suppose numpy.void is as it suggests, a pointer to a specific place in memory.
I'm just surprised that this pointer isn't dereferenced before pickling. Or is
it? I'm not skilled in interpreting the strings returned by pickle.dumps(). I do
see the word "Hello" in the string, so maybe the problem is during unpickling.

I've tried doing a copy, and even a deepcopy of a structured array numpy.void
entry, with no luck.

Is this a known limitation? Any suggestions on how I might get around this?
Pool.map() pickles each numpy.void entry as it iterates over the structured
array, before sending it to the next available process. My structured array only
needs to be read from by my multiple processes (one per core), so perhaps
there's a better way than sending copies of entries. Multithreading (using an
implementation of a ThreadPool I found somewhere) doesn't work because I'm
calling scipy.optimize.leastsq, which doesn't seem to release the GIL.

Thanks!

Martin
Robert Kern
2010-02-26 23:02:36 UTC
Post by Martin Spacek
I have a 1D structured ndarray with several different fields in the dtype. I'm
using multiprocessing.Pool.map() to iterate over this structured ndarray,
passing one entry (of type numpy.void) at a time to the function to be called by
each process in the pool. After much confusion about why this wasn't working, I
finally realized that unpickling a previously pickled numpy.void results in
garbage data:
 >>> import numpy as np
 >>> x = np.zeros((2,), dtype=('i4,f4,a10'))
 >>> x[:] = [(1,2.,'Hello'), (2,3.,"World")]
 >>> x
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])
 >>> x[0]
(1, 2.0, 'Hello')
 >>> type(x[0])
<type 'numpy.void'>
 >>> import pickle
 >>> s = pickle.dumps(x[0])
 >>> newx0 = pickle.loads(s)
 >>> newx0
(30917960, 1.6904535998413144e-38, '\xd0\xef\x1c\x1eZ\x03\x00d')
 >>> type(newx0)
<type 'numpy.void'>
 >>> newx0.dtype
dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])
 >>> x[0].dtype
dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])
 >>> np.version.version
'1.4.0'
This also seems to be the case for recarrays with their numpy.record entries.
I've tried using pickle and cPickle, with both the oldest and the newest
pickling protocol. This is in numpy 1.4 on win32 and win64, and numpy 1.3 on
32-bit linux. I'm using Python 2.6.4 in all cases. I also just tried it on
Python 2.5.2 with numpy 1.0.4. All have the same result, although the garbage
data is different each time.
I suppose numpy.void is as it suggests, a pointer to a specific place in memory.
No, it isn't. It's just a base dtype for all of the ad-hoc dtypes that
are created, for example, for record arrays.
Post by Martin Spacek
I'm just surprised that this pointer isn't dereferenced before pickling. Or is
it? I'm not skilled in interpreting the strings returned by pickle.dumps(). I do
see the word "Hello" in the string, so maybe the problem is during unpickling.
Use pickletools.dis() on the string. It helps to understand what is
going on. The data string is definitely correct:

In [25]: t = '\x01\x00\x00\x00\x00\x00\x00@Hello\x00\x00\x00\x00\x00'

In [29]: np.fromstring(t, x.dtype)
Out[29]:
array([(1, 2.0, 'Hello')],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])
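For the record, pickletools.dis() prints a pickle stream opcode by opcode, which makes it easy to see whether the raw field data actually made it into the pickle. A minimal sketch of that kind of inspection, using a plain tuple rather than the numpy scalar from the report:

```python
import io
import pickle
import pickletools

# Pickle a tuple with the same field values as the example record
s = pickle.dumps((1, 2.0, b'Hello'))

# Capture the opcode-by-opcode disassembly instead of printing to stdout
buf = io.StringIO()
pickletools.dis(s, out=buf)
listing = buf.getvalue()

print(listing)  # each opcode on its own line, ending with STOP
```

If the data were being lost at pickling time, the payload (here the b'Hello' argument of the bytes opcode) would be missing or wrong in the listing; if it is present, the problem must be on the unpickling side, as in this bug.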

The implementation of numpy.core.multiarray.scalar is doing something wrong.
Post by Martin Spacek
I've tried doing a copy, and even a deepcopy of a structured array numpy.void
entry, with no luck.
Is this a known limitation?
Nope. New bug! Thanks!
Post by Martin Spacek
Any suggestions on how I might get around this?
Pool.map() pickles each numpy.void entry as it iterates over the structured
array, before sending it to the next available process. My structured array only
needs to be read from by my multiple processes (one per core), so perhaps
there's a better way than sending copies of entries. Multithreading (using an
implementation of a ThreadPool I found somewhere) doesn't work because I'm
calling scipy.optimize.leastsq, which doesn't seem to release the GIL.
Pickling of complete arrays works. A quick workaround would be to send
rank-0 scalars:

Pool.map(func, map(np.asarray, x))

Or just tuples:

Pool.map(func, map(tuple, x))
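A sketch of both workarounds on the example array from the original report, but without a process pool, so the pickle round trip that Pool.map performs is done explicitly (S10 is spelled out instead of the a10 alias):

```python
import pickle
import numpy as np

x = np.zeros((2,), dtype='i4,f4,S10')
x[:] = [(1, 2.0, 'Hello'), (2, 3.0, 'World')]

# Workaround 1: rank-0 arrays instead of numpy.void scalars.
# These pickle correctly and keep named field access.
scalars = list(map(np.asarray, x))
y = pickle.loads(pickle.dumps(scalars[0]))
print(int(y['f0']), float(y['f1']))  # fields survive the round trip

# Workaround 2: plain tuples, which also pickle correctly
# but lose named field access (positional only).
tuples = list(map(tuple, x))
print(tuples[0][0])
```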
--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
Martin Spacek
2010-02-27 01:37:08 UTC
Post by Robert Kern
Post by Martin Spacek
Is this a known limitation?
Nope. New bug! Thanks!
Good. I'm not crazy after all :)
Post by Robert Kern
Pickling of complete arrays works. A quick workaround would be to send
Pool.map(func, map(np.asarray, x))
Pool.map(func, map(tuple, x))
Excellent! The first method works as a drop-in replacement for me. Seems better
than the second, because it conserves named field access. The only slight
difference is in how the fields come back:

>>> a = map(np.asarray, x)
>>> a[0]['f0']
array(1)
>>> x[0]['f0']
1

...but that doesn't seem to affect my code. Thanks a bunch for the quick solution!
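(If the 0-d wrapper around each field ever does matter, int() or ndarray.item() recovers the plain Python scalar; a small sketch, using the same example array as above with S10 spelled out:)

```python
import numpy as np

x = np.zeros((2,), dtype='i4,f4,S10')
x[:] = [(1, 2.0, 'Hello'), (2, 3.0, 'World')]

# Each entry becomes a 0-d structured array rather than a numpy.void
a = list(map(np.asarray, x))

v = a[0]['f0']            # 0-d array, displays as array(1)
print(int(v), v.item())   # both give the plain Python integer back
```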

Martin
Pauli Virtanen
2010-02-26 23:26:00 UTC
pe, 2010-02-26 kello 14:41 -0800, Martin Spacek kirjoitti:
[clip: pickling/unpickling numpy.void scalar objects]
Post by Martin Spacek
I suppose numpy.void is as it suggests, a pointer to a specific place in memory.
I'm just surprised that this pointer isn't dereferenced before pickling Or is
it? I'm not skilled in interpreting the strings returned by pickle.dumps(). I do
see the word "Hello" in the string, so maybe the problem is during unpickling.
No, the unpickled void scalar will own its data. The problem is that
either the data is not saved correctly (unlikely), or it is unpickled
incorrectly.

The relevant code path to look at is multiarraymodule:array_scalar ->
scalarapi.c:PyArray_Scalar. Needs some cgdb'ing to find out what's going
on there.

Please file a bug report on this.
Post by Martin Spacek
Is this a known limitation? Any suggestions on how I might get around this?
Pool.map() pickles each numpy.void entry as it iterates over the structured
array, before sending it to the next available process.
Use 1-element arrays instead of void scalars. Those will pickle
correctly. Perhaps reshaping your array to (N, 1) will be enough.
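A sketch of that suggestion: after reshaping to (N, 1), indexing the first axis yields a length-1 structured ndarray rather than a void scalar, and 1-element arrays pickle correctly (same example array as in the report, with S10 spelled out):

```python
import pickle
import numpy as np

x = np.zeros((2,), dtype='i4,f4,S10')
x[:] = [(1, 2.0, 'Hello'), (2, 3.0, 'World')]

x2 = x.reshape(-1, 1)   # shape (2, 1): each x2[i] is a 1-element array
row = x2[0]             # ndarray, not numpy.void

# The round trip Pool.map would perform, done explicitly
y = pickle.loads(pickle.dumps(row))
print(y['f0'][0], y['f2'][0])  # fields intact after unpickling
```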
--
Pauli Virtanen
Martin Spacek
2010-02-27 01:40:17 UTC
Post by Pauli Virtanen
No, the unpickled void scalar will own its data. The problem is that
either the data is not saved correctly (unlikely), or it is unpickled
incorrectly.
The relevant code path to look at is multiarraymodule:array_scalar ->
scalarapi.c:PyArray_Scalar. Needs some cgdb'ing to find out what's going
on there.
Please file a bug report on this.
OK, Done. See http://projects.scipy.org/numpy/ticket/1415

Martin
