[Numpy-discussion] Preserving NumPy views when pickling

Discussion:

Stephan Hoyer

2016-10-25 19:38:16 UTC

With a custom wrapper class, it's possible to preserve NumPy views when
pickling:
https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pickling

This can result in significant time/space savings with pickling views along
with base arrays and brings the behavior of NumPy more in line with Python
proper. Is this something that we can/should port into NumPy itself?

Nathaniel Smith

2016-10-25 20:07:52 UTC

Post by Stephan Hoyer
With a custom wrapper class, it's possible to preserve NumPy views when
pickling:
https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pickling
This can result in significant time/space savings with pickling views along
with base arrays and brings the behavior of NumPy more in line with Python
proper. Is this something that we can/should port into NumPy itself?

Concretely, what would you suggest should happen with:

base = np.zeros(100000000)
view = base[:10]

# case 1
pickle.dump(view, file)

# case 2
pickle.dump(base, file)
pickle.dump(view, file)

# case 3
pickle.dump(view, file)
pickle.dump(base, file)

?

--
Nathaniel J. Smith -- https://vorpus.org

Stephan Hoyer

2016-10-25 22:07:04 UTC

Post by Nathaniel Smith
base = np.zeros(100000000)
view = base[:10]
# case 1
pickle.dump(view, file)
# case 2
pickle.dump(base, file)
pickle.dump(view, file)
# case 3
pickle.dump(view, file)
pickle.dump(base, file)
?

I see what you're getting at here. We would need a rule for when to include
the base in the pickle and when not to. Otherwise, pickle.dump(view, file)
always contains data from the base, even when view is much smaller
than base.

The safe answer is "only use views in the pickle when base is already being
pickled", but that isn't possible to check unless all the arrays are
together in a custom container. So, this isn't really feasible for NumPy.

Robert Kern

2016-10-25 23:28:22 UTC

Post by Stephan Hoyer
I see what you're getting at here. We would need a rule for when to include
the base in the pickle and when not to. Otherwise, pickle.dump(view, file)
always contains data from the base, even when view is much smaller
than base.

The safe answer is "only use views in the pickle when base is already being
pickled", but that isn't possible to check unless all the arrays are
together in a custom container. So, this isn't really feasible for NumPy.

It would be possible with a custom Pickler/Unpickler since they already
keep track of objects previously (un)pickled. That would handle [base,
view] okay but not [view, base], so it's probably not going to be all that
useful outside of special situations. It would make a neat recipe, but I
probably would not provide it in numpy itself.
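
One way such a recipe could look, as a minimal sketch rather than anything
NumPy provides: a pickle.Pickler subclass (the name ViewAwarePickler is
hypothetical) that uses reducer_override (Python 3.8+) to remember owning
arrays it has already written and, for a later view of one of them, emits a
reference to the base plus byte offset, shape and strides. It assumes plain
np.ndarray objects, non-object dtypes, and a C-contiguous base.

```python
import io
import pickle

import numpy as np


class ViewAwarePickler(pickle.Pickler):
    """Hypothetical recipe: keep a view linked to its base when the base
    was already written earlier in the same pickle stream."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._owners = {}  # id(array) -> array, for owning arrays seen so far

    def reducer_override(self, obj):
        # Only handle plain ndarrays without Python-object elements.
        if type(obj) is not np.ndarray or obj.dtype.hasobject:
            return NotImplemented
        base = obj.base
        if base is None:
            # An owning array: remember it, then let NumPy pickle it as usual.
            self._owners[id(obj)] = obj
            return NotImplemented
        if id(base) in self._owners and base.flags.c_contiguous:
            def ptr(a):
                return a.__array_interface__["data"][0]
            offset = ptr(obj) - ptr(base)  # byte offset of the view in base
            # Rebuild as np.ndarray(shape, dtype, buffer, offset, strides);
            # pickle emits a memo reference for `base`, not a second copy.
            return np.ndarray, (obj.shape, obj.dtype, base, offset, obj.strides)
        # Base not pickled yet (e.g. [view, base] order): fall back to a copy.
        return NotImplemented


def roundtrip(obj):
    buf = io.BytesIO()
    ViewAwarePickler(buf).dump(obj)
    return pickle.loads(buf.getvalue())


base = np.arange(10.0)
view = base[2:8:2]

base2, view2 = roundtrip((base, view))   # base first: link is kept
view3, base3 = roundtrip((view, base))   # view first: link is lost
print(np.shares_memory(base2, view2), np.shares_memory(base3, view3))
```

Unpickling needs no custom Unpickler here, because the view is rebuilt by
calling np.ndarray directly. The [view, base] order still produces two
independent arrays, which is the limitation described above.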

--
Robert Kern

Matthew Harrigan

2016-10-26 00:09:09 UTC

It seems pickle keeps track of references for basic python types.

x = [1]
y = [x]
x,y = pickle.loads(pickle.dumps((x,y)))
x.append(2)
print(y)

[[1,2]]

NumPy arrays are different: references are forgotten after
pickle/unpickle, so shared objects do not remain shared. Based on the quote
below, this could be considered a bug in numpy/pickle.

Object sharing (references to the same object in different places): This is
similar to self-referencing objects; pickle stores the object once, and
ensures that all other references point to the master copy. Shared objects
remain shared, which can be very important for mutable objects. link
<https://docs.python.org/2.0/lib/module-pickle.html>

Another example with ndarrays:

x = np.arange(5)
y = x[::-1]
x, y = pickle.loads(pickle.dumps((x, y)))
x[0] = 9
print(y)

[4 3 2 1 0]

Before pickling, the two arrays share the exact same object for the data
buffer (although "object" might not be the right word here).
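
The sharing can be checked directly. The following is a small sketch using
np.shares_memory, which is not mentioned in the post itself, to compare the
two arrays before and after a pickle round trip:

```python
import pickle

import numpy as np

x = np.arange(5)
y = x[::-1]                      # a view on x's buffer
shared_before = np.shares_memory(x, y)

x2, y2 = pickle.loads(pickle.dumps((x, y)))
shared_after = np.shares_memory(x2, y2)

x2[0] = 9                        # no longer visible through y2
print(shared_before, shared_after, y2)
```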

Robert Kern

2016-10-26 00:29:54 UTC

On Tue, Oct 25, 2016 at 5:09 PM, Matthew Harrigan wrote:

Post by Matthew Harrigan
It seems pickle keeps track of references for basic python types.
x = [1]
y = [x]
x,y = pickle.loads(pickle.dumps((x,y)))
x.append(2)
print(y)

[[1,2]]

Numpy arrays are different but references are forgotten after
pickle/unpickle. Shared objects do not remain shared. Based on the quote
below it could be considered a bug with numpy/pickle.

Not a bug, but an explicit design decision on numpy's part.

--
Robert Kern

Feng Yu

2016-10-26 02:05:39 UTC

Hi,

Just another perspective: `base` and `data` in PyArrayObject are two
separate variables.

base can point to any PyObject, but it is `data` that defines where
data is accessed in memory.

1. There is no clear way to pickle a pointer (`data`) in a meaningful
way. In order for `data` member to make sense we still need to
'readout' the values stored at `data` pointer in the pickle.

2. By definition, base is not necessarily a numpy array; it is just
some other object for managing the memory.

3. One can surely pickle the `base` object as a reference, but it is
useless if the data memory has been reconstructed independently during
unpickling.

4. Unless there is a clear way to notify the referencing numpy array of
the new data pointer. There probably isn't.

BTW, is the stride information lost during pickling, too? The
behavior should probably be documented if it isn't already.

Yu

Robert Kern

2016-10-26 02:39:14 UTC

Post by Feng Yu
Hi,
Just another perspective: `base` and `data` in PyArrayObject are two
separate variables.
base can point to any PyObject, but it is `data` that defines where
data is accessed in memory.
1. There is no clear way to pickle a pointer (`data`) in a meaningful
way. In order for `data` member to make sense we still need to
'readout' the values stored at `data` pointer in the pickle.
2. By definition, base is not necessarily a numpy array; it is just
some other object for managing the memory.

In general, yes, but most often it's another ndarray, and the child is
related to the parent by a slice operation that could be computed by
comparing the `data` tuples. The exercise here isn't to always represent
the general case in this way, but to see what can be done opportunistically
and if that actually helps solve a practical problem.
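
To illustrate the opportunistic case, here is a sketch of recovering the
slice from the data pointers and strides. The helper name recover_slice is
hypothetical, and the sketch assumes non-empty 1-D arrays of the same dtype
where the parent owns its memory; the addresses come from the first element
of `__array_interface__["data"]`.

```python
import numpy as np


def recover_slice(base, view):
    """Return a slice s with base[s] covering the same memory as view.

    Hypothetical helper; handles non-empty 1-D arrays only.
    """
    if base.ndim != 1 or view.ndim != 1 or len(view) == 0:
        raise ValueError("sketch handles non-empty 1-D arrays only")

    def ptr(a):
        # First element of the (address, read_only) tuple.
        return a.__array_interface__["data"][0]

    itemstride = base.strides[0]
    start = (ptr(view) - ptr(base)) // itemstride
    step = view.strides[0] // itemstride
    stop = start + step * len(view)
    # A negative stop would be read as "from the end", so use None.
    return slice(start, stop if stop >= 0 else None, step)


base = np.arange(10)
views = (base[2:8], base[::3], base[::-1], base[7:1:-2])
recovered = [recover_slice(base, v) for v in views]
for v, s in zip(views, recovered):
    print(s, np.array_equal(base[s], v))
```

A pickler could store such a slice next to a reference to the parent instead
of copying the child's data, but only when the parent is part of the same
pickle.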

Post by Feng Yu
3. One can surely pickle the `base` object as a reference, but it is
useless if the data memory has been reconstructed independently during
unpickling.
4. Unless there is a clear way to notify the referencing numpy array of
the new data pointer. There probably isn't.
BTW, is the stride information lost during pickling, too? The
behavior should probably be documented if it isn't already.

The stride information may be lost, yes. We reserve the right to retain it,
though (for example, if .T is contiguous then we might well serialize the
transposed data linearly and return a view on that data upon
deserialization). I don't believe that we guarantee that the unpickled
result is contiguous.
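
A quick way to see what a given NumPy version does is to compare strides
before and after a round trip. This is only an observation sketch; as noted
above, the resulting strides are not a documented guarantee, so the code
prints them rather than relying on particular values.

```python
import pickle

import numpy as np

a = np.arange(12).reshape(3, 4)
cases = {
    "strided": a[:, ::2],   # non-contiguous view
    "transposed": a.T,      # Fortran-contiguous view
}

results = {}
for name, arr in cases.items():
    out = pickle.loads(pickle.dumps(arr))
    results[name] = out
    # Values survive; whether the strides do depends on the NumPy version.
    print(name, arr.strides, "->", out.strides, np.array_equal(arr, out))
```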

--
Robert Kern

Nathaniel Smith

2016-10-26 03:36:29 UTC

On Tue, Oct 25, 2016 at 5:09 PM, Matthew Harrigan

Post by Matthew Harrigan
It seems pickle keeps track of references for basic python types.
x = [1]
y = [x]
x,y = pickle.loads(pickle.dumps((x,y)))
x.append(2)
print(y)

[[1,2]]

Yes, but the problem is: suppose I have a 10 gigabyte array, and then
take a 20 byte slice of it, and then pickle that slice. Do you expect
the pickle file to be 20 bytes, or 10 gigabytes? Both options are
possible, but you have to pick one, and numpy picks 20 bytes. The
advantage is obviously that you don't have mysterious 10 gigabyte
pickle files; the disadvantage is that you can't reconstruct the view
relationships afterwards. (You might think: oh, but we can be clever,
and only record the view relationships if the user pickles both
objects together. But while pickle might know whether the user is
pickling both objects together, it unfortunately doesn't tell numpy,
so we can't really do anything clever or different in this case.)
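
The size trade-off is easy to observe on a smaller scale. The sketch below
uses a 1,000,000-element array instead of a 10 gigabyte one (an arbitrary
choice to keep it quick) and compares the pickle sizes of a small view and
of its base:

```python
import pickle

import numpy as np

base = np.zeros(1_000_000)   # about 8 MB of float64 data
view = base[:10]             # 80 bytes of data

view_bytes = len(pickle.dumps(view))
base_bytes = len(pickle.dumps(base))
print(view_bytes, base_bytes)
```

The pickle of the view holds only the view's own elements plus a small
header, not the base's buffer.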

-n

--
Nathaniel J. Smith -- https://vorpus.org