[Numpy-discussion] Memory mapping and NPZ files

Discussion:

Mathieu Dubois

2015-12-09 14:51:55 UTC

Dear all,

If I am correct, using mmap_mode with Npz files has no effect, i.e.:

    f = np.load("data.npz", mmap_mode="r")
    X = f['X']

will load all the data into memory.

Can somebody confirm that?
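
One way to check this empirically is to save the same array as both .npy and .npz, reload each with mmap_mode="r", and inspect the type of what comes back. The following is a minimal sketch, not an authoritative test: the helper name check_mmap_behaviour and the file names are made up for illustration, and the result for .npz depends on the installed NumPy version.

```python
import os
import tempfile

import numpy as np


def check_mmap_behaviour(directory):
    """Save one array as .npy and .npz, then reload both with mmap_mode="r".

    Hypothetical helper for illustration; it only reports what it observes.
    """
    data = np.arange(12, dtype=np.float64).reshape(3, 4)
    npy_path = os.path.join(directory, "data.npy")
    npz_path = os.path.join(directory, "data.npz")
    np.save(npy_path, data)
    np.savez(npz_path, X=data)

    from_npy = np.load(npy_path, mmap_mode="r")
    with np.load(npz_path, mmap_mode="r") as f:
        from_npz = f["X"]

    result = {
        "npy_is_memmap": isinstance(from_npy, np.memmap),
        "npz_is_memmap": isinstance(from_npz, np.memmap),
        "npy_equal": bool(np.array_equal(from_npy, data)),
        "npz_equal": bool(np.array_equal(from_npz, data)),
    }
    # Drop the mapping so the directory can be removed on every platform.
    del from_npy
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(check_mmap_behaviour(tmp))
```

If "npz_is_memmap" comes back False while "npy_is_memmap" is True, the mmap_mode argument was ignored for the archive.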

If I'm correct, the mmap_mode argument could be passed to the NpzFile
class which could in turn perform the correct operation. One way to
handle that would be to use the ZipFile.extract method to write the Npy
file on disk and then load it with numpy.load with the mmap_mode
argument. Note that the user will have to remove the file to reclaim
disk space (I guess that's OK).

One problem that could arise is that the extracted Npy file can be large
(that is the whole point of using memory mapping), so it may be useful
to offer some control over where this file is extracted (for instance,
/tmp can be too small to hold it). numpy.load could offer a new option
for that (passed to ZipFile.extract).
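
The extract-then-map idea can be tried by hand today, outside numpy.load. The sketch below is only an illustration of the proposal, not an existing NumPy feature: the function name extract_and_memmap and its extract_dir parameter are hypothetical. It relies on np.savez storing each array as "<name>.npy" inside the archive.

```python
import os
import zipfile

import numpy as np


def extract_and_memmap(npz_path, name, extract_dir, mmap_mode="r"):
    """Extract one array from an .npz archive and memory-map the result.

    Hypothetical helper. Returns (array, extracted_path); the caller is
    responsible for deleting extracted_path to reclaim the disk space.
    """
    member = name + ".npy"  # np.savez stores each array as "<name>.npy"
    with zipfile.ZipFile(npz_path) as archive:
        # ZipFile.extract decompresses the member if needed and returns
        # the path of the file it wrote under extract_dir.
        extracted_path = archive.extract(member, path=extract_dir)
    return np.load(extracted_path, mmap_mode=mmap_mode), extracted_path
```

Usage would look like `X, path = extract_and_memmap("data.npz", "X", "/some/large/scratch/dir")`, followed by `os.remove(path)` once X is no longer needed. The scratch directory shown here is a placeholder.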

Does it make sense?

Thanks in advance,
Mathieu

Sturla Molden

2015-12-10 12:55:55 UTC

Post by Mathieu Dubois
Does it make sense?

No. Memory mapping should just memory map, not do all sorts of crap.

Sturla

Mathieu Dubois

2015-12-10 19:06:45 UTC

Post by Sturla Molden

Post by Mathieu Dubois
Does it make sense?

No. Memory mapping should just memory map, not do all sorts of crap.

The point is precisely that, you can't do memory mapping with Npz files
(while it works with Npy files).

Mathieu

Sturla Molden

2015-12-11 10:22:09 UTC

Post by Mathieu Dubois
The point is precisely that, you can't do memory mapping with Npz files
(while it works with Npy files).

The operating system can memory map any file. But as npz-files are
compressed, you will need to uncompress the contents in your memory mapping
to make sense of it. I would suggest you use PyTables instead of npz-files.
It allows on the fly compression and uncompression (via blosc) and will
probably do what you want.

Sturla

Mathieu Dubois

2015-12-12 18:53:50 UTC

Post by Sturla Molden

Post by Mathieu Dubois
The point is precisely that, you can't do memory mapping with Npz files
(while it works with Npy files).

The operating system can memory map any file. But as npz-files are
compressed, you will need to uncompress the contents in your memory mapping
to make sense of it.

We agree on that. The goal is to be able to create a np.memmap array
from an Npz file.

Post by Sturla Molden
I would suggest you use PyTables instead of npz-files.
It allows on the fly compression and uncompression (via blosc) and will
probably do what you want.

Yes, I know I can use other solutions. The point is that np.load silently
ignores the mmap_mode option, so I wanted to discuss ways to improve this.

Mathieu

Nathaniel Smith

2015-12-12 22:22:46 UTC

Post by Sturla Molden

Post by Mathieu Dubois
The point is precisely that, you can't do memory mapping with Npz files
(while it works with Npy files).

The operating system can memory map any file. But as npz-files are
compressed, you will need to uncompress the contents in your memory mapping
to make sense of it.

We agree on that. The goal is to be able to create a np.memmap array from
an Npz file.

Post by Sturla Molden
I would suggest you use PyTables instead of npz-files.
It allows on the fly compression and uncompression (via blosc) and will
probably do what you want.

Yes, I know I can use other solutions. The point is that np.load silently
ignores the mmap_mode option, so I wanted to discuss ways to improve this.

I can see a good argument for transitioning to a rule where mmap=False
doesn't mmap, mmap=True mmaps if the file is uncompressed and raises an
error for compressed files, and mmap="if-possible" gives the current
behavior.

(It's even possible that the current code would already accept
"if-possible" as an alias for True, which would make the transition easier.)

Or maybe "never"/"always"/"if-possible" would be better for type
consistency reasons, while deprecating the use of bools altogether. But
this transition might be a bit more of a hassle, since these definitely
won't work on older numpy's.
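
The three-way rule described above can be written down as a small decision function. This is only a sketch of the proposed semantics: the name should_mmap and the policy strings are taken from the proposal in this message and are not part of any NumPy API.

```python
def should_mmap(policy, compressed):
    """Decide whether to memory-map under the proposed three-way policy.

    policy: "never", "always", or "if-possible" (hypothetical values).
    compressed: True if the file's contents are stored compressed.
    Returns True to memory-map, False to read into memory.
    """
    if policy not in ("never", "always", "if-possible"):
        raise ValueError(f"unknown mmap policy: {policy!r}")
    if policy == "never":
        return False
    if not compressed:
        return True
    if policy == "always":
        # Explicit request that cannot be honoured: fail loudly.
        raise ValueError("cannot memory-map a compressed file")
    # "if-possible" on a compressed file: fall back silently, which
    # matches the current behaviour described in this thread.
    return False
```

The only case that changes relative to today is "always" on a compressed file, which raises instead of silently loading everything.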

Silently creating a massive temporary file doesn't seem like a great idea
to me in any case. Creating a temporary file + mmaping it is essentially
equivalent to just loading the data into swappable RAM, except that the
swap case is guaranteed not to accidentally leave a massive temp file lying
around afterwards.

-n

Sebastian Berg

2015-12-10 14:35:38 UTC

Post by Mathieu Dubois
Dear all,
f = np.load("data.npz", mmap_mode="r")
X = f['X']
will load all the data in memory.

My take is that no, I do not want implicit extraction/copy of the
file.
However, npz files are not necessarily compressed, and I expect that
memory-mapping is possible for the uncompressed kind.
If that is possible, it would ideally work for uncompressed npz files
and raise an error that suggests manually uncompressing the file when
mmap_mode is given for a compressed one.
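
For an uncompressed archive, the array bytes sit contiguously inside the zip file, so they can be mapped in place once their offset is known. The sketch below shows one way that might be done; it is not how numpy.load behaves. The function name memmap_npz_member is hypothetical, only NPY format versions 1.0 and 2.0 are handled, and the offset arithmetic assumes the standard 30-byte ZIP local file header followed by the file name and extra field.

```python
import struct
import zipfile

import numpy as np
from numpy.lib import format as npy_format


def memmap_npz_member(npz_path, name, mode="r"):
    """Memory-map array `name` directly inside an uncompressed .npz file.

    Hypothetical helper; raises ValueError for compressed members.
    """
    member = name + ".npy"
    with zipfile.ZipFile(npz_path) as archive:
        info = archive.getinfo(member)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError(
            f"{member!r} is compressed; extract it to a .npy file first"
        )

    with open(npz_path, "rb") as fp:
        # Local file header: 30 fixed bytes, with the name length and the
        # extra-field length as little-endian uint16 at bytes 26..29.
        fp.seek(info.header_offset)
        local_header = fp.read(30)
        if local_header[:4] != b"PK\x03\x04":
            raise ValueError("unexpected ZIP local file header")
        name_len, extra_len = struct.unpack("<HH", local_header[26:30])
        fp.seek(info.header_offset + 30 + name_len + extra_len)

        # The member's payload starts with an ordinary NPY header.
        version = npy_format.read_magic(fp)
        if version == (1, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        elif version == (2, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(fp)
        else:
            raise ValueError(f"unsupported NPY version {version}")
        data_offset = fp.tell()  # first byte of the raw array data

    if dtype.hasobject:
        raise ValueError("object arrays cannot be memory-mapped")

    return np.memmap(
        npz_path,
        dtype=dtype,
        mode=mode,
        offset=data_offset,
        shape=shape,
        order="F" if fortran_order else "C",
    )
```

With a file written by np.savez, `memmap_npz_member("data.npz", "X")` would return an np.memmap over the archive itself, with no temporary file involved; a file written by np.savez_compressed triggers the error path instead.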

- Sebastian

Post by Mathieu Dubois
Can somebody confirm that?
If I'm correct, the mmap_mode argument could be passed to the NpzFile
class which could in turn perform the correct operation. One way to
handle that would be to use the ZipFile.extract method to write the
Npy file on disk and then load it with numpy.load with the mmap_mode
argument. Note that the user will have to remove the file to reclaim
disk space (I guess that's OK).
One problem that could arise is that the extracted Npy file can be
large (it's the purpose of using memory mapping) and therefore it may
be useful to offer some control on where this file is extracted (for
instance /tmp can be too small to extract the file here). numpy.load
could offer a new option for that (passed to ZipFile.extract).
Does it make sense?
Thanks in advance,
Mathieu
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion

Mathieu Dubois

2015-12-10 19:07:16 UTC

Post by Sebastian Berg

Post by Mathieu Dubois
Dear all,
f = np.load("data.npz", mmap_mode="r")
X = f['X']
will load all the data in memory.

My take is that no, I do not want implicit extraction/copy of the
file.

I agree it's controversial.

Post by Sebastian Berg
However, npz files are not necessarily compressed, and I expect that
memory-mapping is possible for the uncompressed kind.
If that is possible, it would ideally work for uncompressed npz files
and raise an error that suggests manually uncompressing the file when
mmap_mode is given for a compressed one.

I got the same idea this afternoon. I will test that soon.

Thanks for your constructive answer!
Mathieu

Post by Sebastian Berg
- Sebastian

Erik Bray

2015-12-11 21:35:28 UTC

On Wed, Dec 9, 2015 at 9:51 AM, Mathieu Dubois

Post by Mathieu Dubois
Dear all,
f = np.load("data.npz", mmap_mode="r")
X = f['X']
will load all the data in memory.
Can somebody confirm that?
If I'm correct, the mmap_mode argument could be passed to the NpzFile class
which could in turn perform the correct operation. One way to handle that
would be to use the ZipFile.extract method to write the Npy file on disk and
then load it with numpy.load with the mmap_mode argument. Note that the user
will have to remove the file to reclaim disk space (I guess that's OK).
One problem that could arise is that the extracted Npy file can be large
(it's the purpose of using memory mapping) and therefore it may be useful to
offer some control on where this file is extracted (for instance /tmp can be
too small to extract the file here). numpy.load could offer a new option for
that (passed to ZipFile.extract).

I have struggled for a long time with a similar (albeit more obscure)
problem with PyFITS / astropy.io.fits when it comes to supporting
memory-mapping of compressed FITS files. For those unaware, FITS is a
file format used primarily in astronomy.

I have all kinds of wacky ideas for optimizing this, but at the moment
when you load data from a compressed FITS file with memory-mapping
enabled, obviously there's not much benefit because the contents of
the file are uncompressed in memory (there is a *little* benefit in
that the compressed data is mmap'd, but the compressed data is
typically much smaller than the uncompressed data).

Currently, in this case, I just issue a warning when the user
explicitly requests mmap=True but won't get much benefit from it.
Maybe np.load could do the same, but I don't have a strong opinion
about it. (I only added the warning in PyFITS because a user
requested it and was kind enough to provide a patch; it seemed
reasonable.)
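
A warning of this kind could be prototyped as a thin wrapper around np.load without touching NumPy itself. The following is a minimal sketch under that assumption: the wrapper name load_with_mmap_warning and the wording of the message are made up, and this is not how PyFITS implements its warning.

```python
import warnings
import zipfile

import numpy as np


def load_with_mmap_warning(path, mmap_mode=None):
    """Call np.load, warning first if mmap_mode is given for a zip archive.

    Hypothetical wrapper for illustration only.
    """
    if mmap_mode is not None and zipfile.is_zipfile(path):
        warnings.warn(
            "mmap_mode was requested for an .npz archive; the arrays may "
            "be read fully into memory instead of being memory-mapped",
            UserWarning,
            stacklevel=2,
        )
    return np.load(path, mmap_mode=mmap_mode)
```

Plain .npy files pass straight through without a warning, so existing memory-mapped workflows are unaffected.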

Erik