Discussion:
[Numpy-discussion] About the npz format
onefire
2014-04-16 18:26:37 UTC
Permalink
Hi all,

I have been playing with the idea of using Numpy's binary format as a
lightweight alternative to HDF5 (which I believe is the "right" way to go
if one does not have a problem with the dependency).

I am pretty happy with the npy format, but the npz format seems to be
broken as far as performance is concerned (or I am missing something
obvious!). The following IPython session illustrates the issue:

In [1]: import numpy as np

In [2]: x = np.linspace(1, 10, 50000000)

In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms

In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s

I can inspect the files to verify that they contain the same data, and I
can change the example, but this seems to always hold (I am running Arch
Linux, but I've done the test on other machines too): for bigger arrays,
the npz format seems to add an unbelievable amount of overhead.

Looking at Numpy's code, it looks like the real work is being done by
Python's zipfile module, and I suspect that all the extra time is spent
computing the crc32. Am I correct in my assumption (I am not familiar with
zipfile's internals)? Or perhaps I am doing something really dumb and there
is an easy way to speed things up?

Assuming that I am correct, my next question is: why compute the crc32 at
all? I mean, I know that it is part of what defines a "zip file", but is it
really necessary for a npz file to be a (compliant) zip file? If, for
example, I open the resulting npz file with a hex editor and insert a
bogus crc32, np.load will happily load the file anyway (Gnome's Archive
Manager will do the same). To me this suggests that the fact that npz files
are zip files is not that important.
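As an aside, the zip-level metadata, including each member's stored CRC, can be inspected with the standard zipfile module. A minimal sketch:

```python
import zipfile
import zlib
import numpy as np

x = np.arange(10.0)
np.savez("x.npz", data=x)

with zipfile.ZipFile("x.npz") as zf:
    for info in zf.infolist():
        # CRC recorded in the zip metadata vs. one recomputed from the bytes
        stored_crc = info.CRC
        actual_crc = zlib.crc32(zf.read(info.filename))
        print(info.filename, hex(stored_crc), hex(actual_crc))
```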

Perhaps, people think that the ability to browse arrays and extract
individual ones like they would do with a regular zip file is really
important, but reading the little documentation that I found, I got the
impression that npz files are zip files just because this was the easiest
way to have multiple arrays in the same file. But my main point is: it
should be fairly simple to make npz files much more efficient with simple
changes like not computing checksums (or using a different algorithm like
adler32).

Let me know what you think about this. I've searched around the internet,
and on places like Stackoverflow, it seems that the standard answer is: you
are doing it wrong, forget Numpy's format and start using hdf5! Please do
not give that answer. Like I said in the beginning, I am well aware of
hdf5 and I use it on my "production code" (on C++). But I believe that
there should be a lightweight alternative (right now, to use hdf5 I need to
have installed the C library, the C++ wrappers, and the h5py library to
play with the data using Python, that is a bit too heavy for my needs). I
really like Numpy's format (if anything, it makes me feel better knowing
that it is
so easy to reverse engineer it, while the hdf5 format is very complicated),
but the (apparent) poor performance of npz files is a deal breaker.

Gilberto
Valentin Haenel
2014-04-16 20:57:30 UTC
Permalink
Hi Gilberto,
Post by onefire
I have been playing with the idea of using Numpy's binary format as a
lightweight alternative to HDF5 (which I believe is the "right" way to do
if one does not have a problem with the dependency).
I am pretty happy with the npy format, but the npz format seems to be
broken as far as performance is concerned (or I am missing obvious!). The
In [1]: import numpy as np
In [2]: x = np.linspace(1, 10, 50000000)
In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms
In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s
If it is just serialization speed, you may want to look at Bloscpack:

https://github.com/Blosc/Bloscpack

It only has blosc/python-blosc and Numpy as dependencies.

You can use it on Numpy arrays like so:

https://github.com/Blosc/Bloscpack#numpy

(those are instructions for master you are looking at)

And it can certainly be faster than NPZ and sometimes faster than NPY --
depending of course on your system and the type of data -- and also more
lightweight than HDF5.

I wrote an article about it with some benchmarks, also vs NPY/NPZ here:

https://github.com/euroscipy/euroscipy_proceedings/tree/master/papers/23_haenel

Since it is not yet officially published, you can find a compiled PDF
draft I just made at:

http://fldmp.zetatech.org/haenel_bloscpack_euroscipy2013_ac25c19cb6.pdf

Perhaps it is interesting for you.
Post by onefire
I can inspect the files to verify that they contain the same data, and I
can change the example, but this seems to always hold (I am running Arch
Linux, but I've done the test on other machines too): for bigger arrays,
the npz format seems to add an unbelievable amount of overhead.
You mean time or space wise? In my experience NPZ is fairly slow but
can yield some good compression ratios, depending on the LZ-complexity
of the input data. In fact, AFAIK, NPZ uses the DEFLATE algorithm as
implemented by ZLIB, which is fairly slow and not optimized for
compression/decompression speed. FYI: if you really want ZLIB, Blosc
also supports using it internally, which is nice.
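To get a rough feel for DEFLATE's cost on numeric data, you can time zlib directly on the raw array bytes (a sketch; absolute timings vary by machine and data):

```python
import time
import zlib
import numpy as np

x = np.linspace(1, 10, 1000000)
raw = x.tobytes()

t0 = time.perf_counter()
compressed = zlib.compress(raw, 6)  # level 6 is zlib's default DEFLATE level
dt = time.perf_counter() - t0

print("compressed %d -> %d bytes in %.3f s" % (len(raw), len(compressed), dt))
# round-trip check: DEFLATE is lossless
assert zlib.decompress(compressed) == raw
```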
Post by onefire
Looking at Numpy's code, it looks like the real work is being done by
Python's zipfile module, and I suspect that all the extra time is spent
computing the crc32. Am I correct in my assumption (I am not familiar with
zipfile's internals)? Or perhaps I am doing something really dumb and there
is an easy way to speed things up?
I am guessing here, but a checksum *should* be fairly fast. I would
guess it is at least in part due to use of DEFLATE.
Post by onefire
Assuming that I am correct, my next question is: why compute the crc32 at
all? I mean, I know that it is part of what defines a "zip file", but is it
really necessary for a npz file to be a (compliant) zip file? If, for
example, I open the resulting npz file with a hex editor, and insert a
bogus crc32, np.load will happily load the file anyway (Gnome's Archive
Manager will do the same). To me this suggests that the fact that npz files
are zip files is not that important.
Well, the good news here is that Bloscpack supports adding checksums to
secure the integrity of the compressed data. You can choose between
many, including CRC32, ADLER32 and even sha512.
Post by onefire
Perhaps, people think that the ability to browse arrays and extract
individual ones like they would do with a regular zip file is really
important, but reading the little documentation that I found, I got the
impression that npz files are zip files just because this was the easiest
way to have multiple arrays in the same file. But my main point is: it
should be fairly simple to make npz files much more efficient with simple
changes like not computing checksums (or using a different algorithm like
adler32)
Ah, so you want to store multiple arrays in a single file. I must
disappoint you there, Bloscpack doesn't support that right now. Although
it is in principle possible to achieve this.
Post by onefire
Let me know what you think about this. I've searched around the internet,
and on places like Stackoverflow, it seems that the standard answer is: you
are doing it wrong, forget Numpy's format and start using hdf5! Please do
not give that answer. Like I said in the beginning, I am well aware of
hdf5 and I use it on my "production code" (on C++). But I believe that
there should be a lightweight alternative (right now, to use hdf5 I need to
have installed the C library, the C++ wrappers, and the h5py library to
play with the data using Python, that is a bit too heavy for my needs). I
really like Numpy's format (if anything, it makes me feel better knowing
that it is
so easy to reverse engineer it, while the hdf5 format is very complicated),
but the (apparent) poor performance of npz files if a deal breaker.
Well, I hope that Bloscpack is lightweight enough for you. As I said the
only dependency is blosc/python-blosc which can be compiled using
a C compiler (C++ if you want all the additional codecs) and the Python
headers.

Hope it helps and let me know what you think!

V-
Nathaniel Smith
2014-04-16 21:03:55 UTC
Permalink
crc32 is extremely fast, and I think zip might use adler32 instead which is
even faster. OTOH compression is incredibly slow, unless you're using one
of the 'just a little bit of compression' formats like blosc or lzo1. If
your npz files are compressed then this is certainly the culprit.

The zip format supports storing files without compression. Maybe what you
want is an option to use this with .npz?
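As it turns out, np.savez already stores members uncompressed by default, and np.savez_compressed is the DEFLATE variant. A quick check, sketched with the stdlib zipfile module:

```python
import zipfile
import numpy as np

x = np.arange(1000.0)
np.savez("stored.npz", data=x)            # default: no compression
np.savez_compressed("deflated.npz", data=x)  # DEFLATE via zlib

for fname in ("stored.npz", "deflated.npz"):
    with zipfile.ZipFile(fname) as zf:
        info = zf.infolist()[0]
        # compress_type: 0 == ZIP_STORED, 8 == ZIP_DEFLATED
        print(fname, info.compress_type)
```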

-n
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
onefire
2014-04-17 00:57:36 UTC
Permalink
Valentin Haenel, Bloscpack definitely looks interesting but I need to take
a careful look first. I will let you know if I like it. Thanks for the
suggestion!

I think you and Nathaniel Smith misunderstood my questions (my fault, since
I did not explain myself well!).
First, Numpy's savez will not do any compression by default. It will simply
store the npy file normally. The documentation suggests so and I can open
the resulting file to confirm it.
Also, if you run the commands that I specified in my previous post, you can
see that the resulting files have sizes 400000080 (x.npy) and 400000194
(x.npz). The npy header takes 80 bytes (it actually needs less than that,
but it is padded to be divisible by 16). The npz file that saves the same
array takes 114 extra bytes (for the zip file metadata), so the space
overhead is pretty small.
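These size claims are easy to check on a smaller array (a sketch; the exact header size is version-dependent, since newer NumPy versions pad the npy header to a larger alignment):

```python
import os
import numpy as np

x = np.linspace(1, 10, 100000)
np.save("x.npy", x)
np.savez("x.npz", data=x)

npy_size = os.path.getsize("x.npy")
npz_size = os.path.getsize("x.npz")

# npy header size (padded; 80 bytes in older NumPy, larger in newer versions)
header_bytes = npy_size - x.nbytes
# zip metadata wrapped around the single npy member
zip_overhead = npz_size - npy_size
print(header_bytes, zip_overhead)
```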
What I cannot understand is why savez takes more than 10 times longer than
saving the data to a npy file. The only reason that I could come up with
was the computation of the crc32.
BUT it might be more than this...
This afternoon I found out about this Julia package (
https://github.com/fhs/NPZ.jl) to manipulate Numpy files. I did a few tests
and it seems to work correctly. It becomes interesting when I do the
npy-npz comparison using Julia.
Here is the code that I used:

using NPZ

function write_npy(x)
    tic()
    npzwrite("data.npy", x)
    toc()
end

function write_npz(x)
    tic()
    npzwrite("data.npz", (ASCIIString => Any)["data" => x])
    toc()
end

x = linspace(1, 10, 50000000)

write_npy(x)  # this prints: elapsed time: 0.417742163 seconds
write_npz(x)  # this prints: elapsed time: 0.882226675 seconds

The Julia timings (tested with Julia 0.3) are closer to what I would
expect. Notice that the time to save the npy file is very similar to the
one that I got with Numpy's save function (see my previous post), but the
"npz overhead" only adds half a second.

So now I think there are two things going on:
1) It is wasteful to compute the crc32. At a minimum I would like to either
have the option to choose a different, faster checksum (like adler32) or to
turn that off (I prefer the second option, because if I am worried about
the integrity of the data, I will likely compute the sha512sum of the
entire file anyway).
2) The Python implementation is inefficient (to be honest, I just found out
about the Julia package and I cannot guarantee anything about its quality,
but if I compute a crc32 of 0.5 GB of data from C code, it takes less than
a second!). My guess is that the problem is in the zipfile module, but like
I said before, I do not know the details of what it is doing.
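Point 1) can be tested in isolation by timing the stdlib checksums over a large buffer (a rough sketch; absolute numbers depend on the zlib build):

```python
import time
import zlib
import numpy as np

buf = np.linspace(1, 10, 5000000).tobytes()  # ~40 MB of float64 data

for name, fn in (("crc32", zlib.crc32), ("adler32", zlib.adler32)):
    t0 = time.perf_counter()
    checksum = fn(buf)
    dt = time.perf_counter() - t0
    print(name, hex(checksum), "%.4f s" % dt)
```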

Let me know what you think.

Gilberto
Nathaniel Smith
2014-04-17 09:23:07 UTC
Permalink
Post by onefire
What I cannot understand is why savez takes more than 10 times longer
than saving the data to a npy file. The only reason that I could come up
with was the computation of the crc32.

We can all make guesses but the solution is just to profile it :-). %prun
in ipython (and then if you need more granularity installing line_profiler
is useful).
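Outside IPython, the equivalent of %prun is a few lines of cProfile (a sketch):

```python
import cProfile
import io
import pstats
import numpy as np

x = np.linspace(1, 10, 1000000)

# profile only the savez call, then print the hottest functions
pr = cProfile.Profile()
pr.enable()
np.savez("x.npz", data=x)
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("tottime").print_stats(10)
print(out.getvalue())
```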

-n
onefire
2014-04-17 19:30:59 UTC
Permalink
Hi Nathaniel,

Thanks for the suggestion. I did profile the program before, just not using
Python.

But following your suggestion, I used %prun. Here's (part of) the output
(when I use savez):

195503 function calls in 4.466 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)
2 2.284 1.142 2.284 1.142 {method 'close' of
'_io.BufferedWriter' objects}
1 0.918 0.918 0.918 0.918 {built-in method remove}
48841 0.568 0.000 0.568 0.000 {method 'write' of
'_io.BufferedWriter' objects}
48829 0.379 0.000 0.379 0.000 {built-in method crc32}
48830 0.148 0.000 0.148 0.000 {method 'read' of
'_io.BufferedReader' objects}
1 0.090 0.090 0.993 0.993 zipfile.py:1315(write)
1 0.072 0.072 0.072 0.072 {method 'tostring' of
'numpy.ndarray' objects}
48848 0.005 0.000 0.005 0.000 {built-in method len}
1 0.001 0.001 0.270 0.270 format.py:362(write_array)
3 0.000 0.000 0.000 0.000 {built-in method open}
1 0.000 0.000 4.466 4.466 npyio.py:560(_savez)
2 0.000 0.000 0.000 0.000 zipfile.py:1459(close)
1 0.000 0.000 4.466 4.466 {built-in method exec}

Here's the output when I use save to save to a npy file:

39 function calls in 0.266 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)
4 0.196 0.049 0.196 0.049 {method 'write' of
'_io.BufferedWriter' objects}
1 0.069 0.069 0.069 0.069 {method 'tostring' of
'numpy.ndarray' objects}
1 0.001 0.001 0.266 0.266 format.py:362(write_array)
1 0.000 0.000 0.000 0.000 {built-in method open}
1 0.000 0.000 0.266 0.266 npyio.py:406(save)
1 0.000 0.000 0.000 0.000
format.py:261(write_array_header_1_0)
1 0.000 0.000 0.000 0.000 {method 'close' of
'_io.BufferedWriter' objects}
1 0.000 0.000 0.266 0.266 {built-in method exec}
1 0.000 0.000 0.000 0.000 format.py:154(magic)
1 0.000 0.000 0.000 0.000
format.py:233(header_data_from_array_1_0)
1 0.000 0.000 0.266 0.266 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 numeric.py:462(asanyarray)
1 0.000 0.000 0.000 0.000 py3k.py:28(asbytes)

The calls to close and the built-in method remove seem to be
responsible for the inefficiency of the Numpy implementation (compared to
the Julia package that I mentioned before). This was tested using Python
3.4 and Numpy 1.8.1.
However, if I do the tests with Python 3.3.5 and Numpy 1.8.0, savez becomes
much faster, so I think there is something wrong with the combination
Python 3.4/Numpy 1.8.1.
Also, if I use Python 2.4 and Numpy 1.2 (from my school's cluster) I get
that np.save takes about 3.5 seconds and np.savez takes about 7 seconds, so
all these timings seem to be hugely dependent on the system/version (maybe
this explains David Palao's results?).

However, they all point out that a significant amount of time is spent
computing the crc32. Notice that prun reports that it takes 0.379 seconds
to compute the crc32 of an array that takes 0.2 seconds to save to a npy
file. I believe this is too much! And it gets worse if you try to save
bigger arrays.
Julian Taylor
2014-04-17 19:51:32 UTC
Permalink
Post by onefire
Hi Nathaniel,
Thanks for the suggestion. I did profile the program before, just not
using Python.
one problem of npz is that the zipfile module does not support streaming
data in (or if it does now, we aren't using it).
So numpy writes the file uncompressed to disk and then zips it, which is
horrible for performance and disk usage.

It would be nice if we could add support for different compression
modules like gzip or xz which allow streaming data directly into a file
without an intermediate.
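As an illustration of the streaming idea (not a proposed numpy API): ``numpy.lib.format.write_array`` accepts any file-like object, so the npy stream can go straight through a gzip compressor without an intermediate file:

```python
import gzip
import numpy as np
from numpy.lib import format as npformat

x = np.linspace(1, 10, 100000)

# gzip.GzipFile is file-like, so the npy header and data are
# streamed straight into the compressor, no temporary file needed.
with gzip.open("x.npy.gz", "wb") as f:
    npformat.write_array(f, x)

with gzip.open("x.npy.gz", "rb") as f:
    y = npformat.read_array(f)
```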
Valentin Haenel
2014-04-17 20:26:35 UTC
Permalink
Hi,
Post by Julian Taylor
Post by onefire
Hi Nathaniel,
Thanks for the suggestion. I did profile the program before, just not
using Python.
one problem of npz is that the zipfile module does not support streaming
data in (or if it does now we aren't using it).
So numpy writes the file uncompressed to disk and then zips it which is
horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to
cStringIO instances and then use ``ZipFile.writestr`` with the
``getvalue()`` of the cStringIO object. However, that approach may
require some memory. In Python 2.7, for each array: one copy inside the
cStringIO instance and then another copy when calling getvalue on the
cStringIO, I believe.
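In Python 3 terms the same workaround can be sketched with io.BytesIO (the helper name here is illustrative, not numpy API):

```python
import io
import zipfile
import numpy as np
from numpy.lib import format as npformat

def savez_via_buffer(fname, **arrays):
    # Serialize each array into an in-memory buffer, then writestr()
    # the bytes into the zip -- no temporary .npy file on disk.
    with zipfile.ZipFile(fname, "w", zipfile.ZIP_STORED) as zf:
        for name, arr in arrays.items():
            buf = io.BytesIO()
            npformat.write_array(buf, np.asanyarray(arr))
            zf.writestr(name + ".npy", buf.getvalue())

savez_via_buffer("buffered.npz", data=np.arange(10.0))
loaded = np.load("buffered.npz")
print(loaded["data"])
```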

best,

V-
Valentin Haenel
2014-04-17 20:56:27 UTC
Permalink
Post by Valentin Haenel
As a workaround may also be possible to write the temporary NPY files to
cStringIO instances and then use ``ZipFile.writestr`` with the
``getvalue()`` of the cStringIO object. However that approach may
require some memory. In python 2.7, for each array: one copy inside the
cStringIO instance and then another copy of when calling getvalue on the
cString, I believe.
There is a proof-of-concept implementation here:

https://github.com/esc/numpy/compare/feature;npz_no_temp_file

Here are the timings, again using ``sync()`` from bloscpack (but it's
just a ``os.system('sync')``, in case you want to run your own
benchmarks):

In [1]: import numpy as np

In [2]: import bloscpack.sysutil as bps

In [3]: x = np.linspace(1, 10, 50000000)

In [4]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 1.93 s per loop

In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.88 s per loop

In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
1 loops, best of 3: 3.22 s per loop

Not too bad, but still slower than plain NPY, memory copies would be my
guess.

V-

PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master
Valentin Haenel
2014-04-17 21:18:09 UTC
Permalink
Also, in case you were wondering, here is the profiler output:

In [2]: %prun -l 10 np._savez_no_temp("x.npy", [x], {}, False)
943 function calls (917 primitive calls) in 1.139 seconds

Ordered by: internal time
List reduced from 99 to 10 due to restriction <10>

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.386 0.386 0.386 0.386 {zlib.crc32}
8 0.234 0.029 0.234 0.029 {method 'write' of 'file' objects}
27 0.162 0.006 0.162 0.006 {method 'write' of 'cStringIO.StringO' objects}
1 0.158 0.158 0.158 0.158 {method 'getvalue' of 'cStringIO.StringO' objects}
1 0.091 0.091 0.091 0.091 {method 'close' of 'file' objects}
24 0.064 0.003 0.064 0.003 {method 'tobytes' of 'numpy.ndarray' objects}
1 0.022 0.022 1.119 1.119 npyio.py:608(_savez_no_temp)
1 0.019 0.019 1.139 1.139 <string>:1(<module>)
1 0.002 0.002 0.227 0.227 format.py:362(write_array)
1 0.001 0.001 0.001 0.001 zipfile.py:433(_GenerateCRCTable)

V-
Valentin Haenel
2014-04-17 21:35:37 UTC
Permalink
Hi,
And, to shed some more light on this, the kernprof (line-by-line)
output (of a slightly modified version):

zsh» cat mp.py
import numpy as np
x = np.linspace(1, 10, 50000000)
np._savez_no_temp("x.npy", [x], {}, False)

zsh» ./kernprof.py -v -l mp.py
Wrote profile results to mp.py.lprof
Timer unit: 1e-06 s

File: numpy/lib/npyio.py
Function: _savez_no_temp at line 608
Total time: 1.16438 s

Line # Hits Time Per Hit % Time Line Contents
==============================================================
608 @profile
609 def _savez_no_temp(file, args, kwds, compress):
610 # Import is postponed to here since zipfile depends on gzip, an optional
611 # component of the so-called standard library.
612 1 5655 5655.0 0.5 import zipfile
613
614 1 6 6.0 0.0 from cStringIO import StringIO
615
616 1 2 2.0 0.0 if isinstance(file, basestring):
617 1 2 2.0 0.0 if not file.endswith('.npz'):
618 1 1 1.0 0.0 file = file + '.npz'
619
620 1 1 1.0 0.0 namedict = kwds
621 2 4 2.0 0.0 for i, val in enumerate(args):
622 1 6 6.0 0.0 key = 'arr_%d' % i
623 1 1 1.0 0.0 if key in namedict.keys():
624 raise ValueError(
625 "Cannot use un-named variables and keyword %s" % key)
626 1 1 1.0 0.0 namedict[key] = val
627
628 1 0 0.0 0.0 if compress:
629 compression = zipfile.ZIP_DEFLATED
630 else:
631 1 1 1.0 0.0 compression = zipfile.ZIP_STORED
632
633 1 42734 42734.0 3.7 zipf = zipfile_factory(file, mode="w", compression=compression)
634 # reusable memory buffer
635 1 5 5.0 0.0 sio = StringIO()
636 2 10 5.0 0.0 for key, val in namedict.items():
637 1 3 3.0 0.0 fname = key + '.npy'
638 1 4 4.0 0.0 sio.seek(0) # reset buffer
639 1 219843 219843.0 18.9 format.write_array(sio, np.asanyarray(val))
640 1 156962 156962.0 13.5 array_bytes = sio.getvalue(True)
641 1 625162 625162.0 53.7 zipf.writestr(fname, array_bytes)
642
643 1 113977 113977.0 9.8 zipf.close()

So it would appear that >50% of the time is spent in
``zipfile.writestr``.
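To get a feel for how much of that ``writestr`` time is just the checksum, one can time ``zlib.crc32`` (which is what zipfile uses internally) on the raw bytes directly. A minimal sketch, using a smaller array than the thread's 50M example so it runs quickly:

```python
import time
import zlib

import numpy as np

# Lower-bound estimate of the CRC32 overhead that ZipFile.writestr pays:
# zipfile computes zlib.crc32 over the full payload of every entry.
x = np.linspace(1, 10, 5_000_000)  # smaller than the 50M-element example
payload = x.tobytes()

start = time.perf_counter()
checksum = zlib.crc32(payload)
elapsed = time.perf_counter() - start
print("crc32 over %.0f MB: %.1f ms" % (len(payload) / 1e6, elapsed * 1000))
```

This only bounds the checksum cost from below; the rest of the ``writestr`` time goes into copying the payload into the archive.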

V-
Valentin Haenel
2014-04-18 16:29:27 UTC
Permalink
Hi,
Post by Valentin Haenel
Post by Valentin Haenel
Post by Julian Taylor
Post by onefire
Thanks for the suggestion. I did profile the program before, just not
using Python.
one problem of npz is that the zipfile module does not support streaming
data in (or if it does now we aren't using it).
So numpy writes the file uncompressed to disk and then zips it which is
horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to
cStringIO instances and then use ``ZipFile.writestr`` with the
``getvalue()`` of the cStringIO object. However, that approach may
require some extra memory. In Python 2.7, for each array: one copy inside
the cStringIO instance and then another copy when calling getvalue on the
cStringIO, I believe.
https://github.com/esc/numpy/compare/feature;npz_no_temp_file
Anybody interested in me fixing this up (unit tests, API, etc..) for
inclusion?

V-
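For reference, the core of the linked no-temp-file patch can be sketched in a few lines. This uses modern ``io.BytesIO`` in place of the Python 2 ``cStringIO`` from the thread, and ``savez_in_memory`` is a hypothetical name, not the patch's actual API:

```python
import io
import zipfile

import numpy as np
from numpy.lib import format as npformat

def savez_in_memory(filename, **arrays):
    # Serialize each array to an in-memory buffer and hand the bytes to
    # ZipFile.writestr, so no temporary .npy files ever touch the disk.
    with zipfile.ZipFile(filename, mode="w",
                         compression=zipfile.ZIP_STORED) as zipf:
        for name, arr in arrays.items():
            buf = io.BytesIO()
            npformat.write_array(buf, np.asanyarray(arr))
            zipf.writestr(name + ".npy", buf.getvalue())

savez_in_memory("x.npz", data=np.linspace(1, 10, 1000))
```

The price, as noted in the quoted message, is holding at least one full serialized copy of each array in memory while it is written.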
Julian Taylor
2014-04-18 17:20:33 UTC
Permalink
Post by Valentin Haenel
Hi,
https://github.com/esc/numpy/compare/feature;npz_no_temp_file
Anybody interested in me fixing this up (unit tests, API, etc..) for
inclusion?
I wonder if it would be better to instead use a fifo to avoid the memory
doubling. Windows probably hasn't got them (exposed via python) but one
can slap a platform check in front.
attached a proof of concept without proper error handling (which is
unfortunately the tricky part)
Valentin Haenel
2014-07-04 13:49:54 UTC
Permalink
sorry, for the top-post, but should we add this as an issue on the
github tracker? I'd like to revisit it this summer.

V-
Post by Julian Taylor
I wonder if it would be better to instead use a fifo to avoid the memory
doubling. Windows probably hasn't got them (exposed via python) but one
can slap a platform check in front.
attached a proof of concept without proper error handling (which is
unfortunately the tricky part)
Sturla Molden
2014-07-06 07:52:53 UTC
Permalink
There is no os.mkfifo on Windows.

Sturla
Post by Valentin Haenel
sorry, for the top-post, but should we add this as an issue on the
github tracker? I'd like to revisit it this summer.
V-
David Palao
2014-04-17 09:17:37 UTC
Permalink
Post by onefire
Hi all,
I have been playing with the idea of using Numpy's binary format as a
lightweight alternative to HDF5 (which I believe is the "right" way to do if
one does not have a problem with the dependency).
I am pretty happy with the npy format, but the npz format seems to be broken
as far as performance is concerned (or I am missing something obvious!). The
following ipython session illustrates the issue:
In [1]: import numpy as np
In [2]: x = np.linspace(1, 10, 50000000)
In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms
In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s
Hi,
In my case (python-2.7.3, numpy-1.6.1):

In [23]: %time save("xx.npy", x)
CPU times: user 0.00 s, sys: 0.23 s, total: 0.23 s
Wall time: 4.07 s

In [24]: %time savez("xx.npz", data = x)
CPU times: user 0.42 s, sys: 0.61 s, total: 1.02 s
Wall time: 4.26 s

In my case I don't see the "unbelievable amount of overhead" of the npz thing.

Best
Valentin Haenel
2014-04-17 20:01:04 UTC
Permalink
Hi again,
Post by David Palao
Hi,
In [23]: %time save("xx.npy", x)
CPU times: user 0.00 s, sys: 0.23 s, total: 0.23 s
Wall time: 4.07 s
In [24]: %time savez("xx.npz", data = x)
CPU times: user 0.42 s, sys: 0.61 s, total: 1.02 s
Wall time: 4.26 s
In my case I don't see the "unbelievable amount of overhead" of the npz thing.
When profiling IO operations, there are many factors that can influence
measurements. In my experience on Linux these include: the filesystem
cache, the CPU governor, the system load, power-saving features (e.g.
laptop-mode tools), the type of hard drive and how it is connected, and
any cron jobs that might be running (e.g. updating the locate DB).

So for example when measuring the time it takes to write something to
disk on Linux, I always at least include a call to ``sync``
which will ensure that all kernel filesystem buffers will be written to
disk. Even then, you may still have a lot of variability.
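On Python 3.3+ the same effect is available from the standard library via ``os.sync()``, without needing a wrapper. A minimal sketch of the measurement:

```python
import os
import time

import numpy as np

# Time a write *including* the flush of the kernel's filesystem buffers.
# os.sync() blocks until dirty buffers are committed, so buffered writes
# cannot hide the real IO cost the way the raw %time numbers can.
x = np.linspace(1, 10, 1_000_000)

start = time.perf_counter()
np.save("x.npy", x)
os.sync()
print("save + sync: %.3f s" % (time.perf_counter() - start))
```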

As part of bloscpack.sysutil I have wrapped this to be available from
Python (needs root though). So, to re-run the benchmarks, doing each
one twice:

In [1]: import numpy as np

In [2]: import bloscpack.sysutil as bps

In [3]: x = np.linspace(1, 10, 50000000)

In [4]: %time np.save("x.npy", x)
CPU times: user 12 ms, sys: 356 ms, total: 368 ms
Wall time: 1.41 s

In [5]: %time np.save("x.npy", x)
CPU times: user 0 ns, sys: 368 ms, total: 368 ms
Wall time: 811 ms

In [6]: %time np.savez("x.npz", data = x)
CPU times: user 540 ms, sys: 864 ms, total: 1.4 s
Wall time: 4.74 s

In [7]: %time np.savez("x.npz", data = x)
CPU times: user 580 ms, sys: 808 ms, total: 1.39 s
Wall time: 9.47 s

In [8]: bps.sync()

In [9]: %time np.save("x.npy", x) ; bps.sync()
CPU times: user 0 ns, sys: 368 ms, total: 368 ms
Wall time: 2.2 s

In [10]: %time np.save("x.npy", x) ; bps.sync()
CPU times: user 0 ns, sys: 356 ms, total: 356 ms
Wall time: 2.16 s

In [11]: bps.sync()

In [12]: %time np.savez("x.npz", x) ; bps.sync()
CPU times: user 564 ms, sys: 816 ms, total: 1.38 s
Wall time: 8.21 s

In [13]: %time np.savez("x.npz", x) ; bps.sync()
CPU times: user 588 ms, sys: 772 ms, total: 1.36 s
Wall time: 6.83 s

As you can see, even when using ``sync`` the values might vary, so in
addition it might be worth using %timeit, which will at least run it
three times and select the best one in its default setting:

In [14]: %timeit np.save("x.npy", x)
1 loops, best of 3: 2.4 s per loop

In [15]: %timeit np.savez("x.npz", x)
1 loops, best of 3: 7.1 s per loop

In [16]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 3.11 s per loop

In [17]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.36 s per loop

So, anyway, given these readings, I would tend to support the claim
that there is something slowing down writing when using plain NPZ w/o
compression.

FYI: when reading, the kernel keeps files that were recently read in the
filesystem buffers and so when measuring reads, I tend to drop those
caches using ``drop_caches()`` from bloscpack.sysutil (which wraps using
the linux proc fs).
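For completeness, a ``drop_caches()`` helper along those lines can be approximated by writing to the proc interface directly. This is a sketch of the idea, not bloscpack's actual implementation, and the write requires root:

```python
def drop_caches():
    # Ask the Linux kernel to evict the page cache (plus dentries and
    # inodes) so that a subsequent read benchmark actually hits the disk.
    # Writing "3" to this proc file requires root; report failure
    # instead of raising so benchmarks can degrade gracefully.
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
        return True
    except OSError:
        return False
```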

best,

V-
Valentin Haenel
2014-04-17 22:45:53 UTC
Permalink
Hello,
Post by Valentin Haenel
As part of bloscpack.sysutil I have wrapped this to be available from
Python (needs root though). So, to re-run the benchmarks, doing each
Actually, I just realized, that doing a ``sync`` doesn't require root.

my bad,

V-
onefire
2014-04-18 00:09:59 UTC
Permalink
Interesting! Using sync() as you suggested makes every write slower, and
it decreases the time difference between save and savez,
so maybe I was observing the 10 times difference because the file system
buffers were being flushed immediately after a call to savez, but not right
after a call to np.save.

I think your workaround might help, but a better solution would be to not
use Python's zipfile module at all. This would make it possible to, say,
let the user choose the checksum algorithm or to turn that off.
Or maybe the compression stuff makes this route too complicated to be worth
the trouble? (after all, the zip format is not that hard to understand)

Gilberto
Post by Valentin Haenel
Hello,
Post by Valentin Haenel
As part of bloscpack.sysutil I have wrapped this to be available from
Python (needs root though). So, to re-run the benchmarks, doing each
Actually, I just realized, that doing a ``sync`` doesn't require root.
my bad,
V-
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
onefire
2014-04-18 00:12:43 UTC
Permalink
I found this github issue (https://github.com/numpy/numpy/pull/3465) where
someone mentions the idea of forking the zip library.

Gilberto
Post by onefire
Interesting! Using sync() as you suggested makes every write slower, and
it decreases the time difference between save and savez,
so maybe I was observing the 10 times difference because the file system
buffers were being flushed immediately after a call to savez, but not right
after a call to np.save.
I think your workaround might help, but a better solution would be to not
use Python's zipfile module at all. This would make it possible to, say,
let the user choose the checksum algorithm or to turn that off.
Or maybe the compression stuff makes this route too complicated to be
worth the trouble? (after all, the zip format is not that hard to
understand)
Gilberto
Valentin Haenel
2014-04-18 10:16:32 UTC
Permalink
Hi Gilberto,
Post by onefire
Interesting! Using sync() as you suggested makes every write slower, and
it decreases the time difference between save and savez,
so maybe I was observing the 10 times difference because the file system
buffers were being flushed immediately after a call to savez, but not right
after a call to np.save.
I am happy that you found my suggestion useful! Given that the current
savez implementation first writes temporary arrays to disk and then
copies them from their temporary location to the zipfile, one might
argue that this is what causes the buffers to be flushed, since it does
more IO than the save implementation. Then again, I don't really know the
gory details of how the filesystem buffers behave and how they can
be configured.

best,

V-
Valentin Haenel
2014-04-18 11:01:09 UTC
Permalink
Hi again,
Post by onefire
I think your workaround might help, but a better solution would be to not
use Python's zipfile module at all. This would make it possible to, say,
let the user choose the checksum algorithm or to turn that off.
Or maybe the compression stuff makes this route too complicated to be worth
the trouble? (after all, the zip format is not that hard to understand)
Just to give you an idea of what my aforementioned Bloscpack library can
do in the case of linspace:

In [1]: import numpy as np

In [2]: import bloscpack as bp

In [3]: import bloscpack.sysutil as bps

In [4]: x = np.linspace(1, 10, 50000000)

In [5]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 2.12 s per loop

In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
1 loops, best of 3: 627 ms per loop

In [7]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 1.92 s per loop

In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
3 loops, best of 3: 564 ms per loop

In [9]: ls -lah x.npy x.blp
-rw-r--r-- 1 root root 49M Apr 18 12:53 x.blp
-rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy

However, this is a bit of a special case, since Blosc does extremely well
-- both speed- and size-wise -- on the linspace data; your mileage may
vary.

best,

V-
Francesc Alted
2014-04-18 12:03:00 UTC
Permalink
Post by Valentin Haenel
Hi again,
Post by onefire
I think your workaround might help, but a better solution would be to not
use Python's zipfile module at all. This would make it possible to, say,
let the user choose the checksum algorithm or to turn that off.
Or maybe the compression stuff makes this route too complicated to be worth
the trouble? (after all, the zip format is not that hard to understand)
Just to give you an idea of what my aforementioned Bloscpack library can
In [1]: import numpy as np
In [2]: import bloscpack as bp
In [3]: import bloscpack.sysutil as bps
In [4]: x = np.linspace(1, 10, 50000000)
In [5]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 2.12 s per loop
In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
1 loops, best of 3: 627 ms per loop
In [7]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 1.92 s per loop
In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
3 loops, best of 3: 564 ms per loop
In [9]: ls -lah x.npy x.blp
-rw-r--r-- 1 root root 49M Apr 18 12:53 x.blp
-rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy
However, this is a bit of a special case, since Blosc does extremely well
-- both speed- and size-wise -- on the linspace data; your mileage may
vary.
Exactly, and besides, Blosc can use different codecs inside it. Just for
completeness, here is a small benchmark of what you can expect from
them (my laptop does not have an SSD, so my figures are a bit slow
compared with Valentin's):

In [50]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 5.7 s per loop

In [51]: cargs = bp.args.DEFAULT_BLOSC_ARGS

In [52]: cargs['cname'] = 'blosclz'

In [53]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-blosclz.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.12 s per loop

In [54]: cargs['cname'] = 'lz4'

In [55]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 985 ms per loop

In [56]: cargs['cname'] = 'lz4hc'

In [57]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4hc.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.95 s per loop

In [58]: cargs['cname'] = 'snappy'

In [59]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-snappy.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.11 s per loop

In [60]: cargs['cname'] = 'zlib'

In [61]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-zlib.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 3.12 s per loop

So all the codecs can make the storage go faster than a plain np.save(),
most especially blosclz, lz4 and snappy. However, lz4hc and zlib
achieve the best compression ratios:

In [62]: ls -lht x*.*
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:49 x-zlib.blp
-rw-r--r-- 1 faltet users 54M 18 abr 13:48 x-snappy.blp
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:48 x-lz4hc.blp
-rw-r--r-- 1 faltet users 48M 18 abr 13:47 x-lz4.blp
-rw-r--r-- 1 faltet users 49M 18 abr 13:47 x-blosclz.blp
-rw-r--r-- 1 faltet users 382M 18 abr 13:42 x.npy

But again, we are talking about an especially nice compression case.
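A quick way to see why linspace is such a friendly case: Blosc applies a byte-wise shuffle before compressing, and for smoothly varying doubles that shuffle makes the byte stream extremely repetitive. A small illustration of the effect using plain zlib (not Blosc itself):

```python
import zlib

import numpy as np

x = np.linspace(1, 10, 1_000_000)
raw = x.tobytes()

# Byte-wise shuffle: collect byte 0 of every float64, then byte 1, etc.
# This is (roughly) the transform Blosc applies before its codecs run.
shuffled = x.view(np.uint8).reshape(-1, 8).T.copy().tobytes()

print("plain    ratio: %.3f" % (len(zlib.compress(raw)) / len(raw)))
print("shuffled ratio: %.3f" % (len(zlib.compress(shuffled)) / len(shuffled)))
```

The shuffled stream groups the nearly constant high-order bytes together, which is exactly the regularity the codec comparison above exploits.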
--
Francesc Alted