Well, maybe something like a simple class emulating a dictionary that
stores key-value pairs on disk would be more than enough. Then you can use
whatever persistence layer you want (even HDF5, but not necessarily).
As a demonstration, I did a quick-and-dirty implementation of such a
persistent key store (
https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897). In it, the
KeyStore class (fewer than 40 lines long) is responsible for storing the
value (two arrays) under a key (a directory). As I am quite a big fan of
compression, I implemented a couple of serialization flavors: one using the
.npz format (so no dependencies other than NumPy are needed) and the other
using the ctable object from the bcolz package (bcolz.blosc.org).
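
For reference, here is a minimal sketch of the idea (not the exact code
from the gist; the class name and file layout are just illustrative),
showing only the .npz flavor:

import os
import numpy as np

class KeyStore(object):
    """Dict-like store mapping an integer key to a pair of NumPy arrays on disk.
    (Minimal sketch, not the actual gist code.)"""

    def __init__(self, rootdir):
        self.rootdir = rootdir
        if not os.path.exists(rootdir):
            os.makedirs(rootdir)

    def _path(self, key):
        # one .npz file per key
        return os.path.join(self.rootdir, "%d.npz" % key)

    def __setitem__(self, key, value):
        ints, floats = value
        np.savez(self._path(key), ints=ints, floats=floats)

    def __getitem__(self, key):
        data = np.load(self._path(key))
        value = [data["ints"], data["floats"]]
        data.close()
        return value

Usage would be as simple as:

ks = KeyStore("__test")
ks[0] = [np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])]
ints, floats = ks[0]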
Here are some performance numbers:
$ python key-store.py -f numpy -d __test -l 0
########## Checking method: numpy (via .npz files) ############
Building database. Wait please...
Time ( creation) --> 1.906
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.191
Number of elements out of getitem: 10518976
***@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
75M __test
So, with the NPZ format we can deal with the 75 MB dataset quite easily. But NPZ
can compress data as well, so let's see how it goes:
$ python key-store.py -f numpy -d __test -l 9
########## Checking method: numpy (via .npz files) ############
Building database. Wait please...
Time ( creation) --> 6.636
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.384
Number of elements out of getitem: 10518976
***@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
28M __test
OK, in this case we get almost a 3x compression ratio, which is not
bad. However, the performance has degraded quite a lot.
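
In the .npz sketch above, the compressed flavor would presumably just be a
matter of swapping np.savez for np.savez_compressed (which uses zlib); the
mapping of the script's -l flag onto these calls is my assumption:

import numpy as np

ints, floats = np.arange(10), np.full(10, 0.1)
# no compression (roughly what -l 0 would select)
np.savez("plain.npz", ints=ints, floats=floats)
# zlib compression (roughly what -l 9 would select; an assumption)
np.savez_compressed("compressed.npz", ints=ints, floats=floats)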
Now let's use bcolz, first in non-compressed mode:
$ python key-store.py -f bcolz -d __test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 0.479
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.103
Number of elements out of getitem: 10518976
***@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
82M __test
Without compression, bcolz takes a bit more space (~10%) than NPZ.
However, bcolz is actually meant to be used with compression on by default:
$ python key-store.py -f bcolz -d __test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 0.487
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.098
Number of elements out of getitem: 10518976
***@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
29M __test
So, the final disk usage is quite similar to NPZ, but bcolz can store and
retrieve the data a lot faster. Also, the decompression speed is on par
with not using compression at all. This is because bcolz uses Blosc behind
the scenes, which is much faster than zlib (used by NPZ), and sometimes
faster than a memcpy(). However, even though we are doing I/O against the
disk, this dataset is so small that it fits in the OS filesystem cache, so
the benchmark is actually measuring I/O at memory speeds, not disk speeds.
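
For completeness, here is a hypothetical sketch of how a bcolz-backed
flavor could persist each pair on disk. This is just my own illustration
of the approach; the gist itself uses a ctable, while here I use two
carrays per key to keep the sketch short:

import os
import numpy as np
import bcolz

def store_pair(rootdir, key, ints, floats, clevel=9, cname='blosclz'):
    # Write both arrays under rootdir/<key>/ as Blosc-compressed carrays.
    keydir = os.path.join(rootdir, str(key))
    if not os.path.exists(keydir):
        os.makedirs(keydir)
    cparams = bcolz.cparams(clevel=clevel, cname=cname)
    bcolz.carray(ints, rootdir=os.path.join(keydir, 'ints'),
                 mode='w', cparams=cparams)
    bcolz.carray(floats, rootdir=os.path.join(keydir, 'floats'),
                 mode='w', cparams=cparams)

def load_pair(rootdir, key):
    # Read both arrays back; Blosc decompression happens transparently.
    keydir = os.path.join(rootdir, str(key))
    return [bcolz.open(os.path.join(keydir, 'ints'))[:],
            bcolz.open(os.path.join(keydir, 'floats'))[:]]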
In order to do a more realistic comparison, let's use a dataset that is
much larger than the amount of memory in my laptop (8 GB):
$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 133.650
Retrieving 100 keys in arbitrary order...
Time ( query) --> 2.881
Number of elements out of getitem: 91907396
***@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
39G /media/faltet/docker/__test
and now, with compression on:
$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 145.633
Retrieving 100 keys in arbitrary order...
Time ( query) --> 1.339
Number of elements out of getitem: 91907396
***@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
12G /media/faltet/docker/__test
So, we are still seeing the 3x compression ratio. But the interesting
thing here is that the compressed version runs more than 2x faster than the
uncompressed one (13 ms/query vs 29 ms/query). In this case I was using an
SSD (hence the low query times), so the compression advantage is even more
noticeable than when working at memory speeds as above (as expected).
But anyway, this is just a demonstration that you don't need heavy tools to
achieve what you want. And as a corollary, (fast) compressors can save you
not only storage, but processing time too.
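
As a tiny, hypothetical illustration of that corollary, python-blosc (the
compressor that bcolz uses under the hood) can compress a NumPy array in
memory and get it back very cheaply:

import numpy as np
import blosc

a = np.linspace(0, 100, int(1e7))
# compress the array in memory with Blosc (blosclz codec, level 9)
packed = blosc.pack_array(a, clevel=9, cname='blosclz')
print("compression ratio: %.1fx" % (a.nbytes / float(len(packed))))
b = blosc.unpack_array(packed)   # decompress back into a NumPy array
assert np.array_equal(a, b)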
Francesc
Post by Nathaniel Smith
I'd try storing the data in hdf5 (probably via h5py, which is a more
basic interface without all the bells and whistles that pytables adds),
though any method you use is going to be limited by the need to do a seek
before each read. Storing the data on SSD will probably help a lot if you
can afford it for your data size.
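
(For reference, a minimal h5py sketch of this approach might look like the
following; this is my reading of the suggestion, with illustrative names,
not Nathaniel's code:)

import numpy as np
import h5py

# write: one group per integer key, two variable-length datasets per group
with h5py.File('data.h5', 'w') as f:
    grp = f.create_group('0')
    grp.create_dataset('ints', data=np.array([1, 8, 15, 16000]))
    grp.create_dataset('floats', data=np.array([0.1, 0.1, 0.1, 0.1]))

# read: each lookup involves a seek into the file before reading
with h5py.File('data.h5', 'r') as f:
    ints = f['0/ints'][:]
    floats = f['0/floats'][:]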
Post by Ryan R. Rosario
Hi,
I have a very large dictionary that must be shared across processes and
does not fit in RAM. I need access to this object to be fast. The key is an
integer ID and the value is a list containing two elements, both of them
numpy arrays (one has ints, the other has floats). The key is sequential,
starts at 0, and there are no gaps, so the "outer" layer of this data
structure could really just be a list with the key actually being the
index. The lengths of each pair of arrays may differ across keys.
{
  [
    numpy.array([1, 8, 15, …, 16000]),
    numpy.array([0.1, 0.1, 0.1, …, 0.1])
  ],
  [
    numpy.array([5, 6]),
    numpy.array([0.5, 0.5])
  ],
  …
}
What I have tried or considered so far:
- manager proxy objects, but the object was so big that low-level code
threw an exception due to format, and monkey-patching wasn't successful.
- Redis, which was far too slow due to setting up connections, data
conversion, etc.
- NumPy rec arrays + memory mapping, but there is a restriction that the
numpy arrays in each "column" must be of the same, fixed size.
- I looked at PyTables, which may be a solution, but it seems to have a
very steep learning curve.
- I haven't tried SQLite3, but I am worried about the time it takes to
query the DB for a sequential ID, and then translate byte arrays.
Any ideas? I greatly appreciate any guidance you can provide.
Thanks,
Ryan
--
Nathaniel J. Smith -- http://vorpus.org
--
Francesc Alted