Discussion:
[Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt
Saullo Castro
2014-10-26 08:46:48 UTC
Permalink
I would like to start working on a memory efficient alternative for
np.loadtxt and np.genfromtxt that uses arrays instead of lists to store the
data while the file iterator is exhausted.

The motivation came from this SO question:

http://stackoverflow.com/q/26569852/832621

where for huge arrays the current NumPy ASCII readers are really slow and
require ~6 times more memory. For the same case I tested pandas' read_csv(),
which required about 2 times more memory.

I would be glad if you could share your experience on this matter.

Greetings,
Saullo
Jeff Reback
2014-10-26 11:54:14 UTC
Permalink
you should have a read here:
http://wesmckinney.com/blog/?p=543

going below the 2x memory usage on read-in is non-trivial and costly in terms of performance
Eelco Hoogendoorn
2014-10-26 13:21:03 UTC
Permalink
I'm not sure why the memory doubling is necessary. Isn't it possible to
preallocate the arrays and write to them? I suppose this might be
inefficient though, in case you end up reading only a small subset of rows
out of a mostly corrupt file? But that seems to be a rather uncommon corner
case.

Either way, I'd say a doubling of memory use is fair game for numpy.
Generality is more important than absolute performance. The most important
thing is that temporary Python data structures are avoided. That shouldn't
be too hard to accomplish, and would realize most of the performance and
memory gains, I imagine.
Post by Jeff Reback
you should have a read here:
http://wesmckinney.com/blog/?p=543
going below the 2x memory usage on read-in is non-trivial and costly in
terms of performance
Robert Kern
2014-10-26 13:32:18 UTC
Permalink
On Sun, Oct 26, 2014 at 1:21 PM, Eelco Hoogendoorn wrote:
Post by Eelco Hoogendoorn
I'm not sure why the memory doubling is necessary. Isn't it possible to
preallocate the arrays and write to them?
Not without reading the whole file first to know how many rows to preallocate.
--
Robert Kern
RayS
2014-10-26 18:40:52 UTC
Permalink
Post by Robert Kern
On Sun, Oct 26, 2014 at 1:21 PM, Eelco Hoogendoorn wrote:
Post by Eelco Hoogendoorn
I'm not sure why the memory doubling is necessary. Isn't it possible to
preallocate the arrays and write to them?
Not without reading the whole file first to know how many rows to preallocate.
Seems to me that loadtxt()
http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
should have an optional shape. I often know how many rows I have (# of
samples of data) from other metadata. Then:
- if the file is smaller for some reason (you're not sure and pad your
  estimate), it could do one of:
  - zero-pad the array
  - raise()
  - return a truncated view
- if larger:
  - raise()
  - return the data read (this would act like fileObject.read(size))
(a rough sketch of the idea follows below)
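Something along these lines, as an untested sketch (the function name, the
shape argument and the on_short option are made up for illustration; this is
not an existing numpy API):

import numpy as np

def loadtxt_prealloc(fname, shape, dtype=float, delimiter=None, on_short="pad"):
    # Hypothetical sketch: read delimited text into a preallocated 2-D array.
    # on_short controls what happens if the file has fewer rows than shape:
    # "pad" keeps trailing zeros, "raise" raises, "truncate" returns only
    # the rows actually read.
    out = np.zeros(shape, dtype=dtype)
    nrows = 0
    with open(fname) as f:
        for line in f:
            line = line.split('#', 1)[0].strip()   # drop comments and blank lines
            if not line:
                continue
            if nrows >= shape[0]:
                raise ValueError("file has more rows than the given shape")
            out[nrows] = np.array(line.split(delimiter), dtype=dtype)
            nrows += 1
    if nrows < shape[0]:
        if on_short == "raise":
            raise ValueError("file has fewer rows than the given shape")
        if on_short == "truncate":
            return out[:nrows]
    return out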
- Ray S
Derek Homeier
2014-10-26 13:43:44 UTC
Permalink
Post by Eelco Hoogendoorn
I'm not sure why the memory doubling is necessary. Isn't it possible to preallocate the arrays and write to them? I suppose this might be inefficient though, in case you end up reading only a small subset of rows out of a mostly corrupt file? But that seems to be a rather uncommon corner case.
Either way, I'd say a doubling of memory use is fair game for numpy. Generality is more important than absolute performance. The most important thing is that temporary Python data structures are avoided. That shouldn't be too hard to accomplish, and would realize most of the performance and memory gains, I imagine.
Preallocation is not straightforward because the parser needs to be able in general to work with streamed input.
I think I even still have a branch on GitHub bypassing this on request (by keyword argument).
But a factor 2 is already a huge improvement over that factor ~6 coming from the current text readers buffering
the entire input as a list of lists of Python strings, not to speak of the vast performance gain from using a parser
implemented in C like pandas' - in fact, one of the last times this subject came up, one suggestion was to steal
pandas.read_csv and adapt it as required.
Someone also posted some code or the draft thereof for using resizable arrays quite a while ago, which would
reduce the memory overhead for very large arrays.

Cheers,
Derek
Chris Barker
2014-10-28 20:09:09 UTC
Permalink
A few thoughts:

1) yes, a faster, more memory efficient text file parser would be great.
Yeah, if your workflow relies on parsing lots of huge text files, you
probably need another workflow. But it's a really, really common thing to
need to do -- why not do it fast?

2) """you are describing a special case where you know the data size
apriori (eg not streaming), dtypes are readily apparent from a small sample
case
and in general your data is not messy """

sure -- that's a special case, but it's a really common special case (OK --
without knowing your data size, anyway...)

3)
Post by Derek Homeier
Someone also posted some code or the draft thereof for using resizable
arrays quite a while ago, which would
reduce the memory overhead for very large arrays.
That may have been me -- I have a resizable array class, both a pure Python
and a not-quite-finished Cython version. In practice, if you add stuff to the
array row by row (or item by item), it's no faster than putting it all in a
list and then converting to an array -- but it IS more memory efficient,
which seems to be the issue here. Let me know if you want it -- I really
need to get it up on GitHub one of these days.

My take: for fast parsing of big files you need:

To do the parsing/converting in C -- what's wrong with good old fscanf, at
least for the basic types -- it's pretty darn fast.

Memory efficiency -- something like my growable array is not all that hard
to implement and pretty darn quick -- you just do the usual trick: over-
allocate a bit of memory, and when it gets full re-allocate a larger chunk.
It turns out, at least on the hardware I tested on, that the performance is
not very sensitive to how much you over-allocate -- if it's tiny (1
element) performance really sucks, but once you get to 10% or so (maybe
less) over-allocation, you don't notice the difference.
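For what it's worth, here is a minimal sketch of that over-allocation trick
(just an illustration, not the actual accumulator.py class):

import numpy as np

class GrowableArray:
    # Appendable 1-D array with amortized O(1) appends: keep a spare
    # ~25% of capacity and reallocate only when it runs out.
    def __init__(self, dtype=float, grow=1.25):
        self._data = np.empty(16, dtype=dtype)
        self._n = 0
        self._grow = grow

    def append(self, value):
        if self._n == self._data.shape[0]:
            # ndarray.resize() could grow in place and avoid this copy,
            # at the cost of refcheck headaches.
            new = np.empty(int(self._data.shape[0] * self._grow) + 1,
                           dtype=self._data.dtype)
            new[:self._n] = self._data
            self._data = new
        self._data[self._n] = value
        self._n += 1

    @property
    def values(self):
        # View of the filled part; copy it if you keep it past the next append.
        return self._data[:self._n]

# usage: acc = GrowableArray(); acc.append(1.0); arr = acc.values.copy()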

Keep the auto-figuring-out of the structure / dtypes separate from the high
speed parsing code. I'd say write the high speed parsing code first -- that
requires specification of the data types and structure -- then, if you want,
write some nice pure Python code that tries to auto-detect all that. If
it's a small file, it's fast regardless. If it's a large file, then the
overhead of the fancy auto-detection will be lost in the noise, and you'll
want the line-by-line parsing to be as fast as possible.

From a quick look, it seems that the pandas code is pretty nice -- maybe
the 2X memory footprint should be ignored.

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Nathaniel Smith
2014-10-28 20:24:28 UTC
Permalink
Post by Chris Barker
Memory efficiency -- something like my growable array is not all that
hard to implement and pretty darn quick -- you just do the usual trick:
over-allocate a bit of memory, and when it gets full re-allocate a larger
chunk.

Can't you just do this with regular numpy using .resize()? What does your
special class add? (Just curious.)
Post by Chris Barker
From a quick look, it seems that the pandas code is pretty nice -- maybe
the 2X memory footprint should be ignored.

+1

It's fun to sit around and brainstorm clever implementation strategies, but
Wes already went ahead and implemented all the tricky bits, and optimized
them too. No point in reinventing the wheel.

(Plus as I pointed out upthread, it's entirely likely that this "2x
overhead" is based on a misunderstanding/oversimplification of how virtual
memory works, and the actual practical overhead is much lower.)

-n
Julian Taylor
2014-10-28 20:30:40 UTC
Permalink
Post by Nathaniel Smith
Post by Chris Barker
Memory efficiency -- something like my growable array is not all that
hard to implement and pretty darn quick -- you just do the usual trick:
over-allocate a bit of memory, and when it gets full re-allocate a
larger chunk.
Can't you just do this with regular numpy using .resize()? What does
your special class add? (Just curious.)
Post by Chris Barker
From a quick look, it seems that the pandas code is pretty nice --
maybe the 2X memory footprint should be ignored.
+1
It's fun to sit around and brainstorm clever implementation strategies,
but Wes already went ahead and implemented all the tricky bits, and
optimized them too. No point in reinventing the wheel.
just to throw it in there, astropy recently also added a faster ascii
file reader:
https://groups.google.com/forum/#!topic/astropy-dev/biCgb3cF0v0
not familiar with how it compares to pandas.
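From a quick glance at that thread, usage would be roughly the following
(untested; the format/fast_reader keywords and the conversion to an ndarray
are from memory, so treat this as a sketch rather than a recipe):

import numpy as np
from astropy.io import ascii

# fast_reader=True selects the C tokenizer where available.
table = ascii.read("data.csv", format="csv", fast_reader=True)
arr = np.asarray(table)   # structured ndarray with one field per column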

how is pandas' support for unicode text files?
Unicode is the big weak point of numpy's current text readers and needs
to be addressed.
Chris Barker
2014-10-28 21:41:44 UTC
Permalink
Post by Nathaniel Smith
Post by Chris Barker
Memory efficiency -- something like my growable array is not all that
hard to implement and pretty darn quick -- you just do the usual trick:
over-allocate a bit of memory, and when it gets full re-allocate a larger
chunk.
Can't you just do this with regular numpy using .resize()? What does your
special class add? (Just curious.)
it used resize under the hood -- it just adds the bookkeeping for the over-
allocation, etc., and lets you access the data as though it wasn't
over-allocated.

like I said, not that difficult.

I haven't touched it for a while, but if you are curious I just threw it up
on GitHub:

https://github.com/PythonCHB/NumpyExtras

you want accumulator.py -- there is also a Cython version that I didn't
quite finish... in theory, it should be a bit faster in some cases by
reducing the need to round-trip between numpy and Python data types...

in practice, I don't think I got it to a point where I could do real-world
profiling.

Post by Nathaniel Smith
It's fun to sit around and brainstorm clever implementation strategies, but
Wes already went ahead and implemented all the tricky bits, and optimized
them too. No point in reinventing the wheel.
(Plus as I pointed out upthread, it's entirely likely that this "2x
overhead" is based on a misunderstanding/oversimplification of how virtual
memory works, and the actual practical overhead is much lower.)
good point.

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Benjamin Root
2014-10-28 20:25:52 UTC
Permalink
As a bit of an aside, I have just discovered that for fixed-width text
data, numpy's text readers seem to edge out pandas' read_fwf(), and numpy
has the advantage of being able to specify the dtypes ahead of time (it seems
that the pandas version just won't allow it, which means I end up with
float64's and object dtypes instead of float32's and |S12 dtypes where I
want them).
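For reference, the kind of thing I mean looks roughly like this (made-up
sample data; a sketch, not a benchmark):

import io
import numpy as np

# Fixed-width sample: two 6-character float fields and a 12-character string.
sample = io.StringIO(
    "  1.5   2.2 STATION_0001\n"
    "  3.7   4.0 STATION_0002\n"
)
data = np.genfromtxt(
    sample,
    delimiter=[6, 6, 12],                        # field widths in characters
    dtype=[("x", "f4"), ("y", "f4"), ("name", "S12")],
    autostrip=True,                              # strip the padding whitespace
)
print(data["x"], data["name"])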

Cheers!
Ben Root
Daπid
2014-10-26 13:41:07 UTC
Permalink
Post by Jeff Reback
you should have a read here:
http://wesmckinney.com/blog/?p=543
going below the 2x memory usage on read-in is non-trivial and costly in
terms of performance
If you know in advance the number of rows (because it is in the header,
counted with wc -l, or any other prior information) you can preallocate the
array and fill in the numbers as you read, with virtually no overhead.

If the number of rows is unknown, an alternative is to use a chunked data
container like Bcolz [1] (formerly carray) instead of Python structures. It
may be used as such, or copied back to an ndarray if we want the memory to
be aligned. Including a bit of compression we can get the memory overhead
to somewhere under 2x (depending on the dataset), at the cost of not so
much CPU time, and this could be very useful for large data and slow
filesystems.


/David.

[1] http://bcolz.blosc.org/
Jeff Reback
2014-10-26 14:09:39 UTC
Permalink
you are describing a special case where you know the data size a priori (e.g. not streaming), dtypes are readily apparent from a small sample,
and in general your data is not messy

I would agree that if these can be satisfied then you can achieve closer to a 1x memory overhead

using bcolz is great but probably not a realistic option as a dependency for numpy (you should probably just memory-map it directly instead); though this has a big performance impact - so need to weigh these things

not all cases deserve the same treatment - chunking is often the best option IMHO - it provides constant memory usage (though ultimately still 2x); combined with memory mapping it can provide a fixed resource utilization
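as a rough illustration of the chunking + memory-mapping combination (an
untested sketch; the file names, the column count and the assumption of a
single header row over purely numeric columns are mine):

import numpy as np
import pandas as pd

def csv_to_memmap(csv_path, mmap_path, ncols, dtype=np.float64, chunksize=100000):
    # Count data rows first (minus one header line), then stream the CSV
    # into a disk-backed array so RAM usage stays bounded by the chunk size.
    with open(csv_path) as f:
        nrows = sum(1 for _ in f) - 1
    out = np.memmap(mmap_path, dtype=dtype, mode='w+', shape=(nrows, ncols))
    start = 0
    for chunk in pd.read_csv(csv_path, chunksize=chunksize, dtype=dtype):
        stop = start + len(chunk)
        out[start:stop] = chunk.values
        start = stop
    out.flush()
    return out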
Nathaniel Smith
2014-10-26 14:16:11 UTC
Permalink
Post by Jeff Reback
you should have a read here:
http://wesmckinney.com/blog/?p=543
going below the 2x memory usage on read-in is non-trivial and costly in terms of performance
On Linux you can probably go below 2x overhead easily, by exploiting the
fact that realloc on large memory blocks is basically O(1) (yes really):
http://blog.httrack.com/blog/2014/04/05/a-story-of-realloc-and-laziness/

Sadly OS X does not provide anything similar, and I can't tell for sure about
Windows.

Though on further thought, the numbers Wes quotes there aren't actually the
most informative - Massif will tell you how much virtual memory you have
allocated, but a lot of that is going to be a pure VM accounting trick. The
output array memory will actually be allocated incrementally one block at a
time as you fill it in. This means that if you can free each temporary
chunk immediately after you copy it into the output array, then even simple
approaches can have very low overhead. It's possible pandas's actual
overhead is already closer to 1x than 2x, and this is just hidden by the
tools Wes is using to measure it.
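(One quick way to check the practical overhead, as opposed to the
virtual-memory numbers, is to watch the peak RSS around the call; a
Unix-only sketch, with "big.txt" as a placeholder file:)

import resource
import sys
import numpy as np

def peak_rss_mib():
    # ru_maxrss is reported in KiB on Linux and in bytes on OS X.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024.0 if sys.platform.startswith("linux") else rss / 2.0**20

before = peak_rss_mib()
data = np.loadtxt("big.txt")       # placeholder for a large text file
after = peak_rss_mib()
print("array size:      %.1f MiB" % (data.nbytes / 2.0**20))
print("peak RSS growth: %.1f MiB" % (after - before))   # the real transient cost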

-n
Daniele Nicolodi
2014-10-26 16:42:32 UTC
Permalink
Post by Saullo Castro
I would like to start working on a memory efficient alternative for
np.loadtxt and np.genfromtxt that uses arrays instead of lists to store
the data while the file iterator is exhausted.
...
Post by Saullo Castro
I would be glad if you could share your experience on this matter.
I'm of the opinion that if your workflow requires you to regularly load
large arrays from text files, something else needs to be fixed rather
than numpy's speed and memory usage when reading data from text files.

There are a number of data formats that are interoperable and allow
storing data much more efficiently. HDF5 is one natural choice, maybe with
the Blosc compressor.
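For example, something along these lines (a minimal sketch using h5py with
gzip compression; Blosc would need an extra filter plugin such as hdf5plugin,
or PyTables):

import numpy as np
import h5py

data = np.random.rand(1000000, 8)        # stand-in for the parsed text data

with h5py.File("data.h5", "w") as f:
    # Chunked, compressed storage; reads back straight into an ndarray.
    f.create_dataset("samples", data=data, compression="gzip", chunks=True)

with h5py.File("data.h5", "r") as f:
    arr = f["samples"][:]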

Cheers,
Daniele
Saullo Castro
2014-10-26 18:27:54 UTC
Permalink
I agree with @Daniele's point, storing huge arrays in text files might
indicate a bad process... but if these functions can be improved, why
not? Unless it turns out to be a burden to change them.

Regarding the estimation of the array size, I don't see a big performance
loss in exhausting the file iterator once more in order to estimate
the number of rows and pre-allocate the proper arrays, avoiding lists
of lists. The hardest part seems to be dealing with arrays of strings
(perhaps easily solved with dtype=object) and structured arrays.
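A rough sketch of what I have in mind, including a structured dtype with a
string column (untested; the names are made up):

import numpy as np

def two_pass_read(fname, dtype, delimiter=None, comments="#"):
    # First pass counts the data rows, second pass fills a preallocated
    # structured array, so no intermediate list of lists is kept.
    def fields(f):
        for line in f:
            line = line.split(comments, 1)[0].strip()
            if line:
                yield tuple(line.split(delimiter))

    with open(fname) as f:
        nrows = sum(1 for _ in fields(f))

    out = np.empty(nrows, dtype=dtype)
    with open(fname) as f:
        for i, row in enumerate(fields(f)):
            out[i] = row   # each field is converted to its column's dtype
    return out

# e.g. two_pass_read("results.txt", dtype=[("t", "f8"), ("label", "S12")])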

Cheers,
Saullo