Discussion:
[Numpy-discussion] deprecate fromstring() for text reading?
Chris Barker
2015-10-22 17:03:15 UTC
Permalink
There was just a question about a bug/issue with scipy.fromstring (which is
numpy.fromstring) when used to read integers from a text file.

https://mail.scipy.org/pipermail/scipy-user/2015-October/036746.html

fromstring() is buggy and inflexible for reading text files -- and it is
a very, very ugly mess of code. I dug into it a while back, and gave up --
just too much of a mess!
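
For readers following along, here is a minimal sketch of reading integers from text without the text mode of fromstring(). The helper name parse_ints is hypothetical and not part of NumPy; it relies on Python's own int() so that a malformed token raises an error instead of being silently mishandled.

```python
import numpy as np


def parse_ints(text, sep=None):
    """Hypothetical helper: parse separated integers from a string.

    sep=None splits on any run of whitespace (including newlines);
    any token that int() cannot convert raises ValueError.
    """
    tokens = text.split(sep)
    # Skip empty tokens, e.g. those produced by a trailing separator.
    return np.array([int(tok) for tok in tokens if tok.strip()], dtype=np.int64)


whitespace_values = parse_ints("1 2 3\n4 5 6")
comma_values = parse_ints("7, 8, 9", sep=",")
```

This is slower than a C parser would be, but it fails loudly on bad input, which is the behaviour the thread finds lacking.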

So we really should completely re-implement it, or deprecate it. I doubt
anyone is going to do a big refactor, so that means deprecating it.

Also -- if we do want a fast function for reading numbers from text files
(which would be nice, actually), it really should get a new name anyway.

(and the hopefully coming new dtype system would make it easier to write
cleanly)

I'm not sure what deprecating something means, though -- have it raise a
deprecation warning in the next version?

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Marten van Kerkwijk
2015-10-22 22:35:28 UTC
Permalink
I think it would be good to keep the usage to read binary data at least. Or
is there a good alternative to `np.fromstring(<bytes>, dtype=...)`? --
Marten
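
One existing option for the binary case is np.frombuffer, which interprets a bytes-like object as a typed array. The sketch below assumes three little-endian 32-bit integers packed with the standard struct module; the variable names are illustrative only.

```python
import struct

import numpy as np

# Three little-endian 32-bit integers packed into a bytes object.
raw = struct.pack("<3i", 1, 2, 3)

# Interpret the bytes without copying; the dtype must match the byte layout.
arr = np.frombuffer(raw, dtype="<i4")

# The result shares memory with the immutable bytes object, so it is
# read-only; copy it if a writable array is needed.
writable = arr.copy()
```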
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Chris Barker - NOAA Federal
2015-10-22 23:47:30 UTC
Permalink
Post by Marten van Kerkwijk
I think it would be good to keep the usage to read binary data at least.


Agreed -- it's only the text file reading I'm proposing to deprecate. It
was kind of weird to cram it in there in the first place.

Oh, fromfile() has the same issues.

Chris


Charles R Harris
2015-10-23 22:13:02 UTC
Permalink
There was discussion at SciPy 2015 of separating out the text reading
abilities of Pandas so that numpy could include it. We should contact Jeff
Reback and see about moving that forward.

Chuck
Jeff Reback
2015-10-23 22:30:39 UTC
Permalink
IIRC Thomas Caswell was interested in doing this :)

Jeff
Nathaniel Smith
2015-10-23 22:49:06 UTC
Permalink
When he was in Berkeley a few weeks ago he assured me that every night
since SciPy he has dutifully been feeling guilty about not having done it
yet. I think this week his paltry excuse is that he's "on his honeymoon" or
something.

...which is to say that if someone has some spare cycles to take this over
then I think that might be a nice wedding present for him :-).

(The basic idea is to take the text reading backend behind pandas.read_csv
and extract it into a standalone package that pandas could depend on, and
that could also be used by other packages like numpy (among others -- I
think dato's SFrame package has a fork of this code as well?))

-n
Jeff Reback
2015-10-23 23:02:36 UTC
Permalink
I can certainly provide guidance on how/what to extract but don't have spare cycles myself for this :(
Chris Barker - NOAA Federal
2015-10-24 00:22:55 UTC
Permalink
Grabbing the pandas csv reader would be great, and I hope it happens sooner
rather than later, though alas, I haven't the spare cycles for it either.

In the meantime though, can we put a DeprecationWarning in when using
fromstring() on text files? It's really pretty broken.

-Chris
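
As a rough illustration of the proposal, the sketch below keeps binary input working and warns only on the text path. The function name fromstring_sketch and the warning message are hypothetical; this is not NumPy's actual implementation, and the text branch uses plain Python parsing as a stand-in.

```python
import warnings

import numpy as np


def fromstring_sketch(data, dtype=float, sep=""):
    """Hypothetical stand-in for the proposal, not NumPy's real code."""
    if sep == "":
        # Binary mode: interpret the raw bytes directly, no warning.
        return np.frombuffer(data, dtype=dtype)
    # Text mode: still works, but tells the caller it is on its way out.
    warnings.warn(
        "text-mode parsing is deprecated; use a dedicated text reader",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    tokens = [tok for tok in data.split(sep) if tok.strip()]
    return np.array([float(tok) for tok in tokens]).astype(dtype)
```

Python hides DeprecationWarning by default outside of `__main__` and test runners, so users would typically see it only when running with `-W default` or under a test suite.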
Benjamin Root
2015-10-27 14:30:08 UTC
Permalink
FWIW, when I needed a fast fixed-width reader for a very large dataset last
year, I found that np.genfromtxt() was faster than pandas' read_fwf().
IIRC, pandas' text reading code fell back to pure Python for fixed-width
scenarios.
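
For reference, np.genfromtxt accepts a sequence of integers as `delimiter`, which it treats as field widths. The sketch below uses made-up sample data with no whitespace guaranteed between fields; it says nothing about relative speed, which would need benchmarking on real data.

```python
import io

import numpy as np

# Made-up fixed-width records: field widths are 4, 3 and 2 characters,
# and adjacent fields may touch with no separator between them.
sample = "123456789\n   4  7 9\n   4567 9"

# A sequence of ints as `delimiter` selects fixed-width parsing.
table = np.genfromtxt(io.StringIO(sample), delimiter=(4, 3, 2))
```

The first row splits into 1234, 567 and 89, which a whitespace- or comma-delimited reader could not recover.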

Chris Barker
2015-11-02 23:44:06 UTC
Permalink
Post by Benjamin Root
FWIW, when I needed a fast Fixed Width reader
Was there potentially no whitespace between fields in that case? In which
case, it really is a different use case than delimited text -- if it's at
all common, a version written in C would be nice and fast, and not hard to
do.

But fromstring never would have helped you with that anyway :-)

-CHB
Benjamin Root
2015-11-03 14:59:59 UTC
Permalink
Correct, there were entries that would sometimes take up their entire
width. The delimited text readers could not read this particular dataset.
The dataset I am referring to is the processed ISD data:
https://www.ncdc.noaa.gov/isd

As for fromstring() not being able to help there, I didn't mean to imply
that it would. I was more aiming to point out a situation where NumPy's
text file reader was significantly better than the Pandas version, so we
would want to make sure that we properly benchmark any significant changes
to NumPy's text reading code. Who knows where else NumPy beats Pandas?

Ben
Chris Barker - NOAA Federal
2015-11-03 17:03:01 UTC
Permalink
Post by Benjamin Root
I was more aiming to point out a situation where NumPy's text file
reader was significantly better than the Pandas version, so we would want
to make sure that we properly benchmark any significant changes to NumPy's
text reading code. Who knows where else NumPy beats Pandas?
Indeed. For this example, I think a fixed-width reader really is a different
animal, and it's probably a good idea to have a high performance one in
Numpy. Among other things, you wouldn't want it to try to auto-determine
data types or anything like that.
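
For reference, a minimal sketch of the fixed-width case with NumPy as it stands: np.genfromtxt() accepts a sequence of integer field widths as `delimiter`, and passing `dtype` explicitly avoids any type guessing. The sample data and the names `data`, `widths` and `arr` are made up for illustration; this says nothing about performance.

```python
import io

import numpy as np

# Made-up sample: three fields of width 4, 3 and 2 characters,
# with no whitespace guaranteed between adjacent fields.
data = "123456789\n   4  7 9\n   4567 9"
widths = (4, 3, 2)

# A sequence of ints as `delimiter` means "fixed field widths".
# dtype is given explicitly, so nothing is auto-detected.
arr = np.genfromtxt(io.StringIO(data), delimiter=widths, dtype=float)
print(arr)
```

Each row is sliced at the same character offsets, so "123456789" splits into 1234, 567 and 89 even though no separator is present.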

I think what's on the table now is to bring in a new delimited reader --
I.e. CSV in its various flavors.

CHB


Ben
Post by Chris Barker
Post by Benjamin Root
FWIW, when I needed a fast Fixed Width reader
was there potentially no whitespace between fields in that case? In which
case, it really is a different use case than delimited text -- if it's at
all common, a version written in C would be nice and fast, and not hard to
do.
But fromstring never would have helped you with that anyway :-)
-CHB
Post by Benjamin Root
for a very large dataset last year, I found that np.genfromtxt() was
faster than pandas' read_fwf(). IIRC, pandas' text reading code fell back
to pure python for fixed width scenarios.
On Fri, Oct 23, 2015 at 8:22 PM, Chris Barker - NOAA Federal <
Post by Chris Barker - NOAA Federal
Grabbing the pandas csv reader would be great, and I hope it happens
sooner rather than later, though alas, I haven't the spare cycles for it either.
In the meantime though, can we put a deprecation warning in when using
fromstring() on text files? It's really pretty broken.
-Chris
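
As a rough illustration of reading integers from text without going through fromstring(), here is a sketch using plain Python tokenizing and np.loadtxt(). The sample string and the names `text`, `flat` and `table` are assumptions for the example, not anything proposed in this thread.

```python
import io

import numpy as np

# Made-up sample: whitespace-separated integers on two lines.
text = "1 2 3\n4 5 6\n"

# Option 1: split on whitespace and let int() do the parsing;
# a malformed token raises ValueError instead of being read silently.
flat = np.array([int(tok) for tok in text.split()], dtype=np.int64)

# Option 2: np.loadtxt() keeps the row/column layout of the text.
table = np.loadtxt(io.StringIO(text), dtype=np.int64)

print(flat)
print(table)
```

Neither option is fast on large inputs, which is the gap a dedicated text reader would be meant to fill.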
On Oct 23, 2015, at 6:13 PM, Charles R Harris <
Post by Charles R Harris
On Thu, Oct 22, 2015 at 5:47 PM, Chris Barker - NOAA Federal <
Post by Chris Barker - NOAA Federal
Post by Marten van Kerkwijk
I think it would be good to keep the usage to read binary data at
least.
Post by Charles R Harris
Post by Chris Barker - NOAA Federal
Agreed -- it's only the text file reading I'm proposing to
deprecate. It was kind of weird to cram it in there in the first place.
Post by Charles R Harris
Post by Chris Barker - NOAA Federal
Oh, fromfile() has the same issues.
Chris
Post by Marten van Kerkwijk
Or is there a good alternative to `np.fromstring(<bytes>,
dtype=...)`? -- Marten
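
On the binary side, one candidate to consider is np.frombuffer(), which interprets a bytes object as an array of a given dtype. A small sketch follows; the names `raw`, `view` and `owned` are illustrative only. Note that the result is a read-only view of the bytes, so a copy is needed before modifying it.

```python
import numpy as np

# Made-up binary payload: four int32 values in native byte order.
raw = np.arange(4, dtype=np.int32).tobytes()

# Interpret the bytes in place; no data is copied, and the array is
# read-only because bytes objects are immutable.
view = np.frombuffer(raw, dtype=np.int32)

# Take a copy when a writable array is needed.
owned = view.copy()
owned[0] = 42

print(view, owned)
```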
Derek Homeier
2015-11-04 20:00:19 UTC
Permalink
I was more aiming to point out a situation where the NumPy's text file reader was significantly better than the Pandas version, so we would want to make sure that we properly benchmark any significant changes to NumPy's text reading code. Who knows where else NumPy beats Pandas?
Indeed. For this example, I think a fixed-width reader really is a different animal, and it's probably a good idea to have a high performance one in Numpy. Among other things, you wouldn't want it to try to auto-determine data types or anything like that.
I think what's on the table now is to bring in a new delimited reader -- I.e. CSV in its various flavors.
To add my own handful of change or at least another data point, I had been looking into both
the pandas and the Astropy fast readers as a fast loadtxt/genfromtxt replacement; at the time
I found the Astropy cparser source somewhat easier to dig into, although looking now Pandas'
parser.pyx seems clear enough as well.
Some comparison of the two can be found at
http://astropy.readthedocs.org/en/stable/io/ascii/fast_ascii_io.html#speed-gains

Unfortunately the Astropy fast reader currently does not support fixed-width format either, and
adding this functionality would require modifications to the tokenizer C code - not sure how
extensive.

Cheers,
Derek
