[Numpy-discussion] using loadtxt to load a text file into a numpy array
Hedieh Ebrahimi
2014-01-15 10:12:42 UTC
Hello,

I am trying to use the following line of code:


fileContent = loadtxt(filePath, dtype=str)

in order to load a text file located at path = filePath into a numpy array
called fileContent.

I've simplified my file for the purpose of this question, but it looks
something like this:

File contents:

C:\Users\Documents\Project\mytextfile1.txt

C:\Users\Documents\Project\mytextfile2.txt

C:\Users\Documents\Project\mytextfile3.txt

I try to print my fileContent array after I read it, and it looks like this:

["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
"b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
"b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]

Why is this happening and how can I prevent it?
Also, if I have a line that starts like this in my file, Python will crash
on me. How can I fix this?


!--Timestep    (a line in the file starting with !--)

I guess it has something to do with the datatype. If I do not define the
datatype, it will be float by default, which gives me an error, and if I
define the datatype as string as I did above, then I get the problems
that I mentioned above.

I'd appreciate any help on how to fix this.

Thanks
Daπid
2014-01-15 10:25:26 UTC
On 15 January 2014 11:12, Hedieh Ebrahimi <***@amphos21.com> wrote:

> I try to print my fileContent array after I read it and it looks like this
> :
>
> ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>
> Why is this happening and how can I prevent it ?
> Also if I have a line that starts like this in my file, python will crash
> on me. how can i fix this ?
>

What is wrong with this case? If you are concerned about the multiple
backslashes, they are there because they are special symbols, and so they
have to be escaped (you actually want a backslash, not whatever else they
could mean).

Depending on what else is in the file, you may be better off reading the
file in pure python. Assuming there is nothing else, something like this
would work:

[line.strip() for line in open(filePath, 'r').readlines()]
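
A slightly more defensive version (just a sketch, assuming you also want to
skip the '!--' lines and end up with a numpy array):

import numpy as np

with open(filePath) as f:  # the file gets closed when the block ends
    lines = [line.strip() for line in f if not line.startswith('!--')]

fileContent = np.array(lines)  # on py3 this gives a unicode array, e.g. '<U42'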


/David.
Julian Taylor
2014-01-15 12:38:57 UTC
On 01/15/2014 11:25 AM, Daπid wrote:
> On 15 January 2014 11:12, Hedieh Ebrahimi <***@amphos21.com
> <mailto:***@amphos21.com>> wrote:
>
> I try to print my fileContent array after I read it and it looks
> like this :
>
> ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>
> Why is this happening and how can I prevent it ?
> Also if I have a line that starts like this in my file, python will
> crash on me. how can i fix this ?
>
>
> What is wrong with this case? If you are concerned about the multiple
> backslashes, they are there because they are special symbols, and so
> they have to be escaped (you actually want a backslash, not whatever
> else they could mean).
>

You have the bytes representation with doubled backslashes in it.
It's due to unicode strings in python3.
A workaround that only works for ascii is:

np.loadtxt(file, dtype=bytes).astype(str)

For non-ascii data I guess you should use python directly, as numpy would
also require a python loop with explicit decoding.
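
For example (a rough sketch, assuming a utf-8 encoded file):

import io
import numpy as np

with io.open(filePath, encoding='utf-8') as f:  # explicit decode to unicode
    fileContent = np.array([line.strip() for line in f])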


Currently handling strings in python3 with numpy is even worse than
before: you always have to go over bytes and do explicit decodes to get
python strings out of ascii data.

What we might need in numpy is new string dtypes specifying encodings, to
allow sane conversion to python3 strings without the excessive memory
usage of 4-byte unicode (ucs-4).
E.g. if it's ascii, reuse 'a' (which currently maps to bytes):

np.loadtxt(file, dtype='a')

For utf-8 data:

d = np.loadtxt(file, dtype='utf8')

so that type(d[0]) is unicode and not bytes (as is currently the case), if
you don't want to store your arrays with 4 bytes per character.
Julian Taylor
2014-01-15 12:43:50 UTC
On 01/15/2014 01:38 PM, Julian Taylor wrote:
> On 01/15/2014 11:25 AM, Daπid wrote:
>> On 15 January 2014 11:12, Hedieh Ebrahimi <***@amphos21.com
...
> for utf 8 data:
>
> d = np.loadtxt(file, dtype='utf8')
>

Oops, this is a very bad example: we can't have utf8, as it's variable
length. But we can have ascii and ucs-2 for lower-footprint encodings
with proper python string integration.
Chris Barker
2014-01-15 17:27:28 UTC
On Wed, Jan 15, 2014 at 4:38 AM, Julian Taylor <
***@googlemail.com> wrote:

> > I try to print my fileContent array after I read it and it looks
> > like this :
> >
> > ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>


> you have the bytes representation and a duplicate slash in it.
>

the duplicate slash confuses me, but I'm not running py3 to test, so...


> np.loadtxt(file, dtype=bytes).astype(str)
>
> for non ascii I guess you should use python directly as numpy would also
> require a python loop with explicit decoding.
>
> Currently handling strings in python3 with numpy is even worse than
> before, you always have to go over bytes and do explicit decodes to get
> python strings out of ascii data.
>

There is a MASSIVE set of threads on Python-dev about better support for
ASCII and ASCII+binary data in py3 -- but in the meantime, I think we have
two issues here that could be addressed:

1) loadtxt behavior -- it's a really, really common case for data files
suitable for loadtxt to be ascii, but they also could be another encoding
-- so loadtxt should have the option to specify the encoding (default to
ascii? or ascii-compatible?)

The trick here is handling both these cases correctly -- clearly loadtxt is
broken on py3 now. This example works fine under py2.

It seems to be reading the file as bytes, then passing those bytes off to a
unicode string (str in py3) without specifying an encoding, which I think
is how that b'...' junk gets in there.
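
You can reproduce the junk in one line on py3 (a quick check):

>>> str(b'C:\\Users')
"b'C:\\\\Users'"

i.e. calling str() on a bytes object gives you its repr -- b-prefix, quotes
and doubled backslashes included.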

note that: np.loadtxt('pathlist.txt', dtype=unicode) works fine on py2 as
well:

In [7]: np.loadtxt('pathlist.txt', dtype=unicode)
Out[7]:
array([u'C:\\Users\\Documents\\Project\\mytextfile1.txt',
       u'C:\\Users\\Documents\\Project\\mytextfile2.txt',
       u'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
      dtype='<U42')

which is what should happen in py3. So the internal loadtxt code must be
confusing bytes and unicode objects...

Anyway, this should work, and there should be an obvious way to spell it.

2) numpy string types -- it seems numpy already has both a string type
and a unicode type -- perhaps some re-naming or better documentation is in
order:
the string type 'S10', for example, should be clearly defined as 1 byte per
character, ascii-compatible.

I'm not sure how many bytes the unicode type has, but it may make sense to
be able to choose UCS-2 or UCS-4 -- though memory is cheap, I'd probably go
with UCS-4 and be done with it.
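
(For reference, numpy's unicode dtype is indeed 4 bytes per character:

>>> import numpy as np
>>> np.dtype('U10').itemsize
40
)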



--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Charles R Harris
2014-01-15 17:57:51 UTC
On Wed, Jan 15, 2014 at 10:27 AM, Chris Barker <***@noaa.gov> wrote:

> On Wed, Jan 15, 2014 at 4:38 AM, Julian Taylor <
> ***@googlemail.com> wrote:
>
>> > I try to print my fileContent array after I read it and it looks
>> > like this :
>> >
>> > ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
>> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
>> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>>
>
>
>> you have the bytes representation and a duplicate slash in it.
>>
>
> the duplicate slash confuses me, but I'm not running py3 to test, so...
>
>
>> np.loadtxt(file, dtype=bytes).astype(str)
>>
>> for non ascii I guess you should use python directly as numpy would also
>> require a python loop with explicit decoding.
>>
>> Currently handling strings in python3 with numpy is even worse than
>> before, you always have to go over bytes and do explicit decodes to get
>> python strings out of ascii data.
>>
>
> There is a MASSIVE set of threads on Python-dev about better support for
> ASCII and ASCII+binary data in py3 -- but in the meantime, I think we have
> two issues here that could be addressed:
>
> 1) loadtxt behavior -- it's a really, really common case for data files
> suitable for loadtxt to be ascii, but they also could be another encoding
> -- so loadtxt should have the option to specify the encoding (default to
> ascii? or ascii-compatible?)
>
> The trick here is handling both these cases correctly -- clearly loadtxt
> is broken on py3 now. This example works fine under py2.
>
> It seems to be reading the file as bytes, then passing those bytes off to
> a unicode string (str in py3), without specifying an encoding (which I
> think is how that b' ...'
> junk gets in there.
>
> note that: np.loadtxt('pathlist.txt', dtype=unicode) works fine on py2 as
> well:
>
> In [7]: np.loadtxt('pathlist.txt', dtype=unicode)
> Out[7]:
> array([u'C:\\Users\\Documents\\Project\\mytextfile1.txt',
> u'C:\\Users\\Documents\\Project\\mytextfile2.txt',
> u'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
> dtype='<U42')
>
> which is what should happen in py3. So the internal loadtxt code must be
> confusing bytes and unicode objects...
>
> Anyway, this should work, and there should be an obvious way to spell it.
>
> 2) numpy string types -- it seems numpy already has both a string type
> and unicode type -- perhaps some re-naming or better documentation is in
> order:
> the string type 'S10', for example, should be clearly defined as 1-byte
> per character ascii-compatible.
>
> I'm not sure how many bytes the unicode type has, but it may make sense to
> be able to choose UCS-2 or UCS-4 -- though memory is cheap, I'd probably go
> with UCS-4 and be done with it.
>

There was a discussion of this long ago and UCS-4 was chosen as the numpy
standard. There are just too many complications that arise in supporting
both.

Chuck
Julian Taylor
2014-01-15 18:25:31 UTC
On 15.01.2014 18:57, Charles R Harris wrote:
> ...
>
> There was a discussion of this long ago and UCS-4 was chosen as the
> numpy standard. There are just too many complications that arise in
> supporting both.
>

My guess is that that discussion was before python3, when you could still
simply treat bytes == string?

In python3 you need extra code to deal with arrays containing strings, as
the S type is interpreted as bytes, which is not a string type anymore [0].
Someone on irc (I think Freddie Witherden, CC'd) had a use case with huge
ascii tables in numpy which now have to be stored as 4-byte unicode on
disk or decoded from bytes all the time.

I personally don't use strings in arrays so I can neither judge the
impact nor the use, but it seems to me like at least having an ascii
dtype for python2<->python3 compatibility would be useful.
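
A small illustration of the inconvenience on py3 (a sketch):

>>> import numpy as np
>>> a = np.array(['abc', 'def'], dtype='S')
>>> a[0] == 'abc'             # bytes never compare equal to py3 str
False
>>> a[0].decode('ascii') == 'abc'
True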

[0] https://github.com/numpy/numpy/issues/4162
Chris Barker
2014-01-15 20:07:35 UTC
Julian -- beat me to it!

On Wed, Jan 15, 2014 at 10:25 AM, Julian Taylor <
***@googlemail.com> wrote:

> On 15.01.2014 18:57, Charles R Harris wrote:
> > There was a discussion of this long ago and UCS-4 was chosen as the
> > numpy standard. There are just too many complications that arise in
> > supporting both.
>

supporting both UCS-4 and UCS-2 would be more pain than it's worth.


> In python3 you need extra code to deal with arrays containing strings as
> the S type is interpreted as bytes which is not a string type anymore [0].
>

ouch! I was just assuming that it still was -- yes, I really think we need
a one-byte-per-char string type -- probably ascii, but we could do latin-1
and let the buyer beware of the higher value bytes.

> Someone on irc (I think Freddie Witherden CC'd) had a use case with huge
> ascii tables in numpy which now have to be stored as 4 bytes unicode on
> disk or decode bytes all the time.
>

and ascii data is not the least bit rare in the science world in
particular.


> I personally don't use strings in arrays so I can neither judge the
> impact nor the use, but it seems to me like at least having an ascii
> dtype for python2<->python3 compatibility would be useful.
>

I think py2<->py3 compatibility is a separate issue -- we should have this
if it's a good thing to have, not because of that. And it is a good thing
to have.

And since this is a new thread -- regardless of the decision on this,
loadtxt is broken -- we certainly should be able to parse ascii text and
return something reasonable -- unicode strings would have been fine in the
OP's case, if they didn't have the extra bytes-to-string crap in them.


[0] https://github.com/numpy/numpy/issues/4162


from that:

The transition towards split string/bytes types in Python 3 has the
unfortunate side effect of breaking the following snippet:

np.array("Hello", dtype="|S").item() == "Hello"
Sorry for not testing in py3, but this makes it look like the "S" dtype is
one-byte-per-char strings, but creates a bytes object, rather than a
unicode (py3 str) object.

As in my other note, I think it would be better to have it return a unicode
string by default.

But it looks like you can still use it to store large quantities of ascii
data if you want.
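
For example, picking up Julian's earlier workaround (a sketch):

import numpy as np

a = np.loadtxt('pathlist.txt', dtype=bytes)  # compact: 1 byte per character
s = a.astype(str)                            # decode to py3 strings when needed
# np.char.decode(a, 'ascii') should also work if you want an explicit codec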

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Chris Barker
2014-01-15 19:40:58 UTC
On Wed, Jan 15, 2014 at 9:57 AM, Charles R Harris <***@gmail.com
> wrote:


> There was a discussion of this long ago and UCS-4 was chosen as the numpy
> standard. There are just too many complications that arise in supporting
> both.
>

fair enough -- but loadtxt appears to be broken just the same. Any
proposals for that?

My proposal:

loadtxt accepts an encoding argument.

default is ascii -- that's what it's doing now, anyway, yes?

If the file is encoded ascii, then a one-byte-per-character dtype is used
for text data, unless the user specifies otherwise (do they need to specify
anyway?)

If the file has another encoding, then the default dtype for text is unicode.

Not sure about other one-byte per character encodings (e.g. latin-1)

The defaults may be moot, if loadtxt doesn't have auto-detection of
text in a file anyway.

This all requires that there be an obvious way for the user to spell the
one-byte-per-character dtype -- I think 'S' will do it.

Note to OP: what happens if you specify 'S' for your dtype, rather than
str? It works for me on py2:

In [16]: np.loadtxt('pathlist.txt', dtype='S')
Out[16]:
array(['C:\\Users\\Documents\\Project\\mytextfile1.txt',
       'C:\\Users\\Documents\\Project\\mytextfile2.txt',
       'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
      dtype='|S42')

Note: this leaves us with what to pass back to the user when they index
into an array of type 'S*' -- a bytes object or a unicode object (decoded
as ascii). I think a unicode object, in keeping with proper py3 behavior.
This would be like what we currently do with, say, floating point numbers:

We can store/operate with 32-bit floats, but when you pass one back as a
python type, you get the native python float -- 64-bit.

NOTE: another option is to use latin-1 all around, rather than ascii -- you
may get garbage from the higher value bytes, but it won't barf on you.
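
(latin-1 assigns a character to every one of the 256 byte values, so the
round-trip is lossless -- a quick check:

>>> b'\xc5\xe5'.decode('latin-1').encode('latin-1')
b'\xc5\xe5'

whereas .decode('ascii') on the same bytes raises UnicodeDecodeError.)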

-Chris



--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Oscar Benjamin
2014-01-16 10:43:05 UTC
On Wed, Jan 15, 2014 at 11:40:58AM -0800, Chris Barker wrote:
> On Wed, Jan 15, 2014 at 9:57 AM, Charles R Harris <***@gmail.com
> > wrote:
>
>
> > There was a discussion of this long ago and UCS-4 was chosen as the numpy
> > standard. There are just too many complications that arise in supporting
> > both.
> >
>
> fair enough -- but loadtxt appears to be broken just the same. Any
> proposals for that?
>
> My proposal:
>
> loadtxt accepts an encoding argument.
>
> default is ascii -- that's what it's doing now, anyway, yes?

No: it's loading the file, reading a line, encoding the line with latin-1, and
then putting the repr of the resulting byte-string as a unicode string into a
UCS-4 array (dtype='<Ux'). I can't see any good reason for that behaviour.

>
> If the file is encoded ascii, then a one-byte-per character dtype is used
> for text data, unless the user specifies otherwise (do they need to specify
> anyway?)
>
> If the file has another encoding, then the default dtype for text is unicode.

That's a silly idea. There's already the dtype='S' for ascii that will give
one byte per character.

However numpy.loadtxt(dtype='S') doesn't actually use ascii, IIUC. It loads
the file as text with the default system encoding, encodes the text with
latin-1 and stores the resulting bytes into a dtype='S' array. I think it
should just open the file in binary, read the bytes and store them in the
dtype='S' array. The current behaviour strikes me as a hangover from the
Python 2.x 8-bit text model.
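
Done by hand, that would be something like this sketch, which preserves the
bytes exactly:

import numpy as np

with open('pathlist.txt', 'rb') as f:  # binary: no decoding at all
    a = np.array(f.read().split())     # list of byte strings -> dtype='S...'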

> Not sure about other one-byte per character encodings (e.g. latin-1)
>
> The defaults may be moot, if the loadtxt doesn't have auto-detection of
> text in a filie anyway.
>
> This all required that there be an obvious way for the user to spell the
> one-byte-per character dtype -- I think 'S' will do it.

They should use 'S' and not encoding='ascii'. If the user provides an encoding
then it should be used to open the file and decode it to unicode resulting in
a dtype='U' array. (Python 3 handles this all for you).

> Note to OP: what happens if you specify 'S' for your dtype, rather than str
> - it works for me on py2:
>
> In [16]: np.loadtxt('pathlist.txt', dtype='S')
> Out[16]:
> array(['C:\\Users\\Documents\\Project\\mytextfile1.txt',
> 'C:\\Users\\Documents\\Project\\mytextfile2.txt',
> 'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
> dtype='|S42')

It only seems to work because you're using ascii data. On Py3 you'll have byte
strings corresponding to the text in the file encoded as latin-1 (regardless
of the encoding used in the file). loadtxt doesn't open the file in binary or
specify an encoding so the file will be opened with the system default
encoding as determined by the standard builtins.open. The resulting text is
decoded according to that encoding and then reencoded as latin-1 which will
corrupt the binary form of the data if the system encoding is not compatible
with latin-1 (e.g. ascii and latin-1 will work but utf-8 will not).

>
> Note: this leaves us with what to pass back to the user when they index
> into an array of type 'S*' -- a bytes object or a unicode object (decoded
> as ascii). I think a unicode object, in keeping with proper py3 behavior.
> This would be like we currently do with, say floating point numbers:
>
> We can store/operate with 32 bit floats, but when you pass it back as a
> python type, you get the native python float -- 64bit.
>
> NOTE: another option is to use latin-1 all around, rather than ascii -- you
> may get garbage from the higher value bytes, but it won't barf on you.

I guess you're alluding to the idea that reading/writing files as latin-1 will
pretend to seamlessly decode/encode any bytes preserving binary data in any
round-trip. This concept is already broken if you intend to do any processing,
indexing or slicing of the array. Additionally the current loadtxt behaviour
fails to achieve this round-trip even for the 'S' dtype even if you don't do
any processing:

$ ipython3
Python 3.2.3 (default, Sep 25 2013, 18:22:43)
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: with open('tmp.py', 'w') as fout:  # implicitly utf-8 here
   ...:     fout.write('Åå\n' * 3)
   ...:

In [2]: import numpy

In [3]: a = numpy.loadtxt('tmp.py')
<snip>
ValueError: could not convert string to float: b'\xc5\xe5'

In [4]: a = numpy.loadtxt('tmp.py', dtype='S')

In [5]: a
Out[5]:
array([b'\xc5\xe5', b'\xc5\xe5', b'\xc5\xe5'],
      dtype='|S2')

In [6]: a.tostring()
Out[6]: b'\xc5\xe5\xc5\xe5\xc5\xe5'

In [7]: with open('tmp.py', 'rb') as fin:
   ...:     text = fin.read()
   ...:

In [8]: text
Out[8]: b'\xc3\x85\xc3\xa5\n\xc3\x85\xc3\xa5\n\xc3\x85\xc3\xa5\n'


This is a mess. I don't know about how to handle backwards compatibility but
the sensible way to handle this in *both* Python 2 and 3 is that dtype='S'
opens the file in binary, reads byte strings, and stores them in an array with
dtype='S'. dtype='U' should open the file as text with an encoding argument
(or system default if not supplied), decode the bytes and create an array with
dtype='U'. The only reasonable difference between Python 2 and 3 is which of
these two behaviours dtype=str should do.


Oscar
Chris Barker
2014-01-16 17:08:38 UTC
On Thu, Jan 16, 2014 at 2:43 AM, Oscar Benjamin
<***@gmail.com> wrote:

> > My proposal:
> >
> > loadtxt accepts an encoding argument.
> >
> > default is ascii -- that's what it's doing now, anyway, yes?
>
> No it's loading the file reading a line, encoding the line with latin-1,
> and
> then putting the repr of the resulting byte-string as a unicode string
> into a
> UCS-4 array (dtype='<Ux'). I can't see any good reason for that behaviour.


agreed -- really odd. If we're going to assume latin-1 -- why not put the
decoded unicode string in the array?

But what about parsing numbers? latin-1 decoded to a unicode object, then
parsed? Reasonable enough.

> If the file is encoded ascii, then a one-byte-per character dtype is used
> > for text data, unless the user specifies otherwise (do they need to
> specify
> > anyway?)
> >
> > If the file has another encoding, the the default dtype for text is
> unicode.
>
> That's a silly idea. There's already the dtype='S' for ascii that will give
> one byte per character.
>

Except that 'S' is being translated to a bytes object, and in py3 bytes is
not really text -- see the other thread.

However numpy.loadtxt(dtype='S') doesn't actually use ascii IIUC. It loads
> the file as text with the default system encoding,


not such a bad idea in principle, but I think with scientific data files in
particular, the file was just as likely generated on a different system, so
system settings should be avoided. My guess is that a large fraction of
systems have system encodings that are ascii-compatible, so we'll get away
with this most of the time, but explicit is better than implicit, and all
that.

encodes the text with
> latin-1 and stores the resulting bytes into a dtype='S' array. I think it
> should just open the file in binary read the bytes and store them in the
> dtype='S' array. The current behaviour strikes me as a hangover from the
> Python 2.x 8-bit text model.
>

not sure it's even that -- I suspect it's a broken attempt to match the py3
text model...

> Not sure about other one-byte per character encodings (e.g. latin-1) The
> defaults may be moot, if the loadtxt doesn't have auto-detection of text in
> a filie anyway.
>

I'm not suggesting auto-detection, but I am suggesting the ability to
specify an encoding, and in that case, we need a default, and I don't think
it should be the system encoding.

> This all required that there be an obvious way for the user to spell the
> > one-byte-per character dtype -- I think 'S' will do it.
>
> They should use 'S' and not encoding='ascii'.


that is stating implicitly that 'S' is ascii-compatible, but it gets
translated to the py3 bytes type, which the python dev folks REALLY want to
mean "arbitrary bytes", rather than 'ascii text'.

practically, it means you need to decode it to use it as text -- compare
with a string, etc...

If the user provides an encoding
> then it should be used to open the file and decode it to unicode resulting
> in
> a dtype='U' array. (Python 3 handles this all for you).


I think it may be an important use case to pull ascii-compatible text out of
a file and put it into a 1-byte-per-character dtype (i.e. 'S'). Folks
don't necessarily want or need 4 bytes per character.

In practice this probably only makes sense if the file is in an
ascii-compatible encoding anyway, but I like the idea of keeping the file
encoding and the dtype independent.

It only seems to work because you're using ascii data.
>

(or latin-1?) Well, yes, but that was the OP's example. Though it was file
names, so he'd probably ultimately want them as py3 strings...


> which will
> corrupt the binary form of the data if the system encoding is not
> compatible
> with latin-1 (e.g. ascii and latin-1 will work but utf-8 will not).


a good reason not to use the system default encoding!

> NOTE: another option is to use latin-1 all around, rather than ascii --
> you
> > may get garbage from the higher value bytes, but it won't barf on you.
>
> I guess you're alluding to the idea that reading/writing files as latin-1
> will
> pretend to seamlessly decode/encode any bytes preserving binary data in any
> round-trip.


yes, exactly -- a practical common use case is that there are non-ascii
bytes in a data stream, but the use-case doesn't care what they are. If you
use ascii, then you get exceptions you don't need to get.


> This concept is already broken if you intend to do any processing,
> indexing or slicing of the array.


no it's not -- latin-1 is ascii-compatible (as is utf-8), so a lot
of processing will work fine -- splitting on whitespace or whatever, etc.

yes, indexing can go to heck if you have utf-8 or, of course, non-ascii
compatible encoding -- but that's never going to work without specifying an
encoding anyway.


> Additionally the current loadtxt behaviour
> fails to achieve this round-trip even for the 'S' dtype even if you don't
> do
> any processing:
>

right -- I think we agree that it's broken now.

This is a mess. I don't know about how to handle backwards compatibility but
> the sensible way to handle this in *both* Python 2 and 3 is that dtype='S'
> opens the file in binary, reads byte strings, and stores them in an array
> with
> dtype='S'. dtype='U' should open the file as text with an encoding argument
> (or system default if not supplied), decode the bytes and create an array
> with
> dtype='U'.


agreed -- except for the system encoding part....


> The only reasonable difference between Python 2 and 3 is which of
> these two behaviours dtype=str should do.


well, str is a py3 string in py3 -- so it should be dtype 'U'. Personally,
I avoid using the native types for dtype arguments anyway, so users should
use:

dtype=np.unicode
or
dtype=np.string0 (or np.string_) -- or????

How do you spell the dtype that 'S' gives you????

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Julian Taylor
2014-01-17 09:38:15 UTC
This thread is getting a little out of hand, which is my fault for initially
mixing different topics in one mail, so let me try to summarize.
We have three issues here:

- a loadtxt bug when loading strings in python3
This has nothing to do with encodings or dtypes; it is a bug that should be
fixed. Not more, not less.

The fix is probably removing a repr() somewhere and converting the data to
unicode as the user requested, since str == unicode in py3; this is the
normal change you must account for when migrating to py3.

- no possibility to specify the encoding of a file in loadtxt
This is a missing feature; currently it uses the system default, which is
good and should stay that way.
It is only missing an option to tell it to treat the file differently.
There should be little debate about changing the default, especially not to
latin1. The system default exists for a good reason. Note that on linux it
is UTF-8, which is a good choice. I'm not familiar with windows, but all
programs should at least have the option to use UTF-8 as output too.
This has nothing to do with indexing or any kind of processing of the numpy
arrays.

The fix should be trivial to do: just add an encoding keyword argument and
pass it on to python.
The workaround is to pass a file object to loadtxt instead of a file
name; Python file objects already have the encoding argument.
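
Something along these lines (an untested sketch of that workaround, with a
made-up file name):

import io
import numpy as np

with io.open('data.txt', encoding='utf-8') as f:  # the file object decodes
    a = np.loadtxt(f, dtype=bytes).astype(str)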

- inconvenience in dealing with strings in python 3.
Bytes are not strings in python3, which means ascii data is either a byte
array (which can be inconvenient to deal with) or 4-byte unicode (which
wastes space).
A proposal to fix this would be to add a one- or two-byte dtype with a
specific encoding that behaves similarly to bytes but converts to string
when crossing over to python for comparisons etc.
For backward compatibility we *cannot* change S. Maybe we could change the
meaning of 'a', but it would be safer to add a new dtype; possibly 'S' can
be deprecated in favor of 'B' once we have a specific encoding dtype.
The main issue is probably: is it worth it and who does the work?
Pauli Virtanen
2014-01-17 10:59:27 UTC
Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
[clip]
> - inconvenience in dealing with strings in python 3.
>
> bytes are not strings in python3 which means ascii data is either a byte
> array which can be inconvenient to deal with or 4 byte unicode which
> wastes space.
>
> A proposal to fix this would be to add a one or two byte dtype with a specific
> encoding that behaves similar to bytes but converts to string when outputting
> to python for comparisons etc.
>
> For backward compatibility we *cannot* change S. Maybe we could change
> the meaning of 'a' but it would be safer to add a new dtype, possibly
> 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype.
>
> The main issue is probably: is it worth it and who does the work?

I don't think this is a good idea: the bytes vs. unicode separation in
Python 3 exists for a good reason. If unicode is not needed, why not just
use the bytes data type throughout the program?

(Also, assuming that ASCII is in general good for text-format data is
quite US-centric.)

Christopher Barker wrote:
>
> How do you spell the dtype that 'S' gives you????
>

'S' is bytes.

dtype='S', dtype=bytes, and dtype=np.bytes_ are all equivalent.
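
A quick check of the equivalence:

>>> import numpy as np
>>> np.dtype('S5') == np.dtype((np.bytes_, 5))
True
>>> type(np.array([b'abc'], dtype='S')[0])
<class 'numpy.bytes_'>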

--
Pauli Virtanen
j***@gmail.com
2014-01-17 12:35:42 UTC
On Fri, Jan 17, 2014 at 5:59 AM, Pauli Virtanen <***@iki.fi> wrote:
> Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> [clip]
>> - inconvenience in dealing with strings in python 3.
>>
>> bytes are not strings in python3 which means ascii data is either a byte
>> array which can be inconvenient to deal with or 4 byte unicode which
>> wastes space.
>>
>> A proposal to fix this would be to add a one or two byte dtype with a specific
>> encoding that behaves similar to bytes but converts to string when outputting
>> to python for comparisons etc.
>>
>> For backward compatibility we *cannot* change S. Maybe we could change
>> the meaning of 'a' but it would be safer to add a new dtype, possibly
>> 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype.
>>
>> The main issue is probably: is it worth it and who does the work?
>
> I don't think this is a good idea: the bytes vs. unicode separation in
> Python 3 exists for a good reason. If unicode is not needed, why not just
> use the bytes data type throughout the program?
>
> (Also, assuming that ASCII is in general good for text-format data is
> quite US-centric.)
>
> Christopher Barker wrote:
>>
>> How do you spell the dtype that 'S' gives you????
>>
>
> 'S' is bytes.
>
> dtype='S', dtype=bytes, and dtype=np.bytes_ are all equivalent.


'S' is bytes is a feature, not a bug, I thought.

I didn't pay much attention to the two threads because I don't use
loadtxt. But I think the same issue is in genfromtxt, recfromtxt, ...

I don't have a lot of experience with python 3, but in the initial
python 3 compatibility conversion of statsmodels, I followed numpy's
lead and used the numpy helper functions and converted all strings to
bytes.

Everything loaded by genfromtxt or similar reads bytes; files are
opened with "rb".

In most places our code doesn't really care, as long as numpy.unique
and similar work either way. But in some cases there were some strange
things when working with bytes.

There are also some weirder cases with non-ASCII "strings", and I also
have problems in interactive work when the interpreter encoding
interferes.
Also maybe related: our Stata data file reader genfromdta handles
cyrillic languages (Russian IIRC) in the same way as ascii. I don't
know the details, but Skipper fixed a bug so it works.

I'm pretty sure interaction statsmodels/pandas/patsy has problems/bugs
with non-ASCII support in variable names, but my impression is that
string data as bytes causes few problems.


Josef

Oscar Benjamin
2014-01-17 12:44:16 UTC
On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> [clip]
> > - inconvenience in dealing with strings in python 3.
> >
> > bytes are not strings in python3 which means ascii data is either a byte
> > array which can be inconvenient to deal with or 4 byte unicode which
> > wastes space.

It doesn't waste that much space in practice. People have been happily using
Python 2's 4-byte-per-char unicode string on wide builds (e.g. on Linux) for
years in all kinds of text heavy applications.

$ python2
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(u'a' * 1000)
4052

> > For backward compatibility we *cannot* change S.

Do you mean to say that loadtxt cannot be changed from decoding using system
default, splitting on newlines and whitespace and then encoding the substrings
as latin-1?

An obvious improvement would be along the lines of what Chris Barker
suggested: decode as latin-1, do the processing and then reencode as latin-1.
Or just open the file in binary and use the bytes string methods. Either of
these has the advantage that it won't corrupt the binary representation of the
data - assuming ascii-compatible whitespace and newlines (e.g. utf-8 and most
currently used 8-bit encodings).

In the situations where the current behaviour differs from this the user
*definitely* has mojibake. Can anyone possibly be relying on that (except in
the sense of having implemented a workaround that would break if it was
fixed)?

> > Maybe we could change
> > the meaning of 'a' but it would be safer to add a new dtype, possibly
> > 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype.
> >
> > The main issue is probably: is it worth it and who does the work?
>
> I don't think this is a good idea: the bytes vs. unicode separation in
> Python 3 exists for a good reason. If unicode is not needed, why not just
> use the bytes data type throughout the program?

Or on the other hand, why try to use bytes when you're clearly dealing with
text data?

If you're concerned about memory usage, why not use Python strings? As of
CPython 3.3, strings consisting only of latin-1 characters are stored with 1
byte per char. This is only really sensible for immutable strings with an
opaque memory representation though, so numpy shouldn't try to copy it.

> (Also, assuming that ASCII is in general good for text-format data is
> quite US-centric.)

Indeed. The original use case in this thread was a text file containing file
paths. In most of the world there's a reasonable chance that file paths can
contain non-ascii characters. The current behaviour of decoding using one
codec and encoding with latin-1 would, in many cases, break if the user tried
to e.g. open() a file using a byte-string from the array.


Oscar
Julian Taylor
2014-01-17 13:10:19 UTC
On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
<***@gmail.com> wrote:

> On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> > [clip]
>

> > > For backward compatibility we *cannot* change S.
>
> Do you mean to say that loadtxt cannot be changed from decoding using
> system
> default, splitting on newlines and whitespace and then encoding the
> substrings
> as latin-1?
>

unicode dtypes have nothing to do with the loadtxt issue. They are not
related.


>
> An obvious improvement would be along the lines of what Chris Barker
> suggested: decode as latin-1, do the processing and then reencode as
> latin-1.
>

no, the right solution is to add an encoding argument.
It's a 4-line patch for python2 and a 2-line patch for python3 and the issue
is solved; I'll file a PR later.

No latin1 de/encoding is required for anything; I don't know why you would
want to do that in this context.
Does opening latin1 files even work with current loadtxt?
It currently uses UTF-8 which is to my knowledge not compatible with latin1.
Julian Taylor
2014-01-17 13:31:32 UTC
On Fri, Jan 17, 2014 at 2:10 PM, Julian Taylor <
***@googlemail.com> wrote:

> On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin <
> ***@gmail.com> wrote:...
> ...
> No latin1 de/encoding is required for anything, I don't know why you would
> want do to that in this context.
> Does opening latin1 files even work with current loadtxt?
> It currently uses UTF-8 which is to my knowledge not compatible with
> latin1.
>

Just tried it; it doesn't work, so there is nothing we need to keep working:

import codecs
import numpy as np

f = codecs.open('test.txt', 'wt', encoding='latin1')
f.write(u'Öö\n')
f.close()

np.loadtxt('test.txt')

ValueError: could not convert string to float: ᅵᅵ
or UnicodeDecodeError: if provided with unicode dtype

there are a couple more unicode issues in the text loading (it converts to
bytes even if unicode is requested), but they look simple to fix.
Oscar Benjamin
2014-01-17 13:40:34 UTC
On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
> <***@gmail.com> wrote:
>
> > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> > > [clip]
> >
>
> > > > For backward compatibility we *cannot* change S.
> >
> > Do you mean to say that loadtxt cannot be changed from decoding using
> > system
> > default, splitting on newlines and whitespace and then encoding the
> > substrings
> > as latin-1?
> >
>
> unicode dtypes have nothing to do with the loadtxt issue. They are not
> related.

I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
if the file is not encoded as ascii or latin-1 then the byte strings are
corrupted (see below).

This is because loadtxt opens the file with the default system encoding (by
not explicitly specifying an encoding):
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732

It then processes each line with asbytes() which encodes them as latin-1:
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

Being an English speaker I don't normally use non-ascii characters in
filenames but my system (Ubuntu Linux) still uses utf-8 rather than latin-1
or ascii (and rightly so!).

> >
> > An obvious improvement would be along the lines of what Chris Barker
> > suggested: decode as latin-1, do the processing and then reencode as
> > latin-1.
> >
>
> no, the right solution is to add an encoding argument.
> Its a 4 line patch for python2 and a 2 line patch for python3 and the issue
> is solved, I'll file a PR later.

What is the encoding argument for? Is it to be used to decode, process the
text and then re-encode it for an array with dtype='S'?

Note that there are two encodings: one for reading from the file and one for
storing in the array. The former describes the content of the file and the
latter will be used if I extract a byte-string from the array and pass it to
any Python API.

> No latin1 de/encoding is required for anything, I don't know why you would
> want do to that in this context.
> Does opening latin1 files even work with current loadtxt?

It's the only encoding that works for dtype='S'.

> It currently uses UTF-8 which is to my knowledge not compatible with latin1.

It uses utf-8 (on my system) to read and latin-1 (on any system) to encode and
store in the array, corrupting any non-ascii characters. Here's a
demonstration:

$ ipython3
Python 3.2.3 (default, Sep 25 2013, 18:22:43)
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: with open('Õscar.txt', 'w') as fout: pass

In [2]: import os

In [3]: os.listdir('.')
Out[3]: ['Õscar.txt']

In [4]: with open('filenames.txt', 'w') as fout:
   ...:     fout.writelines([f + '\n' for f in os.listdir('.')])
   ...:

In [5]: with open('filenames.txt') as fin:
   ...:     print(fin.read())
   ...:
filenames.txt
Õscar.txt


In [6]: import numpy

In [7]: filenames = numpy.loadtxt('filenames.txt')
<snip>
ValueError: could not convert string to float: b'filenames.txt'

In [8]: filenames = numpy.loadtxt('filenames.txt', dtype='S')

In [9]: filenames
Out[9]:
array([b'filenames.txt', b'\xd5scar.txt'],
      dtype='|S13')

In [10]: open(filenames[1])
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
/users/enojb/.rcs/tmp/<ipython-input-10-3bf2418688a2> in <module>()
----> 1 open(filenames[1])

IOError: [Errno 2] No such file or directory: '\udcd5scar.txt'

In [11]: open('Õscar.txt'.encode('utf-8'))
Out[11]: <_io.TextIOWrapper name=b'\xc3\x95scar.txt' mode='r' encoding='UTF-8'>


Oscar
j***@gmail.com
2014-01-17 14:11:22 UTC
On Fri, Jan 17, 2014 at 8:40 AM, Oscar Benjamin
<***@gmail.com> wrote:
> On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
>> On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
>> <***@gmail.com> wrote:
>>
>> > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
>> > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
>> > > [clip]
>> >
>>
>> > > > For backward compatibility we *cannot* change S.
>> >
>> > Do you mean to say that loadtxt cannot be changed from decoding using
>> > system
>> > default, splitting on newlines and whitespace and then encoding the
>> > substrings
>> > as latin-1?
>> >
>>
>> unicode dtypes have nothing to do with the loadtxt issue. They are not
>> related.
>
> I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
> if the file is not encoded as ascii or latin-1 then the byte strings are
> corrupted (see below).
>
> This is because loadtxt opens the file with the default system encoding (by
> not explicitly specifying an encoding):
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>
> It then processes each line with asbytes() which encodes them as latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
> Being an English speaker I don't normally use non-ascii characters in
> filenames but my system (Ubuntu Linux) still uses utf-8 rather than latin-1 or
> (and rightly so!).
>
>> >
>> > An obvious improvement would be along the lines of what Chris Barker
>> > suggested: decode as latin-1, do the processing and then reencode as
>> > latin-1.
>> >
>>
>> no, the right solution is to add an encoding argument.
>> Its a 4 line patch for python2 and a 2 line patch for python3 and the issue
>> is solved, I'll file a PR later.
>
> What is the encoding argument for? Is it to be used to decode, process the
> text and then re-encode it for an array with dtype='S'?
>
> Note that there are two encodings: one for reading from the file and one for
> storing in the array. The former describes the content of the file and the
> latter will be used if I extract a byte-string from the array and pass it to
> any Python API.
>
>> No latin1 de/encoding is required for anything, I don't know why you would
>> want do to that in this context.
>> Does opening latin1 files even work with current loadtxt?
>
> It's the only encoding that works for dtype='S'.
>
>> It currently uses UTF-8 which is to my knowledge not compatible with latin1.
>
> It uses utf-8 (on my system) to read and latin-1 (on any system) to encode and
> store in the array, corrupting any non-ascii characters. Here's a
> demonstration:
>
> $ ipython3
> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
> Type "copyright", "credits" or "license" for more information.
>
> IPython 0.12.1 -- An enhanced Interactive Python.
> ? -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help -> Python's own help system.
> object? -> Details about 'object', use 'object??' for extra details.
>
> In [1]: with open('Õscar.txt', 'w') as fout: pass
>
> In [2]: import os
>
> In [3]: os.listdir('.')
> Out[3]: ['Õscar.txt']
>
> In [4]: with open('filenames.txt', 'w') as fout:
> ...: fout.writelines([f + '\n' for f in os.listdir('.')])
> ...:
>
> In [5]: with open('filenames.txt') as fin:
> ...: print(fin.read())
> ...:
> filenames.txt
> Õscar.txt
>
>
> In [6]: import numpy
>
> In [7]: filenames = numpy.loadtxt('filenames.txt')
> <snip>
> ValueError: could not convert string to float: b'filenames.txt'
>
> In [8]: filenames = numpy.loadtxt('filenames.txt', dtype='S')
>
> In [9]: filenames
> Out[9]:
> array([b'filenames.txt', b'\xd5scar.txt'],
> dtype='|S13')
>
> In [10]: open(filenames[1])
> ---------------------------------------------------------------------------
> IOError Traceback (most recent call last)
> /users/enojb/.rcs/tmp/<ipython-input-10-3bf2418688a2> in <module>()
> ----> 1 open(filenames[1])
>
> IOError: [Errno 2] No such file or directory: '\udcd5scar.txt'
>
> In [11]: open('Õscar.txt'.encode('utf-8'))
> Out[11]: <_io.TextIOWrapper name=b'\xc3\x95scar.txt' mode='r' encoding='UTF-8'>

Windows seems to use consistent en/decoding throughout (example run in IDLE)

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32

>>> filenames = numpy.loadtxt('filenames.txt', dtype='S')
>>> filenames
array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py',
       b'\xd5scar.txt'],
      dtype='|S18')
>>> fn = open(filenames[-1])
>>> fn.read()
'1,2,3,hello\n5,6,7,Õscar\n'
>>> fn
<_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'>

Josef

>
>
> Oscar
Julian Taylor
2014-01-17 14:12:32 UTC
On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
<***@gmail.com> wrote:

> On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
> > <***@gmail.com> wrote:
> >
> > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> > > > [clip]
> > >
> >
> > > > > For backward compatibility we *cannot* change S.
> > >
> > > Do you mean to say that loadtxt cannot be changed from decoding using
> > > system
> > > default, splitting on newlines and whitespace and then encoding the
> > > substrings
> > > as latin-1?
> > >
> >
> > unicode dtypes have nothing to do with the loadtxt issue. They are not
> > related.
>
> I'm talking about what loadtxt does with the 'S' dtype. As I showed
> earlier,
> if the file is not encoded as ascii or latin-1 then the byte strings are
> corrupted (see below).
>
> This is because loadtxt opens the file with the default system encoding (by
> not explicitly specifying an encoding):
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>
> It then processes each line with asbytes() which encodes them as latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>


Wow, this is just horrible; it might be the source of the bug.



>
> Being an English speaker I don't normally use non-ascii characters in
> filenames but my system (Ubuntu Linux) still uses utf-8 rather than
> latin-1 or
> (and rightly so!).
>
> > >
> > > An obvious improvement would be along the lines of what Chris Barker
> > > suggested: decode as latin-1, do the processing and then reencode as
> > > latin-1.
> > >
> >
> > no, the right solution is to add an encoding argument.
> > Its a 4 line patch for python2 and a 2 line patch for python3 and the
> issue
> > is solved, I'll file a PR later.
>
> What is the encoding argument for? Is it to be used to decode, process the
> text and then re-encode it for an array with dtype='S'?
>

it is only used to decode the file into text, nothing more.
loadtxt is supposed to load text files; it should never have to deal with
bytes at all.
But I haven't looked into the function deeply yet; there might be ugly
surprises.

The output of the array is determined by the dtype argument and not by the
encoding argument.

Let's please let the loadtxt issue go to rest.
We know the issue, we know it can be fixed without adding anything
complicated to numpy.
We just have to use what python already provides us.
The technical details of the fix can be discussed in the github issue.
(I plan to have a look this weekend, but if someone else wants to do it, let
me know.)
Oscar Benjamin
2014-01-17 15:26:05 UTC
On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
> <***@gmail.com> wrote:
>
> > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> > >
> > > no, the right solution is to add an encoding argument.
> > > Its a 4 line patch for python2 and a 2 line patch for python3 and the
> > issue
> > > is solved, I'll file a PR later.
> >
> > What is the encoding argument for? Is it to be used to decode, process the
> > text and then re-encode it for an array with dtype='S'?
> >
>
> it is only used to decode the file into text, nothing more.
> loadtxt is supposed to load text files, it should never have to deal with
> bytes ever.
> But I haven't looked into the function deeply yet, there might be ugly
> surprises.
>
> The output of the array is determined by the dtype argument and not by the
> encoding argument.

If the dtype is 'S' then the output should be bytes and you therefore
need to encode the text; there's no such thing as storing text in
bytes without an encoding.

Strictly speaking the 'U' dtype uses the encoding 'ucs-4' or 'utf-32'
which just happens to be as simple as expressing the corresponding
unicode code points as int32 so it's reasonable to think of it as "not
encoded" in some sense (although endianness becomes an issue in
utf-32).
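
For instance, on a little-endian machine (a quick check):

>>> import numpy as np
>>> np.array(['é'], dtype='U1').tostring()
b'\xe9\x00\x00\x00'
>>> 'é'.encode('utf-32-le')
b'\xe9\x00\x00\x00'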

On 17 January 2014 14:11, <***@gmail.com> wrote:
> Windows seems to use consistent en/decoding throughout (example run in IDLE)

The reason for the Py3k bytes/text overhaul is that there were lots of
situations where things *seemed* to work until someone happened to use
a character you didn't try. "Seems to" doesn't cut it! :)

> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
>
>>>> filenames = numpy.loadtxt('filenames.txt', dtype='S')
>>>> filenames
> array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py',
> b'\xd5scar.txt'],
> dtype='|S18')
>>>> fn = open(filenames[-1])
>>>> fn.read()
> '1,2,3,hello\n5,6,7,Õscar\n'
>>>> fn
> <_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'>

You don't show how you created the file. I think that in your case the
content of 'filenames.txt' is correctly encoded latin-1.

My guess is that you did the same as me and opened it in text mode and
wrote the unicode string allowing Python to encode it for you. Judging
by the encoding on fn above I'd say that it wrote the file with cp1252
which is mostly compatible with latin-1. Try it with a byte that is
incompatible between cp1252 and latin-1 e.g.:

In [3]: b'\x80'.decode('cp1252')
Out[3]: '€'

In [4]: b'\x80'.decode('latin-1')
Out[4]: '\x80'

In [5]: b'\x80'.decode('cp1252').encode('latin-1')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
/users/enojb/<ipython-input-5-cfd8b16d6d9f> in <module>()
----> 1 b'\x80'.decode('cp1252').encode('latin-1')

UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in
position 0: ordinal not in range(256)


Oscar
j***@gmail.com
2014-01-17 15:58:25 UTC
On Fri, Jan 17, 2014 at 10:26 AM, Oscar Benjamin
<***@gmail.com> wrote:
> On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote:
>> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
>> <***@gmail.com> wrote:
>>
>> > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
>> > >
>> > > no, the right solution is to add an encoding argument.
>> > > Its a 4 line patch for python2 and a 2 line patch for python3 and the
>> > issue
>> > > is solved, I'll file a PR later.
>> >
>> > What is the encoding argument for? Is it to be used to decode, process the
>> > text and then re-encode it for an array with dtype='S'?
>> >
>>
>> it is only used to decode the file into text, nothing more.
>> loadtxt is supposed to load text files, it should never have to deal with
>> bytes ever.
>> But I haven't looked into the function deeply yet, there might be ugly
>> surprises.
>>
>> The output of the array is determined by the dtype argument and not by the
>> encoding argument.
>
> If the dtype is 'S' then the output should be bytes and you therefore
> need to encode the text; there's no such thing as storing text in
> bytes without an encoding.
>
> Strictly speaking the 'U' dtype uses the encoding 'ucs-4' or 'utf-32'
> which just happens to be as simple as expressing the corresponding
> unicode code points as int32 so it's reasonable to think of it as "not
> encoded" in some sense (although endianness becomes an issue in
> utf-32).
>
> On 17 January 2014 14:11, <***@gmail.com> wrote:
>> Windows seems to use consistent en/decoding throughout (example run in IDLE)
>
> The reason for the Py3k bytes/text overhaul is that there were lots of
> situations where things *seemed* to work until someone happens to use
> a character you didn't try. "Seems to" doesn't cut it! :)
>
>> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
>> 32 bit (Intel)] on win32
>>
>>>>> filenames = numpy.loadtxt('filenames.txt', dtype='S')
>>>>> filenames
>> array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py',
>> b'\xd5scar.txt'],
>> dtype='|S18')
>>>>> fn = open(filenames[-1])
>>>>> fn.read()
>> '1,2,3,hello\n5,6,7,Õscar\n'
>>>>> fn
>> <_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'>
>
> You don't show how you created the file. I think that in your case the
> content of 'filenames.txt' is correctly encoded latin-1.

I had created it with os.listdir but deleted some lines.
Running the full script again, I still get the same correct answer for fn:
------------
import os

if 1:
    with open('filenames5.txt', 'w') as fout:
        fout.writelines([f + '\n' for f in os.listdir('.')])
    with open('filenames.txt') as fin:
        print(fin.read())

import numpy

#filenames = numpy.loadtxt('filenames.txt')
filenames = numpy.loadtxt('filenames5.txt', dtype='S')
fn = open(filenames[-1])
------------


>
> My guess is that you did the same as me and opened it in text mode and
> wrote the unicode string allowing Python to encode it for you. Judging
> by the encoding on fn above I'd say that it wrote the file with cp1252
> which is mostly compatible with latin-1. Try it with a byte that is
> incompatible between cp1252 and latin-1 e.g.:
>
> In [3]: b'\x80'.decode('cp1252')
> Out[3]: '€'
>
> In [4]: b'\x80'.decode('latin-1')
> Out[4]: '\x80'
>
> In [5]: b'\x80'.decode('cp1252').encode('latin-1')
> ---------------------------------------------------------------------------
> UnicodeEncodeError Traceback (most recent call last)
> /users/enojb/<ipython-input-5-cfd8b16d6d9f> in <module>()
> ----> 1 b'\x80'.decode('cp1252').encode('latin-1')
>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in
> position 0: ordinal not in range(256)

I get similar problems when I use a file that someone else has
written; however, I haven't seen many problems if I do everything on
Windows.

The main problems I get, and where I don't know how it's supposed to
work best, are when we get "foreign" data.

Some examples I just played with that are closer to what we use in
statsmodels, but don't have any unit tests:

>>> filenames1 = numpy.recfromtxt(open('Õscar.txt',"rb"), delimiter=',')
>>> filenames1
rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xd5scar')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S5')])
>>> filenames1['f3'][-1]
b'\xd5scar'
>>> filenames1['f3'] == 'Õscar'
False
>>> filenames1['f3'] == 'Õscar'.encode('cp1252')
array([False, True], dtype=bool)
>>> filenames1['f3'] == 'hello'
False
>>> filenames1['f3'] == b'hello'
array([ True, False], dtype=bool)
>>> filenames1['f3'] == b'\xd5scar'
array([False, True], dtype=bool)
>>> filenames1['f3'] == np.array(['Õscar'.encode('utf8')], 'S5')
array([False, False], dtype=bool)

Josef

Oscar Benjamin
2014-01-17 17:13:05 UTC
Permalink
On Fri, Jan 17, 2014 at 10:58:25AM -0500, ***@gmail.com wrote:
> On Fri, Jan 17, 2014 at 10:26 AM, Oscar Benjamin
> <***@gmail.com> wrote:
> > On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote:
> >
> > You don't show how you created the file. I think that in your case the
> > content of 'filenames.txt' is correctly encoded latin-1.
>
> I had created it with os.listdir but deleted some lines

You used os.listdir to generate the unicode strings that you write to the
file. The underlying Win32 API returns filenames encoded as utf-16 but Python
takes care of decoding them under the hood so you just get abstract unicode
strings here in Python 3.

It is the write method of the file object that encodes the unicode strings and
hence determines the byte content of 'filenames5.txt'. You can check the
fout.encoding attribute to see what encoding it uses by default.
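
Something like this shows the difference (a rough sketch, untested here;
the file name is just the one from your script):

import locale

# The default text-mode encoding comes from the locale, so it varies
# between systems (e.g. cp1252 on Western Windows, utf-8 on most
# modern Linux installs):
print(locale.getpreferredencoding(False))

# Passing encoding= explicitly removes the ambiguity when writing:
with open('filenames5.txt', 'w', encoding='utf-8') as fout:
    print(fout.encoding)          # 'utf-8', no longer system-dependent
    fout.write('\xd5scar.txt\n')  # '\xd5' is 'Õ'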

> Running the full script again I still get the same correct answer for fn
> ------------
> import os
> if 1:
>     with open('filenames5.txt', 'w') as fout:
>         fout.writelines([f + '\n' for f in os.listdir('.')])
>     with open('filenames5.txt') as fin:
>         print(fin.read())
>
> import numpy
>
> #filenames = numpy.loadtxt('filenames.txt')
> filenames = numpy.loadtxt('filenames5.txt', dtype='S')
> fn = open(filenames[-1])

The question is what do you get when you do:

In [1]: with open('tmp.txt', 'w') as fout:
   ...:     print(fout.encoding)
   ...:
UTF-8

I get utf-8 by default if no encoding is specified. This means that when I
write to the file like so

In [2]: with open('tmp.txt', 'w') as fout:
   ...:     fout.write('Õscar')
   ...:

If I read it back in binary I get different bytes from you:

In [3]: with open('tmp.txt', 'rb') as fin:
   ...:     print(fin.read())
   ...:
b'\xc3\x95scar'

Numpy.loadtxt will correctly decode those bytes as utf-8:

In [5]: b'\xc3\x95scar'.decode('utf-8')
Out[5]: 'Õscar'

But then it reencodes them with latin-1 before storing them in the array:

In [6]: b'\xc3\x95scar'.decode('utf-8').encode('latin-1')
Out[6]: b'\xd5scar'

This byte string will not be recognised by my Linux OS (POSIX uses bytes for
filenames and an exact match is needed). So if I pass that to open() it will
fail.
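
A rough sketch of that corruption (the file name is just an example):

# utf-8 bytes decoded and then re-encoded as latin-1 no longer match
# the bytes on disk:
raw = 'Õscar.txt'.encode('utf-8')
mangled = raw.decode('utf-8').encode('latin-1')
print(raw)      # b'\xc3\x95scar.txt' -- what the OS actually stores
print(mangled)  # b'\xd5scar.txt' -- different bytes, so open() fails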

<snip>
>
> I get similar problems when I use a file that someone else has
> written; however, I haven't seen many problems if I do everything on
> Windows.

If you use a proper explicit encoding then you can savetxt from any system and
loadtxt on any other without corruption.

> The main problems I get, and where I don't know how it's supposed to
> work best, are when we get "foreign" data.

Text data needs to have metadata specifying the encoding. This is something
that people who pass data around need to think about.


Oscar
Julian Taylor
2014-01-17 19:18:47 UTC
Permalink
On 17.01.2014 15:12, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
> <***@gmail.com <mailto:***@gmail.com>> wrote:
>
> On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
> > <***@gmail.com <mailto:***@gmail.com>>wrote:
> >
> > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > > > Julian Taylor <jtaylor.debian <at> googlemail.com
> <http://googlemail.com>> writes:
> > > > [clip]
> > >
> >
> > > > > For backward compatibility we *cannot* change S.
> > >
> > > Do you mean to say that loadtxt cannot be changed from decoding
> using
> > > system
> > > default, splitting on newlines and whitespace and then encoding the
> > > substrings
> > > as latin-1?
> > >
> >
> > unicode dtypes have nothing to do with the loadtxt issue. They are not
> > related.
>
> I'm talking about what loadtxt does with the 'S' dtype. As I showed
> earlier,
> if the file is not encoded as ascii or latin-1 then the byte strings are
> corrupted (see below).
>
> This is because loadtxt opens the file with the default system
> encoding (by
> not explicitly specifying an encoding):
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>
> It then processes each line with asbytes() which encodes them as
> latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
>
>
> wow this is just horrible, it might be the source of the bug.
>
>
>
>
> Being an English speaker I don't normally use non-ascii characters in
> filenames but my system (Ubuntu Linux) still uses utf-8 rather than
> latin-1 (and rightly so!).
>
> > >
> > > An obvious improvement would be along the lines of what Chris Barker
> > > suggested: decode as latin-1, do the processing and then reencode as
> > > latin-1.
> > >
> >
> > no, the right solution is to add an encoding argument.
> > Its a 4 line patch for python2 and a 2 line patch for python3 and
> the issue
> > is solved, I'll file a PR later.
>
> What is the encoding argument for? Is it to be used to decode,
> process the
> text and then re-encode it for an array with dtype='S'?
>
>
> it is only used to decode the file into text, nothing more.
> loadtxt is supposed to load text files, it should never have to deal
> with bytes ever.
> But I haven't looked into the function deeply yet, there might be ugly
> surprises.
>
> The output of the array is determined by the dtype argument and not by
> the encoding argument.
>
> Let's please let the loadtxt issue go to rest.
> We know the issue, we know it can be fixed without adding anything
> complicated to numpy.
> We just have to use what python already provides us.
> The technical details of the fix can be discussed in the github issue.
> (Plan to have a look this weekend, but if someone else wants to do it
> let me know).
>

Work in progress PR:
https://github.com/numpy/numpy/pull/4208

I also seem to have fixed the original bug, although that wasn't even my
intention with that PR :)
Apparently it was indeed one of the broken asbytes calls.

If you have applications using loadtxt please give it a try, but
genfromtxt is still completely broken (and a much larger fix -- asbytes
is used everywhere).
j***@gmail.com
2014-01-17 19:58:21 UTC
Permalink
On Fri, Jan 17, 2014 at 2:18 PM, Julian Taylor
<***@googlemail.com> wrote:
> <snip>
>
> Work in progress PR:
> https://github.com/numpy/numpy/pull/4208
>
> I also seem to have fixed the original bug, although that wasn't even my
> intention with that PR :)
> Apparently it was indeed one of the broken asbytes calls.
>
> If you have applications using loadtxt please give it a try, but
> genfromtxt is still completely broken (and a much larger fix -- asbytes
> is used everywhere).

does this still work?

>>> numpy.loadtxt(open('Õscar_3.txt',"rb"), 'S')
array([b'1,2,3,hello', b'5,6,7,\xc3\x95scarscar', b'15,2,3,hello',
b'20,2,3,\xc3\x95scar'],
dtype='|S16')

To compare:

>>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'), delimiter=',')
Traceback (most recent call last):
File "<pyshell#251>", line 1, in <module>
numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'), delimiter=',')
File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
line 1828, in recfromtxt
output = genfromtxt(fname, **kwargs)
File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
line 1351, in genfromtxt
first_values = split_line(first_line)
File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py",
line 207, in _delimited_splitter
line = line.split(self.comments)[0]
TypeError: Can't convert 'bytes' object to str implicitly

>>> numpy.recfromtxt(open('Õscar_3.txt',"rb"), delimiter=',')
rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'),
(15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10')])

Josef

Chris Barker
2014-01-17 20:17:58 UTC
Permalink
> >>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'),
> delimiter=',')

> Traceback (most recent call last):
> File "<pyshell#251>", line 1, in <module>
> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'),
> delimiter=',')
> File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
> line 1828, in recfromtxt
> output = genfromtxt(fname, **kwargs)
> File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
> line 1351, in genfromtxt
> first_values = split_line(first_line)
> File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py",
> line 207, in _delimited_splitter
> line = line.split(self.comments)[0]
> TypeError: Can't convert 'bytes' object to str implicitly
>

That's pretty broken -- if you know the encoding, you should certainly be
able to get a proper unicode string out of it.


> >>> numpy.recfromtxt(open('Õscar_3.txt',"rb"), delimiter=',')
> rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'),
> (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')],
> dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10')])
>

So the problem here is that recfromtxt is making all "text" into bytes
objects ('S'?) -- which is probably not what you want, particularly if you
specify an encoding. Though I can't figure out at the moment why the
previous one failed -- where did the bytes object come from when the
encoding was specified?

By the way -- this is apparently a utf-8 file with some non-ascii text in it.
By my proposal, without an encoding specified, it should default to latin-1:

In that case, you might get unicode string objects that are incorrectly
decoded. But:

it would not raise an exception

you could recover the proper text with:

the_text.encode('latin-1').decode('utf-8')

On the other hand, if this was an ascii-compatible non-utf8 encoded file,
and we tried to read it as utf-8, it could barf on the non-ascii text
altogether, and if it didn't the non-ascii text would be corrupted and
impossible to recover.

I think the issue is that I'm not really proposing latin-1 -- I'm
proposing "an ascii-compatible encoding that will do the right thing with
ascii bytes, and pass through any other bytes untouched" - latin-1, at
least as implemented by Python, satisfies that criterion.
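
A quick sketch of that property (it should hold for any byte string,
since latin-1 maps every byte 0-255 to a code point):

# decode/encode with latin-1 is lossless for arbitrary bytes:
data = bytes(range(256))
assert data.decode('latin-1').encode('latin-1') == data

# so mis-decoded utf-8 text stays recoverable:
utf8_bytes = 'Õscar'.encode('utf-8')   # b'\xc3\x95scar'
text = utf8_bytes.decode('latin-1')    # mojibake, but no exception
assert text.encode('latin-1').decode('utf-8') == 'Õscar'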

-Chris


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
j***@gmail.com
2014-01-17 20:36:12 UTC
Permalink
On Fri, Jan 17, 2014 at 3:17 PM, Chris Barker <***@noaa.gov> wrote:
> >>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'),
> delimiter=',')
>>
>> Traceback (most recent call last):
>> File "<pyshell#251>", line 1, in <module>
>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'),
>> delimiter=',')
>> File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
>> line 1828, in recfromtxt
>> output = genfromtxt(fname, **kwargs)
>> File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
>> line 1351, in genfromtxt
>> first_values = split_line(first_line)
>> File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py",
>> line 207, in _delimited_splitter
>> line = line.split(self.comments)[0]
>> TypeError: Can't convert 'bytes' object to str implicitly
>
>
> That's pretty broken -- if you know the encoding, you should certainly be
> able to get a proper unicode string out of it.
>
>>
>> >>> numpy.recfromtxt(open('Õscar_3.txt',"rb"), delimiter=',')
>> rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'),
>> (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')],
>> dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10')])
>
>
> So the problem here is that recfromtxt is making all "text" into bytes
> objects ('S'?) -- which is probably not what you want, particularly if you
> specify an encoding. Though I can't figure out at the moment why the
> previous one failed -- where did the bytes object come from when the
> encoding was specified?

Yes, it's a utf-8 file with nonascii.

I don't know what I **should** want.

For now I do want bytes, because that's how I changed statsmodels in
the py3 conversion.

This was just based on the fact that recfromtxt doesn't work with
strings on python 3, so I switched to using bytes following the lead
of numpy.

I'm mainly worried about backwards compatibility, since we have been
using this for 2 or 3 years. It would be easy to change in statsmodels
when gen/recfromtxt is fixed, but I assume there is lots of other code
using similar interpretation of S/bytes in numpy.

Josef

Chris Barker
2014-01-17 21:20:39 UTC
Permalink
On Fri, Jan 17, 2014 at 12:36 PM, <***@gmail.com> wrote:

> > ('S' ?) -- which is probably not what you want particularly if you
> specify
> > an encoding. Though I can't figure out at the moment why the previous one
> > failed -- where did the bytes object come from when the encoding was
> > specified?
>
> Yes, it's a utf-8 file with nonascii.
>
> I don't know what I **should** want.
>

well, you **should** want:

The numbers parsed out for you (Otherwise, why use recfromtxt), and the
text as properly decoded unicode strings.

Python does very well with unicode -- and you are MUCH happier if you do
the encoding/decoding as close to I/O as possible. recfromtxt is, in a way,
decoding already, converting ascii representation of numbers to an internal
binary representation -- why not handle the text at the same time.

There certainly are use cases for keeping the text as encoded bytes, but
I'd say those fall into the categories of:

1) Special case
2) You should know what you are doing.

So having recfromtxt auto-determine that for you makes little sense.

Note that if you don't know the file encoding, this is tricky. My thoughts:

1) don't use the system default encoding!!! (see my other note on that!)

2) Either:
a) open as a binary file and use bytes for anything that doesn't parse
as text -- this means that the user will need to do the conversion to text
themselves

b) decode as latin-1: this would work well for ascii and _some_ non-ascii
text, and would be recoverable for ALL text.

I prefer (b). The point here is that if the user gets bytes, then they
will either have to assume ascii, or need to hand-decode it, and if they
just want to assume ascii, they have a bytes object with limited
text functionality so will probably need to decode it anyway (unless they
are just passing it through).

If the user gets unicode objects that may not be properly decoded, they
can either assume it was ascii, and if they only do ascii-compatible things
with it, it will work, or they can encode/decode it and get the proper
stuff back, but only if they know the encoding, and if that's the case, why
did they not specify that in the first place?


> For now I do want bytes, because that's how I changed statsmodels in
> the py3 conversion.
>
> This was just based on the fact that recfromtxt doesn't work with
> strings on python 3, so I switched to using bytes following the lead
> of numpy.
>

Well, that's really too bad -- it doesn't sound like you wanted bytes, it
sounds like you wanted something that didn't crash -- fair enough. But the
"proper" solution is for recfromtext to support text....

> I'm mainly worried about backwards compatibility, since we have been
> using this for 2 or 3 years. It would be easy to change in statsmodels
> when gen/recfromtxt is fixed, but I assume there is lots of other code
> using similar interpretation of S/bytes in numpy.
>

well, it does sound like enough folks are using 'S' to mean bytes -- too
bad, but what can we do now about that?

I'd like an 's' for ascii-strings though.

-Chris

j***@gmail.com
2014-01-17 21:43:58 UTC
Permalink
On Fri, Jan 17, 2014 at 4:20 PM, Chris Barker <***@noaa.gov> wrote:
> On Fri, Jan 17, 2014 at 12:36 PM, <***@gmail.com> wrote:
>>
>> > ('S' ?) -- which is probably not what you want particularly if you
>> > specify
>> > an encoding. Though I can't figure out at the moment why the previous
>> > one
>> > failed -- where did the bytes object come from when the encoding was
>> > specified?
>>
>> Yes, it's a utf-8 file with nonascii.
>>
>> I don't know what I **should** want.
>
>
> well, you **should** want:
>
> The numbers parsed out for you (Otherwise, why use recfromtxt), and the
> text as properly decoded unicode strings.
>
> Python does very well with unicode -- and you are MUCH happier if you do the
> encoding/decoding as close to I/O as possible. recfromtxt is, in a way,
> decoding already, converting ascii representation of numbers to an internal
> binary representation -- why not handle the text at the same time.
>
> There certainly are use cases for keeping the text as encoded bytes, but I'd
> say those fall into the categories of:
>
> 1) Special case
> 2) You should know what you are doing.
>
> So having recfromtxt auto-determine that for you makes little sense.
>
> Note that if you don't know the file encoding, this is tricky. My thoughts:
>
> 1) don't use the system default encoding!!! (see my other note on that!)
>
> 2) Either:
> a) open as a binary file and use bytes for anything that doesn't parse
> as text -- this means that the user will need to do the conversion to text
> themselves
>
> b) decode as latin-1: this would work well for ascii and _some_ non-ascii
> text, and would be recoverable for ALL text.
>
> I prefer (b). The point here is that if the user gets bytes, then they will
> either have to assume ascii, or need to hand-decode it, and if they just
> want to assume ascii, they have a bytes object with limited text functionality
> so will probably need to decode it anyway (unless they are just passing it
> through)
>
> If the user gets unicode objects that may not be properly decoded, they can
> either assume it was ascii, and if they only do ascii-compatible things with
> it, it will work, or they can encode/decode it and get the proper stuff
> back, but only if they know the encoding, and if that's the case, why did
> they not specify that in the first place?
>
>>
>> For now I do want bytes, because that's how I changed statsmodels in
>> the py3 conversion.
>>
>> This was just based on the fact that recfromtxt doesn't work with
>> strings on python 3, so I switched to using bytes following the lead
>> of numpy.
>
>
> Well, that's really too bad -- it doesn't sound like you wanted bytes, it
> sounds like you wanted something that didn't crash -- fair enough. But the
> "proper" solution is for recfromtext to support text....

But also solution 2a) is fine for most of the code.
Often it doesn't really matter:

>>> dta_4
array([(1, 2, 3, b'hello', 'hello'),
(5, 6, 7, b'\xc3\x95scarscar', 'Õscarscar'),
(15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar', 'Õscar')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3',
'S10'), ('f4', '<U9')])

>>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int)
array([[1, 0, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0]])
>>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int)
array([[1, 0, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0]])

It's similar when doing a for loop comparing to the uniques.
Bytes are fine and nobody has to tell me what encoding they are using.

It doesn't work so well for pretty-printing results, so using latin-1
there as you describe above might be a good solution if users don't
decode to text/string.

Josef

Chris Barker
2014-01-17 21:55:56 UTC
Permalink
On Fri, Jan 17, 2014 at 1:43 PM, <***@gmail.com> wrote:

> > 2) Either:
> > a) open as a binary file and use bytes for anything that doesn't
> parse
> > as text -- this means that the user will need to do the conversion to
> text
> > themselves
> >
> > b) decode as latin-1: this would work well for ascii and _some_
> non-ascii
> > text, and would be recoverable for ALL text.
>


> But also solution 2a) is fine for most of the code
> Often it doesn't really matter
>

indeed -- I did list it as an option ;-)


> >>> dta_4
> array([(1, 2, 3, b'hello', 'hello'),
> (5, 6, 7, b'\xc3\x95scarscar', 'Õscarscar'),
> (15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar',
> 'Õscar')],
> dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3',
> 'S10'), ('f4', '<U9')])
>
> >>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int)
> array([[1, 0, 0],
> [0, 0, 1],
> [1, 0, 0],
> [0, 1, 0]])
> >>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int)
> array([[1, 0, 0],
> [0, 0, 1],
> [1, 0, 0],
> [0, 1, 0]])
>
> similar doing a for loop comparing to the uniques.
> bytes are fine and nobody has to tell me what encoding they are using.
>

and this same operation would work fine if that text was in (possibly
improperly decoded) unicode objects.


> It doesn't work so well for pretty-printing results, so using latin-1
> there as you describe above might be a good solution if users don't
> decode to text/string.
>

exactly -- if you really need to work with the text, you need to know the
encoding. Period. End of Story.

If you don't know the encoding then there is still some stuff you can do
with it, so you want something that:

a) won't barf on any input

b) will preserve the bytes if you need to pass them along, or compare them,
or...

Either bytes or latin-1 decoded strings will work for that. Bytes are
better, as it's more explicit that you may not have valid text here.
Unicode strings are better as you can do stringy things with them. Either
way, you'll need to encode or decode to get full functionality.

-Chris


j***@gmail.com
2014-01-18 02:15:51 UTC
Permalink
It looks like both recfromtxt and loadtxt are flexible enough to
handle string/bytes en/decoding, with a bit of work and using enough
information:

>>> dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<U9')]

>>> data = numpy.recfromtxt(open('Õscar_3.txt',"rb"), dtype=dtype, delimiter=',',converters={3:lambda x: x.decode('utf8')})
>>> data['f3'] == 'Õscar'
array([False, False, False, True], dtype=bool)
>>> data
rec.array([(1, 2, 3, 'hello'), (5, 6, 7, 'Õscarscar'), (15, 2, 3, 'hello'),
(20, 2, 3, 'Õscar')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<U9')])


>>> data = numpy.loadtxt(open('Õscar_3.txt',"rb"), dtype=dtype, delimiter=',',converters={3:lambda x: x.decode('utf8')})
>>> data
array([(1, 2, 3, 'hello'), (5, 6, 7, 'Õscarscar'), (15, 2, 3, 'hello'),
(20, 2, 3, 'Õscar')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<U9')])
>>>

Josef
Freddie Witherden
2014-01-17 13:18:38 UTC
Permalink
On 17/01/14 13:09, Aldcroft, Thomas wrote:
> I've been playing around with porting a stack of analysis libraries to
> Python 3 and this is a very timely thread and comment. What I
> discovered right away is that all the string data coming from binary
> HDF5 files show up (as expected) as 'S' type,, but that trying to make
> everything actually work in Python 3 without converting to 'U' is a big
> mess of whack-a-mole.
>
> Yes, it's possible to change my libraries to use bytestring literals
> everywhere, but the Python 3 user experience becomes horrible because to
> interact with the data all downstream applications need to use
> bytestring literals everywhere. E.g. doing a simple filter like
> `string_array == 'foo'` doesn't work, and this will break all existing
> code when trying to run in Python 3. And every time you try to print
> something it has this horrible "b" in front. Ugly, and it just won't
> work well in the end.

In terms of HDF5 it is interesting to look at how h5py -- which has to
go between NumPy types and HDF5 conventions -- handles the problem as
described here:

http://www.h5py.org/docs/topics/strings.html

which IMHO got it about right.

Regards, Freddie.
Chris Barker
2014-01-17 20:30:06 UTC
Permalink
On Fri, Jan 17, 2014 at 5:18 AM, Freddie Witherden <***@witherden.org>wrote:

> In terms of HDF5 it is interesting to look at how h5py -- which has to
> go between NumPy types and HDF5 conventions -- handles the problem as
> described here:
>
> http://www.h5py.org/docs/topics/strings.html


from that:
"""All strings in HDF5 hold encoded text.

You can’t store arbitrary binary data in HDF5 strings.
"""

This is actually the same as a py3 string (though the mechanism may be
completely different), and the problem with numpy's 'S' - is it text or
bytes? Given the name and history, it should be text, but apparently people
have been using it for bytes, so we have to keep that meaning/use case. But
I suggest that, like Python 3, we officially declare that you should not
consider it text, and not do any implicit conversions.

Which means we could use a one-byte-per-character text dtype.

"""At the high-level interface, h5py exposes three kinds of strings. Each
maps to a specific type within Python (but see str_py3 below):

Fixed-length ASCII (NumPy S type)
....
"""
This is wrong, or mis-guided, or maybe only a little confusing -- 'S' is
not an ASCII string (even though I wish it were...). But clearly the HDF
folks think we need one!

"""
Fixed-length ASCII

These are created when you use numpy.string_:

>>> dset.attrs["name"] = numpy.string_("Hello")

or the S dtype:

>>> dset = f.create_dataset("string_ds", (100,), dtype="S10")
"""
Pardon my py3 ignorance -- is numpy.string_ the same as 'S' in py3?
From another post, I thought you'd need to use numpy.bytes_ (which is the
same on py2)

"""Variable-length ASCII

These are created when you assign a byte string to an attribute:

>>> dset.attrs["attr"] = b"Hello"
or when you create a dataset with an explicit “bytes” vlen type:

>>> dt = h5py.special_dtype(vlen=bytes)
>>> dset = f.create_dataset("name", (100,), dtype=dt)

Note that they’re not fully identical to Python byte strings.
"""
This implies that HDF would be well served by an ascii text type.

"""
What about NumPy’s U type?

NumPy also has a Unicode type, a UTF-32 fixed-width format (4-byte
characters). HDF5 has no support for wide characters. Rather than trying to
hack around this and “pretend” to support it, h5py will raise an error when
attempting to create datasets or attributes of this type.
"""

Interesting, though I think irrelevant to this conversation -- but it would
be nice if h5py would encode/decode to/from utf-8 for these.

-Chris
Chris Barker
2014-01-17 20:56:42 UTC
Permalink
Small note:

> Being an English speaker I don't normally use non-ascii characters in
> filenames but my system (Ubuntu Linux) still uses utf-8 rather than
> latin-1 (and rightly so!).


just to be really clear -- encoding for filenames and encoding for
file content have nothing to do with each other. sys.getdefaultencoding()
is _supposed_ to be a default encoding for file content -- not file names.

And of course you need to use the system file name encoding for file names!
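
To see the two settings side by side (a sketch; strictly, Python 3's
open() consults locale.getpreferredencoding for content rather than
sys.getdefaultencoding):

import sys
import locale

print(sys.getfilesystemencoding())          # used for file *names*
print(locale.getpreferredencoding(False))   # default for file *content*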

-Chris


Andrew Collette
2014-01-21 23:22:50 UTC
Permalink
Hi Chris,

Just stumbled on this discussion (I'm the lead author of h5py).

We would be overjoyed if there were a 1-byte text type available in
NumPy. String handling is the source of major pain right now in the
HDF5 world. All HDF5 strings are text (opaque types are used for
binary data), but we're forced into using the "S" type most of the
time because (1) the "U" type doesn't round-trip between HDF5 and
NumPy, as there's no fixed-width wide-character string type in HDF5,
and (2) "U" takes 4x the space, which is a problem for big scientific
datasets.
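
For example (a quick sketch of the 4x difference):

import numpy as np

print(np.array(['hello'], dtype='S5').itemsize)   # 5 bytes per element
print(np.array(['hello'], dtype='U5').itemsize)   # 20 bytes per element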

ASCII-only would be preferable, partly for selfish reasons (HDF5's
default is ASCII only), and partly to make it possible to copy them
into containers labelled "UTF-8" without manually inspecting every
value.
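
A rough sketch of why (the high-bit bytes are the latin-1 'Õscar' from
earlier in the thread):

# ascii bytes are already valid utf-8, so they can be copied into a
# utf-8 labelled container unchanged:
assert b'plain ascii'.decode('utf-8') == 'plain ascii'

# high-bit latin-1 bytes cannot:
try:
    b'\xd5scar'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)   # invalid continuation byte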

> """At the high-level interface, h5py exposes three kinds of strings. Each
> maps to a specific type within Python (but see str_py3 below):
>
> Fixed-length ASCII (NumPy S type)
> ....
> """
> This is wrong, or mis-guided, or maybe only a little confusing -- 'S' is not
> an ASCII string (even though I wish it were...). But clearly the HDF folsk
> think we need one!

Yes, this was intended to state that the HDF5 "Fixed-width ASCII" type
maps to NumPy "S" at conversion time, which is obviously a wretched
solution on Py3.

>>>> dset = f.create_dataset("string_ds", (100,), dtype="S10")
> """
> Pardon my py3 ignorance -- is numpy.string_ the same as 'S' in py3? Form
> another post, I thought you'd need to use numpy.bytes_ (which is the same on
> py2)

It does produce an instance of 'numpy.bytes_', although I think the
h5py docs should be changed to use bytes_ explicitly.

Andrew
Chris Barker
2014-01-22 00:30:23 UTC
Permalink
On Tue, Jan 21, 2014 at 3:22 PM, Andrew Collette
<***@gmail.com>wrote:

> Just stumbled on this discussion (I'm the lead author of h5py).
>
> We would be overjoyed if there were a 1-byte text type available in
> NumPy.


cool -- it looks like someone is going to get a draft PEP going -- so stay
tuned, and add your comments when there is something to add them to.

String handling is the source of major pain right now in the
> HDF5 world. All HDF5 strings are text (opaque types are used for
> binary data), but we're forced into using the "S" type most of the
> time because (1) the "U" type doesn't round-trip between HDF5 and
> NumPy, as there's no fixed-width wide-character string type in HDF5,
>

it looks from here:
http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html

that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a
lot of calls to encode/decode -- which could be pretty slow, compared to
other ways to dump numpy arrays into HDF5 -- that may be what you mean by
"doesn't round trip".

This may be a good case for a numpy utf-8 dtype, I suppose (or an
arbitrary-encoding dtype, anyway).
But: How does hdf handle the fact that utf-8 is not a fixed-length encoding?

ASCII-only would be preferable, partly for selfish reasons (HDF5's
> default is ASCII only), and partly to make it possible to copy them
> into containers labelled "UTF-8" without manually inspecting every
> value.
>

hmm -- ascii does have those advantages, but I'm not sure it's worth the
restriction on what can be encoded. But you're quite right, you could dump
ascii straight into something expecting utf-8, whereas you could not do
that with latin-1, for instance. But you can't go the other way -- does it
help much to avoid encoding in one direction?

But maybe we can have an any-one-byte-per-char encoding option, in which
case h5py could use ascii, but we wouldn't have to everywhere.

-Chris

Andrew Collette
2014-01-22 01:54:33 UTC
Permalink
Hi Chris,

> it looks from here:
> http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html
>
> that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a
> lot of calls to encode/decode -- which could be pretty slow, compared to
> other ways to dump numpy arrays into HDF-5 -- that may be waht you mean by
> "doesn't round trip".

HDF5 does have variable-length string support for UTF-8, so we map
that directly to the unicode type (str on Py3) exactly as you
describe, by encoding when we write to the file. But there's no way
to round-trip with *fixed-width* strings. You can go from e.g. a 10
byte ASCII string to "U10", but going the other way fails if there are
characters which take more than 1 byte to represent. We don't always
get to choose the destination type, when e.g. writing into an existing
dataset, so we can't always write vlen strings.
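
A rough sketch of the mismatch, in plain Python (no HDF5 needed):

# a 'U10' value can need up to 40 bytes once encoded, so it isn't
# guaranteed to fit a fixed 10-byte utf-8 buffer:
s = 'Õ' * 10                   # fits comfortably in NumPy 'U10'
print(len(s))                  # 10 characters
print(len(s.encode('utf-8')))  # 20 bytes -- too big for a 10-byte buffer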

> This may be a good case for a numpy utf-8 dtype, I suppose (or a arbitrary
> encoding dtype, anyway).
> But: How does hdf handle the fact that utf-8 is not a fixed length encoding?

With fixed-width strings it doesn't, really. If you use vlen strings
it's fine, but otherwise there's just a fixed-width buffer labelled
"UTF-8". Presumably you're supposed to be careful when writing not to
chop the string off in the middle of a multibyte character. We could
truncate strings on their way to the file, but the risk of data
loss/corruption led us to simply not support it at all.

> hmm -- ascii does have those advantages, but I'm not sure its worth the
> restriction on what can be encoded. But you're quite right, you could dump
> asciii straight into something expecting utf-8, whereas you could not do
> that with latin-1, for instance. But you can't go the other way -- does it
> help much to avoided encoding in one direction?

It would help for h5py specifically because most HDF5 strings are
labelled "ASCII". But it's a question for the community which is more
important: the high-bit characters in latin-1, or write-compatibility
with UTF-8.

Andrew
Oscar Benjamin
2014-01-22 10:46:49 UTC
Permalink
On Tue, Jan 21, 2014 at 06:54:33PM -0700, Andrew Collette wrote:
> Hi Chris,
>
> > it looks from here:
> > http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html
> >
> > that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a
> > lot of calls to encode/decode -- which could be pretty slow, compared to
> > other ways to dump numpy arrays into HDF-5 -- that may be waht you mean by
> > "doesn't round trip".
>
> HDF5 does have variable-length string support for UTF-8, so we map
> that directly to the unicode type (str on Py3) exactly as you
> describe, by encoding when we write to the file. But there's no way
> to round-trip with *fixed-width* strings. You can go from e.g. a 10
> byte ASCII string to "U10", but going the other way fails if there are
> characters which take more than 1 byte to represent. We don't always
> get to choose the destination type, when e.g. writing into an existing
> dataset, so we can't always write vlen strings.

Is it fair to say that people should really be using vlen utf-8 strings for
text? Is it problematic because of the need to interface with non-Python
libraries using the same hdf5 file?

> > This may be a good case for a numpy utf-8 dtype, I suppose (or a arbitrary
> > encoding dtype, anyway).

That's what I was thinking. A ragged utf-8 array could map to an array of vlen
strings. Or am I misunderstanding how hdf5 works?

Looking here:
http://www.h5py.org/docs/topics/special.html

'''
HDF5 supports a few types which have no direct NumPy equivalent.
Among the most useful and widely used are variable-length (VL) types, and
enumerated types. As of version 1.2, h5py fully supports HDF5 enums, and has
partial support for VL types.
'''

So that seems to suggest that h5py already has a use for a variable-length
string dtype.

BTW, as much as the fixed-width 'S' dtype doesn't really work for str in
Python 3, it's also a poor fit for bytes since it strips trailing nulls:

>>> a = np.array(['a\0s\0', 'qwert'], dtype='S')
>>> a
array([b'a\x00s', b'qwert'],
dtype='|S5')
>>> a[0]
b'a\x00s'

> > But: How does hdf handle the fact that utf-8 is not a fixed length encoding?
>
> With fixed-width strings it doesn't, really. If you use vlen strings
> it's fine, but otherwise there's just a fixed-width buffer labelled
> "UTF-8". Presumably you're supposed to be careful when writing not to
> chop the string off in the middle of a multibyte character. We could
> truncate strings on their way to the file, but the risk of data
> loss/corruption led us to simply not support it at all.

Truncating utf-8 is never a good idea. Throwing an error message when it would
truncate is okay though. Presumably you already do this when someone tries to
assign an ASCII string that's too long, right?
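
A quick sketch of why mid-character truncation is unrecoverable ('Õscar'
is just the example string from earlier in the thread):

s = 'Õscar'.encode('utf-8')    # b'\xc3\x95scar'
try:
    s[:1].decode('utf-8')      # chops the two-byte 'Õ' in half
except UnicodeDecodeError as exc:
    print(exc)                 # unexpected end of data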


Oscar
Andrew Collette
2014-01-22 17:45:56 UTC
Permalink
Hi Oscar,

> Is it fair to say that people should really be using vlen utf-8 strings for
> text? Is it problematic because of the need to interface with non-Python
> libraries using the same hdf5 file?

The general recommendation has been to use fixed-width strings for
exactly that reason; FORTRAN programs can't handle vlens, and older
versions of IDL would refuse to deal with anything labelled utf-8,
even fixed-width.

>> > This may be a good case for a numpy utf-8 dtype, I suppose (or a arbitrary
>> > encoding dtype, anyway).
>
> That's what I was thinking. A ragged utf-8 array could map to an array of vlen
> strings. Or am I misunderstanding how hdf5 works?

Yes, that's exactly how HDF5 works for this; at the moment, we handle
vlens with the NumPy object ("O") type storing regular Python strings.
A native variable-length NumPy equivalent would also be appreciated,
although I suspect it's a lot of work.

> Truncating utf-8 is never a good idea. Throwing an error message when it would
> truncate is okay though. Presumably you already do this when someone tries to
> assign an ASCII string that's too long right?

We advertise that HDF5 datasets work identically (as closely as
practical) to NumPy arrays; in this case, NumPy truncates and doesn't
warn, so we do the same.
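
For example (a sketch of the NumPy behaviour referred to above):

import numpy as np

# assignment into a fixed-width array silently chops to the itemsize:
a = np.zeros(1, dtype='S5')
a[0] = b'hello world'
print(a[0])   # b'hello' -- the tail is dropped without warning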

The concern with "U" is more that someone would write a "U10" string
into a 10-byte HDF5 buffer and lose data, even though the advertised
widths were the same. As an observation, a pure-ASCII NumPy type like
the proposed "s" would avoid that completely. With a latin-1 type, it
could still happen as certain characters would become 2 UTF-8 bytes.

Andrew
Chris Barker
2014-01-22 20:07:28 UTC
Permalink
On Wed, Jan 22, 2014 at 2:46 AM, Oscar Benjamin
<***@gmail.com>wrote:

> BTW, as much as the fixed-width 'S' dtype doesn't really work for str in
> Python 3 it's also a poor fit for bytes since it strips trailing nulls:
>
> >>> a = np.array(['a\0s\0', 'qwert'], dtype='S')
> >>> a
> array([b'a\x00s', b'qwert'],
> dtype='|S5')
> >>> a[0]
> b'a\x00s'


WHOOA! Good catch, Oscar.

This conversation started with me suggesting that 'S' on py3 should mean
"ascii string" (or latin-1 string).

Then it was pointed out that it was already being used for arbitrary bytes,
and thus could not be changed to mean a string without breaking already
working code.

However, if 'S' is assigning meaning to null bytes, and doing something
with that, then it is, indeed being treated as an ANSI string (or the old c
string "type", anyway). And any code that is expecting it to be arbitrary
bytes is already broken, and in a way that could result in pretty subtle,
hard to find bugs in the future.

I think we really need a proper bytes dtype (which could be 'S' with the
null byte thing removed), and a proper one-byte-per-character string type.

Though I still don't know the use case for the fixed-length bytes type that
can't be satisfied with the other numeric types, maybe:

In [58]: bytes_15 = np.dtype(('B', 15))


though that doesn't in fact do what I expect:

In [59]: arr = np.zeros((5,), dtype = bytes_15)

In [60]: arr
Out[60]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

shouldn't I get a shape (5,) array, with each element a compound dtype with
15 bytes in it???

How would I spell that?
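
One possible spelling might be the void kind from the dtype table below
(a sketch, not checked against the NumPy of the day):

import numpy as np

arr = np.zeros((5,), dtype='V15')  # each element is one opaque 15-byte chunk
print(arr.shape)    # (5,)
print(arr.dtype)    # dtype('V15')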

By the way, from the docs for dtypes:

http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

"""
The first character specifies the kind of data and the remaining characters
specify how many bytes of data. The supported kinds are

'b' Boolean
'i' (signed) integer
'u' unsigned integer
'f' floating-point
'c' complex-floating point
'S', 'a', string
'U' unicode
'V' raw data (void)
"""
Could we use the 'a' for ascii strings? (even though it now maps directly
to 'S')

And by the way, the docs clearly say "string" there -- not bytes, so at the
very least we need to update the docs...

-Chris


Oscar Benjamin
2014-01-22 21:13:32 UTC
Permalink
On Wed, Jan 22, 2014 at 12:07:28PM -0800, Chris Barker wrote:
> On Wed, Jan 22, 2014 at 2:46 AM, Oscar Benjamin
> <***@gmail.com>wrote:
>
> > BTW, as much as the fixed-width 'S' dtype doesn't really work for str in
> > Python 3 it's also a poor fit for bytes since it strips trailing nulls:
> >
> > >>> a = np.array(['a\0s\0', 'qwert'], dtype='S')
> > >>> a
> > array([b'a\x00s', b'qwert'],
> > dtype='|S5')
> > >>> a[0]
> > b'a\x00s'
>
>
> WHOOA! Good catch, Oscar.
>
> This conversation started with me suggesting that 'S' on py3 should mean
> "ascii string" (or latin-1 string).
>
> Then it was pointed out that it was already being used for arbitrary bytes,
> and thus could not be changed to mean a string without breaking already
> working code.
>
> However, if 'S' is assigning meaning to null bytes, and doing something
> with that, then it is, indeed being treated as an ANSI string (or the old c
> string "type", anyway). And any code that is expecting it to be arbitrary
> bytes is already broken, and in a way that could result in pretty subtle,
> hard to find bugs in the future.
>
> I think we really need a proper bytes dtype (which could be 'S' with the
> null byte thing removed), and a proper one-byte-per-character string type.

It's not safe to stop removing the null bytes. This is how numpy determines
the length of the strings in a dtype='S' array. The strings are not
"fixed-width" but rather have a maximum width. Aything shorter gets padded
with nulls. This is transparent if you index strings from the array:

>>> a = np.array(b'a string of different length words'.split(), dtype='S')
>>> a
array([b'a', b'string', b'of', b'different', b'length', b'words'],
dtype='|S9')
>>> a[0]
b'a'
>>> len(a[0])
1
>>> a.tostring()
b'a\x00\x00\x00\x00\x00\x00\x00\x00string\x00\x00\x00of\x00\x00\x00\x00\x00\x00\x00differentlength\x00\x00\x00words\x00\x00\x00\x00'

If the trailing nulls are not removed then you would get:

>>> a[0]
b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> len(a[0])
9

And I'm sure that someone would get upset about that.

> Though I still don't know the use case for the fixed-length bytes type that
> can't be satisfied with the other numeric types,

Having the null bytes removed and a str (on Py2) object returned is precisely
the use case that distinguishes it from np.uint8. The other differences are the
removal of arithmetic operations.

Some more oddities:

>>> a[0] = 1
>>> a
array([b'1', b'string', b'of', b'different', b'length', b'words'],
dtype='|S9')
>>> a[0] = None
>>> a
array([b'None', b'string', b'of', b'different', b'length', b'words'],
dtype='|S9')
>>> a[0] = range(1, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: cannot set an array element with a sequence
>>> a[0] = (x for x in range(2))
>>> a
array([b'<generato', b'string', b'of', b'different', b'length', b'words'],
dtype='|S9')


Oscar
Chris Barker - NOAA Federal
2014-01-23 01:53:26 UTC
Permalink
On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <***@gmail.com> wrote:

>
> It's not safe to stop removing the null bytes. This is how numpy determines
> the length of the strings in a dtype='S' array. The strings are not
> "fixed-width" but rather have a maximum width.

Exactly--but folks have told us on this list that they want (and are)
using the 'S' style for arbitrary bytes, NOT for text. In which case
you wouldn't want to remove null bytes. This is more evidence that 'S'
was designed to handle c-style one-byte-per-char strings, and NOT
arbitrary bytes, and thus not to map directly to the py2 string type
(you can store null bytes in a py2 string).

Which brings me back to my original proposal: properly map the 'S'
type to the py3 data model, and maybe add some kind of fixed-width
bytes style if there is a use case for that. I still have no idea what
the use case might be.

> If the trailing nulls are not removed then you would get:
>
>>>> a[0]
> b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>>> len(a[0])
> 9
>
> And I'm sure that someone would get upset about that.

Only if they are using it for text -- which you "should not" do with py3.

> Having the null bytes removed and a str (on Py2) object returned is precisely
> the use case that distinguishes it from np.uint8.

But that was because it was designed to be used with text. And if you
want text, then you should use py3 strings, not bytes. And if you
really want bytes, then you wouldn't want null bytes removed.

> The other differences are the
> removal of arithmetic operations.

And 'S' is treated as an atomic element; I'm not sure how you can do
that cleanly with uint8.

> Some more oddities:
>
>>>> a[0] = 1
>>>> a
> array([b'1', b'string', b'of', b'different', b'length', b'words'],
> dtype='|S9')
>>>> a[0] = None
>>>> a
> array([b'None', b'string', b'of', b'different', b'length', b'words'],
> dtype='|S9')

More evidence that this is a text type.....

-Chris
Oscar Benjamin
2014-01-23 10:45:22 UTC
Permalink
On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <***@gmail.com> wrote:
>
> >
> > It's not safe to stop removing the null bytes. This is how numpy determines
> > the length of the strings in a dtype='S' array. The strings are not
> > "fixed-width" but rather have a maximum width.
>
> Exactly--but folks have told us on this list that they want (and are)
> using the 'S' style for arbitrary bytes, NOT for text. In which case
> you wouldn't want to remove null bytes. This is more evidence that 'S'
> was designed to handle c-style one-byte-per-char strings, and NOT
> arbitrary bytes, and thus not to map directly to the py2 string type
> (you can store null bytes in a py2 string).

You can store null bytes in a Py2 string but you normally wouldn't if it was
supposed to be text.

>
> Which brings me back to my original proposal: properly map the 'S'
> type to the py3 data model, and maybe add some kind of fixed width
> bytes style if there is a use case for that. I still have no idea what
> the use case might be.
>

There would definitely be a use case for a fixed-byte-width
bytes-representing-text dtype in record arrays to read from a binary file:

dt = np.dtype([
('name', '|b8:utf-8'),
('param1', '<i4'),
('param2', '<i4')
...
])

with open('binaryfile', 'rb') as fin:
a = np.fromfile(fin, dtype=dt)

You could also use this for ASCII if desired. I don't think it really matters
that utf-8 uses variable width as long as a too long byte string throws an
error (and does not truncate).

For non 8-bit encodings there would have to be some way to handle endianness
without a BOM, but otherwise I think that it's always possible to pad with zero
*bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip
null *characters* after decoding. i.e.:

$ cat tmp.py
import encodings

def test_encoding(s1, enc):
b = s1.encode(enc).ljust(32, b'\0')
s2 = b.decode(enc)
index = s2.find('\0')
if index != -1:
s2 = s2[:index]
assert s1 == s2, enc

encodings_set = set(encodings.aliases.aliases.values())

for N, enc in enumerate(encodings_set):
try:
test_encoding('qwe', enc)
except LookupError:
pass

print('Tested %d encodings without error' % N)
$ python3 tmp.py
Tested 88 encodings without error

> > If the trailing nulls are not removed then you would get:
> >
> >>>> a[0]
> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> >>>> len(a[0])
> > 9
> >
> > And I'm sure that someone would get upset about that.
>
> Only if they are using it for text-which you "should not" do with py3.

But people definitely are using it for text on Python 3. It should be
deprecated in favour of something new but breaking it is just gratuitous.
Numpy doesn't have the option to make a clean break with Python 3 precisely
because it needs to straddle 2.x and 3.x while numpy-based applications are
ported to 3.x.

> > Some more oddities:
> >
> >>>> a[0] = 1
> >>>> a
> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
> > dtype='|S9')
> >>>> a[0] = None
> >>>> a
> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
> > dtype='|S9')
>
> More evidence that this is a text type.....

And the big one:

$ python3
Python 3.2.3 (default, Sep 25 2013, 18:22:43)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>>> a
array([b'asd', b'zxc'],
dtype='|S3')
>>> a[0] = 'qwer' # Unicode string again
>>> a
array([b'qwe', b'zxc'],
dtype='|S3')
>>> a[0] = 'Õscar'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)

The analogous behaviour was very deliberately removed from Python 3:

>>> a[0] == 'qwe'
False
>>> a[0] == b'qwe'
True
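
(That matches plain Python 3, where bytes and str never compare equal, no
matter the contents:)

>>> b'qwe' == 'qwe'
False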


Oscar
j***@gmail.com
2014-01-23 15:41:30 UTC
Permalink
On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin
<***@gmail.com> wrote:
> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
>> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <***@gmail.com> wrote:
>>
>> >
>> > It's not safe to stop removing the null bytes. This is how numpy determines
>> > the length of the strings in a dtype='S' array. The strings are not
>> > "fixed-width" but rather have a maximum width.
>>
>> Exactly--but folks have told us on this list that they want (and are)
>> using the 'S' style for arbitrary bytes, NOT for text. In which case
>> you wouldn't want to remove null bytes. This is more evidence that 'S'
>> was designed to handle c-style one-byte-per-char strings, and NOT
>> arbitrary bytes, and thus not to map directly to the py2 string type
>> (you can store null bytes in a py2 string).
>
> You can store null bytes in a Py2 string but you normally wouldn't if it was
> supposed to be text.
>
>>
>> Which brings me back to my original proposal: properly map the 'S'
>> type to the py3 data model, and maybe add some kind of fixed width
>> bytes style if there is a use case for that. I still have no idea what
>> the use case might be.
>>
>
> There would definitely be a use case for a fixed-byte-width
> bytes-representing-text dtype in record arrays to read from a binary file:
>
> dt = np.dtype([
> ('name', '|b8:utf-8'),
> ('param1', '<i4'),
> ('param2', '<i4')
> ...
> ])
>
> with open('binaryfile', 'rb') as fin:
> a = np.fromfile(fin, dtype=dt)
>
> You could also use this for ASCII if desired. I don't think it really matters
> that utf-8 uses variable width as long as a too long byte string throws an
> error (and does not truncate).
>
> For non 8-bit encodings there would have to be some way to handle endianness
> without a BOM, but otherwise I think that it's always possible to pad with zero
> *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip
> null *characters* after decoding. i.e.:
>
> $ cat tmp.py
> import encodings
>
> def test_encoding(s1, enc):
> b = s1.encode(enc).ljust(32, b'\0')
> s2 = b.decode(enc)
> index = s2.find('\0')
> if index != -1:
> s2 = s2[:index]
> assert s1 == s2, enc
>
> encodings_set = set(encodings.aliases.aliases.values())
>
> for N, enc in enumerate(encodings_set):
> try:
> test_encoding('qwe', enc)
> except LookupError:
> pass
>
> print('Tested %d encodings without error' % N)
> $ python3 tmp.py
> Tested 88 encodings without error
>
>> > If the trailing nulls are not removed then you would get:
>> >
>> >>>> a[0]
>> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>> >>>> len(a[0])
>> > 9
>> >
>> > And I'm sure that someone would get upset about that.
>>
>> Only if they are using it for text-which you "should not" do with py3.
>
> But people definitely are using it for text on Python 3. It should be
> deprecated in favour of something new but breaking it is just gratuitous.
> Numpy doesn't have the option to make a clean break with Python 3 precisely
> because it needs to straddle 2.x and 3.x while numpy-based applications are
> ported to 3.x.
>
>> > Some more oddities:
>> >
>> >>>> a[0] = 1
>> >>>> a
>> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
>> > dtype='|S9')
>> >>>> a[0] = None
>> >>>> a
>> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
>> > dtype='|S9')
>>
>> More evidence that this is a text type.....
>
> And the big one:
>
> $ python3
> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import numpy as np
>>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>>>> a
> array([b'asd', b'zxc'],
> dtype='|S3')
>>>> a[0] = 'qwer' # Unicode string again
>>>> a
> array([b'qwe', b'zxc'],
> dtype='|S3')
>>>> a[0] = 'Õscar'
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)

This looks mostly like casting rules to me, which appear to be ASCII-based
rather than using an arbitrary encoding.

>>> a = np.array(['asd', 'zxc'], dtype='S')
>>> b = a.astype('U')
>>> b[0] = 'Õscar'
>>> a[0] = 'Õscar'
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
a[0] = 'Õscar'
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> b
array(['Õsc', 'zxc'],
dtype='<U3')
>>> b.astype('S')
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> b.view('S4')
array([b'\xd5', b's', b'c', b'z', b'x', b'c'],
dtype='|S4')

>>> a.astype('U').astype('S')
array([b'asd', b'zxc'],
dtype='|S3')

Josef

>
> The analogous behaviour was very deliberately removed from Python 3:
>
>>>> a[0] == 'qwe'
> False
>>>> a[0] == b'qwe'
> True
>
>
> Oscar
j***@gmail.com
2014-01-23 16:23:09 UTC
Permalink
On Thu, Jan 23, 2014 at 10:41 AM, <***@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin
> <***@gmail.com> wrote:
>> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
>>> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <***@gmail.com> wrote:
>>>
>>> >
>>> > It's not safe to stop removing the null bytes. This is how numpy determines
>>> > the length of the strings in a dtype='S' array. The strings are not
>>> > "fixed-width" but rather have a maximum width.
>>>
>>> Exactly--but folks have told us on this list that they want (and are)
>>> using the 'S' style for arbitrary bytes, NOT for text. In which case
>>> you wouldn't want to remove null bytes. This is more evidence that 'S'
>>> was designed to handle c-style one-byte-per-char strings, and NOT
>>> arbitrary bytes, and thus not to map directly to the py2 string type
>>> (you can store null bytes in a py2 string).
>>
>> You can store null bytes in a Py2 string but you normally wouldn't if it was
>> supposed to be text.
>>
>>>
>>> Which brings me back to my original proposal: properly map the 'S'
>>> type to the py3 data model, and maybe add some kind of fixed width
>>> bytes style if there is a use case for that. I still have no idea what
>>> the use case might be.
>>>
>>
>> There would definitely be a use case for a fixed-byte-width
>> bytes-representing-text dtype in record arrays to read from a binary file:
>>
>> dt = np.dtype([
>> ('name', '|b8:utf-8'),
>> ('param1', '<i4'),
>> ('param2', '<i4')
>> ...
>> ])
>>
>> with open('binaryfile', 'rb') as fin:
>> a = np.fromfile(fin, dtype=dt)
>>
>> You could also use this for ASCII if desired. I don't think it really matters
>> that utf-8 uses variable width as long as a too long byte string throws an
>> error (and does not truncate).
>>
>> For non 8-bit encodings there would have to be some way to handle endianness
>> without a BOM, but otherwise I think that it's always possible to pad with zero
>> *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip
>> null *characters* after decoding. i.e.:
>>
>> $ cat tmp.py
>> import encodings
>>
>> def test_encoding(s1, enc):
>> b = s1.encode(enc).ljust(32, b'\0')
>> s2 = b.decode(enc)
>> index = s2.find('\0')
>> if index != -1:
>> s2 = s2[:index]
>> assert s1 == s2, enc
>>
>> encodings_set = set(encodings.aliases.aliases.values())
>>
>> for N, enc in enumerate(encodings_set):
>> try:
>> test_encoding('qwe', enc)
>> except LookupError:
>> pass
>>
>> print('Tested %d encodings without error' % N)
>> $ python3 tmp.py
>> Tested 88 encodings without error
>>
>>> > If the trailing nulls are not removed then you would get:
>>> >
>>> >>>> a[0]
>>> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> >>>> len(a[0])
>>> > 9
>>> >
>>> > And I'm sure that someone would get upset about that.
>>>
>>> Only if they are using it for text-which you "should not" do with py3.
>>
>> But people definitely are using it for text on Python 3. It should be
>> deprecated in favour of something new but breaking it is just gratuitous.
>> Numpy doesn't have the option to make a clean break with Python 3 precisely
>> because it needs to straddle 2.x and 3.x while numpy-based applications are
>> ported to 3.x.
>>
>>> > Some more oddities:
>>> >
>>> >>>> a[0] = 1
>>> >>>> a
>>> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
>>> > dtype='|S9')
>>> >>>> a[0] = None
>>> >>>> a
>>> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
>>> > dtype='|S9')
>>>
>>> More evidence that this is a text type.....
>>
>> And the big one:
>>
>> $ python3
>> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
>> [GCC 4.6.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> import numpy as np
>>>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>>>>> a
>> array([b'asd', b'zxc'],
>> dtype='|S3')
>>>>> a[0] = 'qwer' # Unicode string again
>>>>> a
>> array([b'qwe', b'zxc'],
>> dtype='|S3')
>>>>> a[0] = 'Õscar'
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>
> looks mostly like casting rules to me, which looks like ASCII based
> instead of an arbitrary encoding.
>
>>>> a = np.array(['asd', 'zxc'], dtype='S')
>>>> b = a.astype('U')
>>>> b[0] = 'Õscar'
>>>> a[0] = 'Õscar'
> Traceback (most recent call last):
> File "<pyshell#17>", line 1, in <module>
> a[0] = 'Õscar'
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
>>>> b
> array(['Õsc', 'zxc'],
> dtype='<U3')
>>>> b.astype('S')
> Traceback (most recent call last):
> File "<pyshell#19>", line 1, in <module>
> b.astype('S')
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
>>>> b.view('S4')
> array([b'\xd5', b's', b'c', b'z', b'x', b'c'],
> dtype='|S4')
>
>>>> a.astype('U').astype('S')
> array([b'asd', b'zxc'],
> dtype='|S3')


another curious example, encode utf-8 to latin-1 bytes

>>> b
array(['Õsc', 'zxc'],
dtype='<U3')
>>> b[0].encode('utf8')
b'\xc3\x95sc'
>>> b[0].encode('latin1')
b'\xd5sc'
>>> b.astype('S')
Traceback (most recent call last):
File "<pyshell#40>", line 1, in <module>
b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> c = b.view('S4').astype('S1').view('S3')
>>> c
array([b'\xd5sc', b'zxc'],
dtype='|S3')
>>> c[0].decode('latin1')
'Õsc'

--------
The original numpy py3 conversion used latin-1 as default
(It's still used in statsmodels, and I haven't looked at the structure
under the common py2-3 codebase)

if sys.version_info[0] >= 3:
import io
bytes = bytes
unicode = str
asunicode = str
def asbytes(s):
if isinstance(s, bytes):
return s
return s.encode('latin1')
def asstr(s):
if isinstance(s, str):
return s
return s.decode('latin1')

--------------

Josef

>
> Josef
>
>>
>> The analogous behaviour was very deliberately removed from Python 3:
>>
>>>>> a[0] == 'qwe'
>> False
>>>>> a[0] == b'qwe'
>> True
>>
>>
>> Oscar
Oscar Benjamin
2014-01-23 16:43:09 UTC
Permalink
On Thu, Jan 23, 2014 at 11:23:09AM -0500, ***@gmail.com wrote:
>
> another curious example, encode utf-8 to latin-1 bytes
>
> >>> b
> array(['Õsc', 'zxc'],
> dtype='<U3')
> >>> b[0].encode('utf8')
> b'\xc3\x95sc'
> >>> b[0].encode('latin1')
> b'\xd5sc'
> >>> b.astype('S')
> Traceback (most recent call last):
> File "<pyshell#40>", line 1, in <module>
> b.astype('S')
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
> >>> c = b.view('S4').astype('S1').view('S3')
> >>> c
> array([b'\xd5sc', b'zxc'],
> dtype='|S3')
> >>> c[0].decode('latin1')
> 'Õsc'

Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
ascii:

>>> np.array(['Õsc']).astype('S4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>> np.array(['Õsc']).view('S4')
array([b'\xd5', b's', b'c'],
dtype='|S4')

> --------
> The original numpy py3 conversion used latin-1 as default
> (It's still used in statsmodels, and I haven't looked at the structure
> under the common py2-3 codebase)
>
> if sys.version_info[0] >= 3:
> import io
> bytes = bytes
> unicode = str
> asunicode = str

These two functions are an abomination:

> def asbytes(s):
> if isinstance(s, bytes):
> return s
> return s.encode('latin1')
> def asstr(s):
> if isinstance(s, str):
> return s
> return s.decode('latin1')


Oscar
j***@gmail.com
2014-01-23 16:58:38 UTC
Permalink
On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
<***@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:23:09AM -0500, ***@gmail.com wrote:
>>
>> another curious example, encode utf-8 to latin-1 bytes
>>
>> >>> b
>> array(['Õsc', 'zxc'],
>> dtype='<U3')
>> >>> b[0].encode('utf8')
>> b'\xc3\x95sc'
>> >>> b[0].encode('latin1')
>> b'\xd5sc'
>> >>> b.astype('S')
>> Traceback (most recent call last):
>> File "<pyshell#40>", line 1, in <module>
>> b.astype('S')
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
>> position 0: ordinal not in range(128)
>> >>> c = b.view('S4').astype('S1').view('S3')
>> >>> c
>> array([b'\xd5sc', b'zxc'],
>> dtype='|S3')
>> >>> c[0].decode('latin1')
>> 'Õsc'
>
> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
> ascii:
>
>>>> np.array(['Õsc']).astype('S4')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>>> np.array(['Õsc']).view('S4')
> array([b'\xd5', b's', b'c'],
> dtype='|S4')


No, a view doesn't change the memory, it just changes the
interpretation and there shouldn't be any conversion involved.
astype does type conversion, but it goes through ascii encoding which fails.

>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>> b.tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> b.view('S12')
array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
dtype='|S12')

The conversion happens somewhere in the array creation, but I have no
idea about the memory encoding for ucs2 and the low-level layouts.

Josef

>
>> --------
>> The original numpy py3 conversion used latin-1 as default
>> (It's still used in statsmodels, and I haven't looked at the structure
>> under the common py2-3 codebase)
>>
>> if sys.version_info[0] >= 3:
>> import io
>> bytes = bytes
>> unicode = str
>> asunicode = str
>
> These two functions are an abomination:
>
>> def asbytes(s):
>> if isinstance(s, bytes):
>> return s
>> return s.encode('latin1')
>> def asstr(s):
>> if isinstance(s, str):
>> return s
>> return s.decode('latin1')
>
>
> Oscar
j***@gmail.com
2014-01-23 17:13:55 UTC
Permalink
On Thu, Jan 23, 2014 at 11:58 AM, <***@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
> <***@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:23:09AM -0500, ***@gmail.com wrote:
>>>
>>> another curious example, encode utf-8 to latin-1 bytes
>>>
>>> >>> b
>>> array(['Õsc', 'zxc'],
>>> dtype='<U3')
>>> >>> b[0].encode('utf8')
>>> b'\xc3\x95sc'
>>> >>> b[0].encode('latin1')
>>> b'\xd5sc'
>>> >>> b.astype('S')
>>> Traceback (most recent call last):
>>> File "<pyshell#40>", line 1, in <module>
>>> b.astype('S')
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
>>> position 0: ordinal not in range(128)
>>> >>> c = b.view('S4').astype('S1').view('S3')
>>> >>> c
>>> array([b'\xd5sc', b'zxc'],
>>> dtype='|S3')
>>> >>> c[0].decode('latin1')
>>> 'Õsc'
>>
>> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
>> ascii:
>>
>>>>> np.array(['Õsc']).astype('S4')
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>>>> np.array(['Õsc']).view('S4')
>> array([b'\xd5', b's', b'c'],
>> dtype='|S4')
>
>
> No, a view doesn't change the memory, it just changes the
> interpretation and there shouldn't be any conversion involved.
> astype does type conversion, but it goes through ascii encoding which fails.
>
>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>> b.tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>> b.view('S12')
> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
> dtype='|S12')
>
> The conversion happens somewhere in the array creation, but I have no
> idea about the memory encoding for uc2 and the low level layouts.

utf8 encoded bytes

>>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
>>> a
array([b'\xc3\x95sc', b'zxc'],
dtype='|S4')
>>> a.tostring()
b'\xc3\x95sczxc\x00'
>>> a.view('S8')
array([b'\xc3\x95sczxc'],
dtype='|S8')

>>> a[0].decode('latin1')
'Ã\x95sc'
>>> a[0].decode('utf8')
'Õsc'

Josef

>
> Josef
>
>>
>>> --------
>>> The original numpy py3 conversion used latin-1 as default
>>> (It's still used in statsmodels, and I haven't looked at the structure
>>> under the common py2-3 codebase)
>>>
>>> if sys.version_info[0] >= 3:
>>> import io
>>> bytes = bytes
>>> unicode = str
>>> asunicode = str
>>
>> These two functions are an abomination:
>>
>>> def asbytes(s):
>>> if isinstance(s, bytes):
>>> return s
>>> return s.encode('latin1')
>>> def asstr(s):
>>> if isinstance(s, str):
>>> return s
>>> return s.decode('latin1')
>>
>>
>> Oscar
j***@gmail.com
2014-01-23 17:42:13 UTC
Permalink
On Thu, Jan 23, 2014 at 12:13 PM, <***@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:58 AM, <***@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
>> <***@gmail.com> wrote:
>>> On Thu, Jan 23, 2014 at 11:23:09AM -0500, ***@gmail.com wrote:
>>>>
>>>> another curious example, encode utf-8 to latin-1 bytes
>>>>
>>>> >>> b
>>>> array(['Õsc', 'zxc'],
>>>> dtype='<U3')
>>>> >>> b[0].encode('utf8')
>>>> b'\xc3\x95sc'
>>>> >>> b[0].encode('latin1')
>>>> b'\xd5sc'
>>>> >>> b.astype('S')
>>>> Traceback (most recent call last):
>>>> File "<pyshell#40>", line 1, in <module>
>>>> b.astype('S')
>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
>>>> position 0: ordinal not in range(128)
>>>> >>> c = b.view('S4').astype('S1').view('S3')
>>>> >>> c
>>>> array([b'\xd5sc', b'zxc'],
>>>> dtype='|S3')
>>>> >>> c[0].decode('latin1')
>>>> 'Õsc'
>>>
>>> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
>>> ascii:
>>>
>>>>>> np.array(['Õsc']).astype('S4')
>>> Traceback (most recent call last):
>>> File "<stdin>", line 1, in <module>
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>>>>> np.array(['Õsc']).view('S4')
>>> array([b'\xd5', b's', b'c'],
>>> dtype='|S4')
>>
>>
>> No, a view doesn't change the memory, it just changes the
>> interpretation and there shouldn't be any conversion involved.
>> astype does type conversion, but it goes through ascii encoding which fails.
>>
>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>> b.tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>> b.view('S12')
>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>> dtype='|S12')
>>
>> The conversion happens somewhere in the array creation, but I have no
>> idea about the memory encoding for uc2 and the low level layouts.

>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>> b[0].tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>> 'Õsc'.encode('utf-32LE')
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'

Is that the encoding for 'U' ?

---
another side effect of null truncation: you cannot decode truncated data

>>> b.view('S4').tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> b.view('S4')[0]
b'\xd5'
>>> b.view('S4')[0].tostring()
b'\xd5'
>>> b.view('S4')[:1].tostring()
b'\xd5\x00\x00\x00'

>>> b.view('S4')[0].decode('utf-32LE')
Traceback (most recent call last):
File "<pyshell#101>", line 1, in <module>
b.view('S4')[0].decode('utf-32LE')
File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
return codecs.utf_32_le_decode(input, errors, True)
UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
0: truncated data

>>> b.view('S4')[:1].tostring().decode('utf-32LE')
'Õ'

numpy arrays need a decode and encode method
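
(For what it's worth, np.char already provides vectorized encode/decode --
not methods on the array itself, and they operate on the already
null-stripped elements, so the truncation problem above still bites. E.g.:)

>>> import numpy as np
>>> a = np.array([b'\xd5sc', b'zxc'], dtype='S3')
>>> np.char.decode(a, 'latin1')
array(['Õsc', 'zxc'],
      dtype='<U3')
>>> np.char.encode(np.char.decode(a, 'latin1'), 'latin1')
array([b'\xd5sc', b'zxc'],
      dtype='|S3')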

Josef

>
> utf8 encoded bytes
>
>>>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
>>>> a
> array([b'\xc3\x95sc', b'zxc'],
> dtype='|S4')
>>>> a.tostring()
> b'\xc3\x95sczxc\x00'
>>>> a.view('S8')
> array([b'\xc3\x95sczxc'],
> dtype='|S8')
>
>>>> a[0].decode('latin1')
> 'Ã\x95sc'
>>>> a[0].decode('utf8')
> 'Õsc'
>
> Josef
>
>>
>> Josef
>>
>>>
>>>> --------
>>>> The original numpy py3 conversion used latin-1 as default
>>>> (It's still used in statsmodels, and I haven't looked at the structure
>>>> under the common py2-3 codebase)
>>>>
>>>> if sys.version_info[0] >= 3:
>>>> import io
>>>> bytes = bytes
>>>> unicode = str
>>>> asunicode = str
>>>
>>> These two functions are an abomination:
>>>
>>>> def asbytes(s):
>>>> if isinstance(s, bytes):
>>>> return s
>>>> return s.encode('latin1')
>>>> def asstr(s):
>>>> if isinstance(s, str):
>>>> return s
>>>> return s.decode('latin1')
>>>
>>>
>>> Oscar
Oscar Benjamin
2014-01-23 18:36:57 UTC
Permalink
On 23 January 2014 17:42, <***@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 12:13 PM, <***@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:58 AM, <***@gmail.com> wrote:
>>>
>>> No, a view doesn't change the memory, it just changes the
>>> interpretation and there shouldn't be any conversion involved.
>>> astype does type conversion, but it goes through ascii encoding which fails.
>>>
>>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>>> b.tostring()
>>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>>> b.view('S12')
>>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>> dtype='|S12')
>>>
>>> The conversion happens somewhere in the array creation, but I have no
>>> idea about the memory encoding for uc2 and the low level layouts.
>
>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>> b[0].tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>>> 'Õsc'.encode('utf-32LE')
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>
> Is that the encoding for 'U' ?

On a little-endian system, yes. I realise what's happening now. 'U'
represents unicode characters as a 32-bit unsigned integer giving the
code point of the character. The first 256 code points are exactly the
256 characters representable with latin-1 in the same order.

So 'Õ' has the code point 0xd5 and is encoded as the byte 0xd5 in
latin-1. As a 32 bit integer the code point is 0x000000d5 but in
little-endian format that becomes the 4 bytes 0xd5,0x00,0x00,0x00. So
when you reinterpret that as 'S4' it strips the remaining nulls to get
the byte string b'\xd5'. Which is the latin-1 encoding for the
character. The same will happen for any string of latin-1 characters.
However if you do have a code point of 256 or greater then you'll get
a byte strings of length 2 or more.

On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.
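
(You can check the byte order without big-endian hardware by forcing it in
the dtype:)

>>> import numpy as np
>>> np.array(['Õ'], dtype='>U1').tostring()
b'\x00\x00\x00\xd5'
>>> np.array(['Õ'], dtype='<U1').tostring()
b'\xd5\x00\x00\x00'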

> another sideeffect of null truncation: cannot decode truncated data
>
>>>> b.view('S4').tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>> b.view('S4')[0]
> b'\xd5'
>>>> b.view('S4')[0].tostring()
> b'\xd5'
>>>> b.view('S4')[:1].tostring()
> b'\xd5\x00\x00\x00'
>
>>>> b.view('S4')[0].decode('utf-32LE')
> Traceback (most recent call last):
> File "<pyshell#101>", line 1, in <module>
> b.view('S4')[0].decode('utf-32LE')
> File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
> return codecs.utf_32_le_decode(input, errors, True)
> UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
> 0: truncated data
>
>>>> b.view('S4')[:1].tostring().decode('utf-32LE')
> 'Õ'
>
> numpy arrays need a decode and encode method

I'm not sure that they do. Rather there needs to be a text dtype that
knows what encoding to use in order to have a binary interface as
exposed by .tostring() and friends, but produce unicode strings
when indexed from Python code. Having both a text and a binary
interface to the same data implies having an encoding.


Oscar
j***@gmail.com
2014-01-23 20:18:18 UTC
Permalink
On Thu, Jan 23, 2014 at 1:36 PM, Oscar Benjamin
<***@gmail.com> wrote:
> On 23 January 2014 17:42, <***@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 12:13 PM, <***@gmail.com> wrote:
>>> On Thu, Jan 23, 2014 at 11:58 AM, <***@gmail.com> wrote:
>>>>
>>>> No, a view doesn't change the memory, it just changes the
>>>> interpretation and there shouldn't be any conversion involved.
>>>> astype does type conversion, but it goes through ascii encoding which fails.
>>>>
>>>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>>>> b.tostring()
>>>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>>>> b.view('S12')
>>>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>>> dtype='|S12')
>>>>
>>>> The conversion happens somewhere in the array creation, but I have no
>>>> idea about the memory encoding for uc2 and the low level layouts.
>>
>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>> b[0].tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>>>> 'Õsc'.encode('utf-32LE')
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>
>> Is that the encoding for 'U' ?
>
> On a little-endian system, yes. I realise what's happening now. 'U'
> represents unicode characters as a 32-bit unsigned integer giving the
> code point of the character. The first 256 code points are exactly the
> 256 characters representable with latin-1 in the same order.
>
> So 'Õ' has the code point 0xd5 and is encoded as the byte 0xd5 in
> latin-1. As a 32 bit integer the code point is 0x000000d5 but in
> little-endian format that becomes the 4 bytes 0xd5,0x00,0x00,0x00. So
> when you reinterpret that as 'S4' it strips the remaining nulls to get
> the byte string b'\xd5'. Which is the latin-1 encoding for the
> character. The same will happen for any string of latin-1 characters.
> However if you do have a code point of 256 or greater then you'll get
> a byte strings of length 2 or more.
>
> On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.

A curious consequence of this, if we have only 1-character elements:

>>> a = np.array([si.encode('utf-16LE') for si in ['Õ', 'z']], dtype='S')
>>> a32 = np.array([si.encode('utf-32LE') for si in ['Õ', 'z']], dtype='S')
>>> a[0], a32[0]
(b'\xd5', b'\xd5')
>>> a[0] == a32[0]
True

>>> a32 = np.array([si.encode('utf-32BE') for si in ['Õ', 'z']], dtype='S')
>>> a = np.array([si.encode('utf-16BE') for si in ['Õ', 'z']], dtype='S')
>>> a[0], a32[0]
(b'\x00\xd5', b'\x00\x00\x00\xd5')
>>> a[0] == a32[0]
False

Josef



>
>> another sideeffect of null truncation: cannot decode truncated data
>>
>>>>> b.view('S4').tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>> b.view('S4')[0]
>> b'\xd5'
>>>>> b.view('S4')[0].tostring()
>> b'\xd5'
>>>>> b.view('S4')[:1].tostring()
>> b'\xd5\x00\x00\x00'
>>
>>>>> b.view('S4')[0].decode('utf-32LE')
>> Traceback (most recent call last):
>> File "<pyshell#101>", line 1, in <module>
>> b.view('S4')[0].decode('utf-32LE')
>> File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
>> return codecs.utf_32_le_decode(input, errors, True)
>> UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
>> 0: truncated data
>>
>>>>> b.view('S4')[:1].tostring().decode('utf-32LE')
>> 'Õ'
>>
>> numpy arrays need a decode and encode method
>
> I'm not sure that they do. Rather there needs to be a text dtype that
> knows what encoding to use in order to have a binary interface as
> exposed by .tostring() and friends, but produce unicode strings
> when indexed from Python code. Having both a text and a binary
> interface to the same data implies having an encoding.
>
>
> Oscar
Chris Barker
2014-01-23 18:49:42 UTC
Permalink
Thanks for poking into this all. I've lost track a bit, but I think:

The 'S' type is clearly broken on py3 (at least). I think that gives us
room to change it, and backward compatibility is less of an issue because it's
broken already -- do we need to preserve bug-for-bug compatibility? Maybe,
but I suspect in this case, not -- the code that "works fine" on py3 with
the 'S' type is probably only lucky that it hasn't encountered the issues
yet.

And no matter how you slice it, code being ported to py3 needs to deal with
text handling issues.

But here is where we stand:

The 'S' dtype:

- was designed for one-byte-per-char text data.
- was mapped to the py2 string type.
- used the classic C null-terminated approach.
- can be used for arbitrary bytes (as the py2 string type can), but not
quite, as it truncates null bytes -- so it is really a bad idea to use it that
way.
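
For example, arbitrary bytes with trailing nulls don't round-trip:

>>> import numpy as np
>>> data = b'ab\x00\x00'
>>> a = np.array([data], dtype='S4')
>>> a[0]
b'ab'
>>> a[0] == data
False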

Under py3:
The 'S' type maps to the py3 bytes type, because that's the closest to
the py2 string type. But it also does some inconsistent things with
encoding, and does treat a lot of other things as text. But the py3 bytes
type does not have the same text handling as the py2 string type, so things
like:

s = 'a string'
np.array((s,), dtype='S')[0] == s

Gives you False, rather than True on py2. This is because a py3 string is
translated to the 'S' type (presumably with the default encoding -- another
maybe-not-good idea), and returns a bytes object, which does not compare
true to a py3 string. You can work around this with various calls to
encode() and decode(), and/or using b'a string', but that is ugly, kludgy,
and doesn't work well with the py3 text model.


The py2 => py3 transition separated bytes and strings: strings are unicode,
and bytes are not to be used for text (directly). While there is some
text-related functionality still in bytes, the core devs are quite clear
that that is for special cases only, and not for general text processing.

I don't think numpy should fight this, but rather embrace the py3 text
model. The most natural way to do that is to use the existing 'U' dtype for
text. Really the best solution for most cases. (Like the above case)

However, there is a use case for a more efficient way to deal with text.
There are a couple ways to go about that that have been brought up here:

1: have a more efficient unicode dtype: variable length,
multiple encoding options, etc....
- This is a fine idea that would support better text handling in numpy,
and _maybe_ better interaction with external libraries (HDF, etc...)

2: Have a one-byte-per-char text dtype:
- This would be much easier to implement fit into the current numpy
model, and satisfy a lot of common use cases for scientific data sets.

We could certainly do both, but I'd like to see (2) get done sooner than
later....

A related issue is whether numpy needs a dtype analogous to py3 bytes --
I'm still not sure of the use-case there, so can't comment -- would it need
to be fixed length (fitting into the numpy data model better) or variable
length, or ??? Some folks are (apparently) using the current 'S' type in
this way, but I think that's ripe for errors, due to the null bytes issue.
Though maybe there is a null-bytes-are-special binary format that isn't
text -- I have no idea.

So what do we do with 'S'? It really is pretty broken, so we have a couple
choices:

(1) depricate it, so that it stays around for backward compatibility
but encourage people to either use 'U' for text, or one of the new dtypes
that are yet to be implemented (maybe 's' for a one-byte-per-char dtype),
and use either uint8 or the new bytes dtype that is yet to be implemented.

(2) fix it -- in this case, I think we need to be clear what it is:
     -- A one-byte-per-char text type? If so, it should map to a py3 string,
and have a defined encoding (ascii or latin-1, probably), or even better a
settable encoding (but only for one-byte-per-char encodings -- I don't
think utf-8 is a good idea here, as a utf-8 encoded string is of unknown
length. (there is some room for debate here, as the 'S' type is fixed
length and truncates anyway -- maybe it's fine for it to truncate utf-8, as
long as it doesn't partially truncate in the middle of a character)

     -- a bytes type? in which case, we should clean out all the
automatic conversions to/from text that are in it now.

I vote for it being our one-byte text type -- it almost is already, and it
would make the easiest transition for folks from py2 to py3. But backward
compatibility is backward compatibility.
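
(On the utf-8 truncation point above: character-safe truncation is cheap. A
minimal sketch, assuming valid utf-8 input -- truncate_utf8 is a
hypothetical helper, not an existing API:)

def truncate_utf8(data, nbytes):
    # Cut at nbytes, then let the decoder drop any dangling partial
    # character at the end; re-encode to get clean bytes back.
    return data[:nbytes].decode('utf-8', 'ignore').encode('utf-8')

>>> truncate_utf8('Õscar'.encode('utf-8'), 1)  # 'Õ' is 2 bytes; don't split it
b''
>>> truncate_utf8('Õscar'.encode('utf-8'), 3)
b'\xc3\x95s'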

> numpy arrays need a decode and encode method


> I'm not sure that they do. Rather there needs to be a text dtype that
> knows what encoding to use in order to have a binary interface as
> exposed by .tostring() and friends, but produce unicode strings
> when indexed from Python code. Having both a text and a binary
> interface to the same data implies having an encoding.


I agree with Oscar here -- let's not conflate encoded and decoded data --
the py3 text model is a fine one, we should work with it as much
as practical.

UNLESS: if we do add a bytes dtype, then it would be a reasonable use case
to use it to store encoded text (just like the py3 bytes types), in which
case it would be good to have encode() and decode() methods or ufuncs --
probably ufuncs. But that should be for special purpose, at the I/O
interface kind of stuff.

-Chris


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
j***@gmail.com
2014-01-23 19:18:20 UTC
Permalink
On Thu, Jan 23, 2014 at 1:49 PM, Chris Barker <***@noaa.gov> wrote:

>
> s = 'a string'
> np.array((s,), dtype='S')[0] == s
>
> Gives you False, rather than True on py2. This is because a py3 string is
> translated to the 'S' type (presumable with the default encoding, another
> maybe not a good idea, but returns a bytes object, which does not compare
> true to a py3 string. YOu can work aroudn this with varios calls to encode()
> and decode, and/or using b'a string', but that is ugly, kludgy, and doesn't
> work well with the py3 text model.

I think this is just inconsistent casting rules in numpy,

numpy should either refuse to assign the wrong type, instead of using
the repr as in some of the earlier examples of Oscar

>>> s = np.inf
>>> np.array((s,), dtype=int)[0] == s
Traceback (most recent call last):
File "<pyshell#126>", line 1, in <module>
np.array((s,), dtype=int)[0] == s
OverflowError: cannot convert float infinity to integer

or use the **same** conversion/casting rules also during the
interaction with python as are used in assignments and array creation.

Josef
Chris Barker
2014-01-23 19:45:40 UTC
Permalink
On Thu, Jan 23, 2014 at 11:18 AM, <***@gmail.com> wrote:


> I think this is just inconsistent casting rules in numpy,
>
> numpy should either refuse to assign the wrong type, instead of using
> the repr as in some of the earlier examples of Oscar
>
> >>> s = np.inf
> >>> np.array((s,), dtype=int)[0] == s
> Traceback (most recent call last):
> File "<pyshell#126>", line 1, in <module>
> np.array((s,), dtype=int)[0] == s
> OverflowError: cannot convert float infinity to integer
>
> or use the **same** conversion/casting rules also during the
> interaction with python as are used in assignments and array creation.
>

Exactly -- but what should those conversion/casting rules be? We can't
decide that unless we decide if 'S' is for text or for arbitrary bytes --
it can't be both. I say text, that's what it's mostly trying to do already.
But if it's bytes, fine, then some things still need cleaning up, and we
could really use a one-byte-text type. and if it's text, then we may need
a bytes dtype.

Key here is that we don't have the option of not breaking anything,
because there is a lot already broken.

-Chris


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
j***@gmail.com
2014-01-23 20:10:34 UTC
Permalink
On Thu, Jan 23, 2014 at 2:45 PM, Chris Barker <***@noaa.gov> wrote:
> On Thu, Jan 23, 2014 at 11:18 AM, <***@gmail.com> wrote:
>
>>
>> I think this is just inconsistent casting rules in numpy,
>>
>> numpy should either refuse to assign the wrong type, instead of using
>> the repr as in some of the earlier examples of Oscar
>>
>> >>> s = np.inf
>> >>> np.array((s,), dtype=int)[0] == s
>> Traceback (most recent call last):
>> File "<pyshell#126>", line 1, in <module>
>> np.array((s,), dtype=int)[0] == s
>> OverflowError: cannot convert float infinity to integer
>>
>> or use the **same** conversion/casting rules also during the
>> interaction with python as are used in assignments and array creation.
>
>
> Exactly -- but what should those conversion/casting rules be? We can't
> decide that unless we decide if 'S' is for text or for arbitrary bytes -- it
> can't be both. I say text, that's what it's mostly trying to do already. But
> if it's bytes, fine, then some things still need cleaning up, and we could
> really use a one-byte-text type. and if it's text, then we may need a bytes
> dtype.

(remember I'm just a balcony muppet)

As far as I understand all codecs have the same ascii part. So I would
cast on ascii and raise on anything else.

or follow whatever the convention of numpy is:

>>> s = -256
>>> np.array((s,), dtype=np.uint8)[0] == s
False
>>> s = -1
>>> np.array((s,), dtype=np.uint8)[0] == s
False


Josef

>
> Key here is that we don't have the option of not breaking anything, because
> there is a lot already broken.
>
> -Chris
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception
>
> ***@noaa.gov
>
Chris Barker
2014-01-23 21:51:14 UTC
Permalink
On Thu, Jan 23, 2014 at 12:10 PM, <***@gmail.com> wrote:

> > Exactly -- but what should those conversion/casting rules be? We can't
> > decide that unless we decide if 'S' is for text or for arbitrary bytes
> -- it
> > can't be both. I say text, that's what it's mostly trying to do already.
> But
> > if it's bytes, fine, then some things still need cleaning up, and we
> could
> > really use a one-byte-text type. and if it's text, then we may need a
> bytes
> > dtype.
>
> (remember I'm just a balcony muppet)
>

me too ;-)



> As far as I understand all codecs have the same ascii part.


nope -- certainly not multi-byte codecs. And one of the key points of utf-8
is that the ascii part is compatible -- none of the other full-unicode
encodings are.

many of the one-byte-per-char ones do share the ascii part, but not all, or
not completely.
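
For instance:

>>> 'abc'.encode('utf-8')      # ascii-compatible by design
b'abc'
>>> 'abc'.encode('utf-16-le')  # multi-byte, not ascii-compatible
b'a\x00b\x00c\x00'
>>> 'abc'.encode('cp500')      # one byte per char, but EBCDIC-based
b'\x81\x82\x83'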

> So I would
> cast on ascii and raise on anything else.
>

still a fine option -- clearly defined and quite useful for scientific
text. However, I would prefer latin-1 -- that way you might get garbage
for the non-ascii parts, but it wouldn't raise an exception and it
round-trips through encoding/decoding. And you would have a somewhat more
useful subset -- including the latin-language character and symbols like
the degree symbol, etc.


> or follow whatever the convention of numpy is:
>
> >>> s = -256
> >>> np.array((s,), dtype=np.uint8)[0] == s
> False
> >>> s = -1
> >>> np.array((s,), dtype=np.uint8)[0] == s
> False
>

I think text is distinct enough from numbers that we don't need to do
that same thing -- and this is a result of well-defined casting rules built
into the compiler (and hardware?) for the numeric types. I don't think we
have either the standard or compiler support for text conversions like that.

-CHB

PS: this is interesting, on py2:


In [176]: a = np.array((2222,), dtype='S')

In [177]: a
Out[177]:
array(['2'],
dtype='|S1')

It converts it to a string, but only grabs the first character? (is
it determining the size before converting to a string?)

and this:

In [182]: a = np.array(2222, dtype='S')

In [183]: a
Out[183]:
array('2222',
dtype='|S24')

24? Where did that come from?

>
> Josef
>
> >
> > Key here is that we don't have the option of not breaking anything,
> because
> > there is a lot already broken.
> >
> > -Chris
> >
> >
> > --
> >
> > Christopher Barker, Ph.D.
> > Oceanographer
> >
> > Emergency Response Division
> > NOAA/NOS/OR&R (206) 526-6959 voice
> > 7600 Sand Point Way NE (206) 526-6329 fax
> > Seattle, WA 98115 (206) 526-6317 main reception
> >
> > ***@noaa.gov
> >



--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
j***@gmail.com
2014-01-23 23:56:36 UTC
Permalink
On Thu, Jan 23, 2014 at 4:51 PM, Chris Barker <***@noaa.gov> wrote:
> On Thu, Jan 23, 2014 at 12:10 PM, <***@gmail.com> wrote:
>>
>> > Exactly -- but what should those conversion/casting rules be? We can't
>> > decide that unless we decide if 'S' is for text or for arbitrary bytes
>> > -- it
>> > can't be both. I say text, that's what it's mostly trying to do already.
>> > But
>> > if it's bytes, fine, then some things still need cleaning up, and we
>> > could
>> > really use a one-byte-text type. and if it's text, then we may need a
>> > bytes
>> > dtype.
>>
>> (remember I'm just a balcony muppet)
>
>
> me too ;-)
>
>
>>
>> As far as I understand all codecs have the same ascii part.
>
>
> nope -- certainly not multi-byte codecs. And one of the key points of utf-8
> is that the ascii part is compatible -- none of teh other full-unicode
> encoding are.
>
> many of the one-byte-per-char ones do share the ascii part, but not all, or
> not completely.
>
>> So I would
>> cast on ascii and raise on anything else.
>
>
> still a fine option -- clearly defined and quite useful for scientific text.
> However, I would prefer latin-1 -- that way you might get garbage for the
> non-ascii parts, but it wouldn't raise an exception and it round-trips
> through encoding/decoding. And you would have a somewhat more useful subset
> -- including the latin-language character and symbols like the degree
> symbol, etc.

I'm not sure anymore, after all these threads I think bytes should be
bytes and strings should be strings

>>> x = np.array(['hugo'], 'S')
Traceback (most recent call last):
File "<pyshell#61>", line 1, in <module>
x = np.array(['hugo'], 'S')
ValueError: could not convert string to bytes: 'hugo'

>>> x = np.array([b'hugo'], 'S')
>>>

but with support for text arrays as Oscar showed, to make it easy to
convert between 'S' and 'S:encoding', or to use either view on the
memory.
I like the idea of an `encoding_view` on some 'S' bytes, and once we
have a view like that there is no reason to pretend 'S' bytes are
text.


>
>>
>> or follow whatever the convention of numpy is:
>>
>> >>> s = -256
>> >>> np.array((s,), dtype=np.uint8)[0] == s
>> False
>> >>> s = -1
>> >>> np.array((s,), dtype=np.uint8)[0] == s
>> False
>
>
> I think text is distinct enough from numbers that we don't need to do that
> same thing -- and this is result of well-defined casting rules built into
> the compiler (and hardware?) for the numeric types. I dont hink we have
> either the standard or compiler support for text conversions like that.
>
> -CHB
>
> PS: this is interesting, on py2:
>
>
> In [176]: a = np.array((2222,), dtype='S')
>
> In [177]: a
> Out[177]:
> array(['2'],
> dtype='|S1')
>
> It converts it to a string, but only grabs the first character? (is it
> determining the size before converting to a string?

I recently fixed a bug in statsmodels based on this. I don't know why
the code worked before, I assume it used string integers instead of
integers at some point when it was written.

>
> and this:
>
> In [182]: a = np.array(2222, dtype='S')
>
> In [183]: a
> Out[183]:
> array('2222',
> dtype='|S24')
>
> 24 ? where did that come from?

No idea.

Unless I missed something when I wasn't paying attention, there was
never any discussion on the mailing list before about bytes versus
strings in python 3 in numpy (I don't follow numpy's "issues").
And I don't remember (m)any public complaints about the behavior of
the 'S' type in strange cases.

maybe I didn't pay attention because I didn't care, until we ran into
the python 3 problems. maybe nobody else did either.


Josef

>
>>
>>
>> Josef
>>
>> >
>> > Key here is that we don't have the option of not breaking anything,
>> > because
>> > there is a lot already broken.
>> >
>> > -Chris
>> >
>> >
>> > --
>> >
>> > Christopher Barker, Ph.D.
>> > Oceanographer
>> >
>> > Emergency Response Division
>> > NOAA/NOS/OR&R (206) 526-6959 voice
>> > 7600 Sand Point Way NE (206) 526-6329 fax
>> > Seattle, WA 98115 (206) 526-6317 main reception
>> >
>> > ***@noaa.gov
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-***@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-***@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception
>
> ***@noaa.gov
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-***@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
Chris Barker
2014-01-24 01:12:35 UTC
Permalink
On Thu, Jan 23, 2014 at 3:56 PM, <***@gmail.com> wrote:

>
> I'm not sure anymore, after all these threads I think bytes should be
> bytes and strings should be strings
>

exactly -- that's the py3 model, and I think we really should try to conform
to it; it's really the only way to have a robust solution.


> I like the idea of an `encoding_view` on some 'S' bytes, and once we
> have a view like that there is no reason to pretend 'S' bytes are
> text.


right, then they are bytes, not text. period.

I'm not sure if we should conflate encoded text and arbitrary bytes, but it
does make sense to build encoded text on a bytes object.

> maybe I didn't pay attention because I didn't care, until we ran into
> the python 3 problems. maybe nobody else did either.
>

yup -- I think this didn't get a whole lot of review or testing....

-Chris


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Oscar Benjamin
2014-01-24 00:02:26 UTC
Permalink
On 23 January 2014 21:51, Chris Barker <***@noaa.gov> wrote:
>
> However, I would prefer latin-1 -- that way you might get garbage for the
> non-ascii parts, but it wouldn't raise an exception and it round-trips
> through encoding/decoding. And you would have a somewhat more useful subset
> -- including the latin-language character and symbols like the degree
> symbol, etc.

Exceptions and error messages are a good thing! Garbage is not!!! :)


Oscar
Chris Barker
2014-01-24 01:09:28 UTC
Permalink
On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin
<***@gmail.com> wrote:

> On 23 January 2014 21:51, Chris Barker <***@noaa.gov> wrote:
> >
> > However, I would prefer latin-1 -- that way you might get garbage for
> the
> > non-ascii parts, but it wouldn't raise an exception and it round-trips
> > through encoding/decoding. And you would have a somewhat more useful
> subset
> > -- including the latin-language character and symbols like the degree
> > symbol, etc.
>
> Exceptions and error messages are a good thing! Garbage is not!!! :)
>

in principle, I agree with you, but sometimes practicality beats purity.

in py2 there is a lot of implicit encoding/decoding going on, using the
system encoding. That is ascii on a lot of systems. The result is that
there is a lot of code out there that folks have ported to use unicode, but
missed a few corners. If that code is only tested with ascii, it all seems
to be working, but then out in the wild someone
puts another character in there and presto -- a crash.

Also, there are places where the inability to encode causes silent failures
-- for instance if an Exception is raised with a unicode message, it will
get silently dropped when it comes time to display on the terminal. I spent
quite a while banging my head against that one recently when I tried to
update some code to read unicode files. I would have been MUCH happier with
a bit of garbage in the message than having it drop (or raise
an encoding error in the middle of the error...)

I think this is a bad thing.

The advantage of latin-1 is that while you might get something that
doesn't print right, it won't crash, and it won't contaminate the data, so
comparisons, etc., will still work. Kind of like using utf-8 in an old-style
c char array -- you can still pass it around and compare it, even if the
bytes don't mean what you think they do.
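
A quick illustration of that round-trip property (the byte values here
are just an example):

raw = b'caf\xe9 25\xb0'        # bytes in some unknown single-byte encoding
s = raw.decode('latin-1')      # never raises, whatever the bytes are
s.encode('latin-1') == raw     # True: the original bytes come back exactly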

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Oscar Benjamin
2014-01-24 01:41:24 UTC
Permalink
On 24 January 2014 01:09, Chris Barker <***@noaa.gov> wrote:
> On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin <***@gmail.com>
> wrote:
>>
>> On 23 January 2014 21:51, Chris Barker <***@noaa.gov> wrote:
>> >
>> > However, I would prefer latin-1 -- that way you might get garbage for
>> > the
>> > non-ascii parts, but it wouldn't raise an exception and it round-trips
>> > through encoding/decoding. And you would have a somewhat more useful
>> > subset
>> > -- including the latin-language character and symbols like the degree
>> > symbol, etc.
>>
>> Exceptions and error messages are a good thing! Garbage is not!!! :)
>
> in principle, I agree with you, but sometimes practicality beats purity.
>
> in py2 there is a lot of implicit encoding/decoding going on, using the
> system encoding. That is ascii on a lot of systems. The result is that there
> is a lot of code out there that folks have ported to use unicode, but missed
> a few corners. If that code is only tested with ascii, it all seems to be
> working, but then out in the wild someone puts another character in there and
> presto -- a crash.

Precisely. The Py3 text model uses TypeErrors to warn early against
this kind of thing. No longer do you have code that seems to work
until the wrong character goes in. You get the error straight away
when you try to mix bytes and text. You still have the option to
silence those errors: it just needs to be done explicitly:

>>> s = 'Õscar'
>>> s.encode('ascii', errors='replace')
b'?scar'

> Also, there are places where the inability to encode causes silent failures --
> for instance if an Exception is raised with a unicode message, it will get
> silently dropped when it comes time to display on the terminal. I spent
> quite a while banging my head against that one recently when I tried to
> update some code to read unicode files. I would have been MUCH happier with
> a bit of garbage in the message than having it drop (or raise an encoding
> error in the middle of the error...)

Yeah, that's just a bug in CPython. I think it's fixed now but either
way you're right: for the particular case of displaying error messages
the interpreter should do whatever it takes to get some kind of error
message out even if it's a bit garbled. I disagree that this should be
the basis for ordinary data processing with numpy though.

> I think this is a bad thing.
>
> The advantage of latin-1 is that while you might get something that doesn't
> print right, it won't crash, and it won't contaminate the data, so
> comparisons, etc, will still work. kind of like using utf-8 in an old-style
> c char array -- you can still passi t around and copare it, even if the
> bytes dont mean what you think they do.

It round trips okay as long as you don't try to do anything else with
the string. So does the textarray class I proposed in a new thread: If
you just use fromfile and tofile it works fine for any input (except
for trailing nulls) but if you try to decode invalid bytes it will
throw errors. It wouldn't be hard to add configurable error-handling
there either.


Oscar
j***@gmail.com
2014-01-23 19:49:17 UTC
Permalink
>> > numpy arrays need a decode and encode method
>
>
>> I'm not sure that they do. Rather there needs to be a text dtype that
>> knows what encoding to use in order to have a binary interface as
>> exposed by .tostring() and friends and but produce unicode strings
>> when indexed from Python code. Having both a text and a binary
>> interface to the same data implies having an encoding.
>
>
> I agree with Oscar here -- let's not conflate encode and decoded data --
> the py3 text model is a fine one, we should work with it as much as
> practical.
>
> UNLESS: if we do add a bytes dtype, then it would be a reasonable use case
> to use it to store encoded text (just like the py3 bytes types), in which
> case it would be good to have encode() and decode() methods or ufuncs --
> probably ufuncs. But that should be for special purpose, at the I/O
> interface kind of stuff.
>

I think we need both things: changing the memory and changing the view.

The same way we can convert between int and float and complex (trunc,
astype, real, ...) we should be able to convert between bytes and any
string (text) dtypes, i.e. decode and encode.

I'm reading a file in binary and then want to convert it to unicode;
only then I realize I have only ascii and want to convert to something
less memory-hungry.

views don't care about what the content means; it just has to be
memory compatible. I can view anything as an 'S' or a 'uint' (I
think).
What we currently don't have is a string/text view on S that would
interact with python as string.
(that's a vote in favor of a minimal one char string dtype that would
work for a limited number of encodings.)
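
What exists today is copy-based conversion via np.char -- roughly the
encode/decode half of this (the 'S:encoding' view itself is still
hypothetical):

import numpy as np

a = np.array([b'abc', b'xyz'], dtype='S3')
a.view(np.uint8)                  # same memory, seen as raw unsigned bytes
u = np.char.decode(a, 'ascii')    # new 'U' array: bytes -> text (a copy)
b = np.char.encode(u, 'latin-1')  # and back to bytes in another encoding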

Josef
Charles R Harris
2014-01-25 16:33:40 UTC
Permalink
On Thu, Jan 23, 2014 at 11:49 AM, Chris Barker <***@noaa.gov> wrote:

> Thanks for poking into this all. I've lost track a bit, but I think:
>
> The 'S' type is clearly broken on py3 (at least). I think that gives us
> room to change it, and backward compatibly is less of an issue because it's
> broken already -- do we need to preserve bug-for-bug compatibility? Maybe,
> but I suspect in this case, not -- the code the "works fine" on py3 with
> the 'S' type is probably only lucky that it hasn't encountered the issues
> yet.
>
> And no matter how you slice it, code being ported to py3 needs to deal
> with text handling issues.
>
> But here is where we stand:
>
> The 'S' dtype:
>
> - was designed for one-byte-per-char text data.
> - was mapped to the py2 string type.
> - used the classic C null-terminated approach.
> - can be used for arbitrary bytes (as the py2 string type can), but not
> quite, as it truncates null bytes -- so it's really a bad idea to use it that
> way.
>
> Under py3:
> The 'S' type maps to the py3 bytes type, because that's the closest to
> the py2 string type. But it also does some inconsistent things with
> encoding, and does treat a lot of other things as text. But the py3 bytes
> type does not have the same text handling as the py2 string type, so things
> like:
>
> s = 'a string'
> np.array((s,), dtype='S')[0] == s
>
> Gives you False, rather than True as on py2. This is because a py3 string is
> translated to the 'S' type (presumably with the default encoding -- another
> maybe-not-good idea), but indexing returns a bytes object, which does not
> compare true to a py3 string. You can work around this with various calls to
> encode() and decode(), and/or using b'a string', but that is ugly, kludgy,
> and doesn't work well with the py3 text model.
>
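
A minimal py3 demonstration of that mismatch, and the explicit decode it
forces on you:

import numpy as np

s = 'a string'
a = np.array((s,), dtype='S')   # silently ascii-encodes on py3
a[0] == s                       # False: a[0] is bytes, s is str
a[0].decode('ascii') == s       # True, but only after an explicit decode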
>
> The py2 => py3 transition separated bytes and strings: strings are
> unicode, and bytes are not to be used for text (directly). While there is
> some text-related functionality still in bytes, the core devs are quite
> clear that that is for special cases only, and not for general text
> processing.
>
> I don't think numpy should fight this, but rather embrace the py3 text
> model. The most natural way to do that is to use the existing 'U' dtype for
> text. Really the best solution for most cases. (Like the above case)
>
> However, there is a use case for a more efficient way to deal with text.
> There are a couple ways to go about that that have been brought up here:
>
> 1: have a more efficient unicode dtype: variable length,
> multiple encoding options, etc....
> - This is a fine idea that would support better text handling in
> numpy, and _maybe_ better interaction with external libraries (HDF, etc...)
>
> 2: Have a one-byte-per-char text dtype:
> - This would be much easier to implement and fit into the current numpy
> model, and satisfy a lot of common use cases for scientific data sets.
>
>
> We could certainly do both, but I'd like to see (2) get done sooner than
> later....
>

This is pretty much my sense of things at the moment. I think 1) is needed
in the long term but that 2) is a quick fix that solves most problems in
the short term.


>
> A related issue is whether numpy needs a dtype analogous to py3 bytes --
> I'm still not sure of the use-case there, so can't comment -- would it need
> to be fixed length (fitting into the numpy data model better) or variable
> length, or ??? Some folks are (apparently) using the current 'S' type in
> this way, but I think that's ripe for errors, due to the null bytes issue.
> Though maybe there is a null-bytes-are-special binary format that isn't
> text -- I have no idea.
>
> So what do we do with 'S'? It really is pretty broken, so we have a
> couple choices:
>
> (1) deprecate it, so that it stays around for backward compatibility
> but encourage people to either use 'U' for text, or one of the new dtypes
> that are yet to be implemented (maybe 's' for a one-byte-per-char dtype),
> and use either uint8 or the new bytes dtype that is yet to be implemented.
>
> (2) fix it -- in this case, I think we need to be clear what it is:
> -- A one-byte-char-text type? If so, it should map to a py3 string,
> and have a defined encoding (ascii or latin-1, probably), or even better a
> settable encoding (but only for one-byte-per-char encodings -- I don't
> think utf-8 is a good idea here, as a utf-8 encoded string is of unknown
> length. (there is some room for debate here, as the 'S' type is fixed
> length and truncates anyway, maybe it's fine for it to truncate utf-8 -- as
> long as it doesn't partially truncate in the middle of a character)
>
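
For concreteness, what partial truncation looks like -- '°' is two bytes
in utf-8, so a naive fixed-width cut can split it:

b = '25°C'.encode('utf-8')   # b'25\xc2\xb0C': 5 bytes for 4 characters
b[:3]                        # b'25\xc2' -- cut in the middle of '°'
b[:3].decode('utf-8')        # UnicodeDecodeError: unexpected end of data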

I think we should make it a one character encoded type compatible with str
in python 2, and maybe latin-1 in python 3. I'm thinking latin-1 because of
pep 393 where it is effectively a UCS-1, but ascii might be a bit more
flexible because it is a subset of utf-8 and might serve better in python 2.
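
Both properties are easy to check in plain python:

# latin-1 maps bytes 0-255 one-to-one onto code points U+0000..U+00FF
bytes(range(256)).decode('latin-1') == ''.join(chr(i) for i in range(256))  # True
# ascii is a strict subset of utf-8: ascii text encodes identically in both
'abc'.encode('ascii') == 'abc'.encode('utf-8')  # True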


> -- a bytes type? in which case, we should clean out all the
> automatic conversions to/from text that are in it now.
>
>
I'm not sure what to do about a bytes type.


> I vote for it being our one-byte text type -- it almost is already, and it
> would make the easiest transition for folks from py2 to py3. But backward
> compatibility is backward compatibility.
>
>
Not sure what to do here. It would be nice if S was a string type of given
encoding. Might be worth an experiment to see how much breaks.


> > numpy arrays need a decode and encode method
>
>
> I'm not sure that they do. Rather there needs to be a text dtype that
>> knows what encoding to use in order to have a binary interface as
>> exposed by .tostring() and friends and but produce unicode strings
>> when indexed from Python code. Having both a text and a binary
>> interface to the same data implies having an encoding.
>
>
> I agree with Oscar here -- let's not conflate encode and decoded data --
> the py3 text model is a fine one, we should work with it as much
> as practical.
>
> UNLESS: if we do add a bytes dtype, then it would be a reasonable use case
> to use it to store encoded text (just like the py3 bytes types), in which
> case it would be good to have encode() and decode() methods or ufuncs --
> probably ufuncs. But that should be for special purpose, at the I/O
> interface kind of stuff.
>
>
Chuck
Pauli Virtanen
2014-01-17 18:40:41 UTC
Permalink
17.01.2014 15:09, Aldcroft, Thomas wrote:
[clip]
> I've been playing around with porting a stack of analysis libraries
> to Python 3 and this is a very timely thread and comment. What I
> discovered right away is that all the string data coming from
> binary HDF5 files show up (as expected) as 'S' type,, but that
> trying to make everything actually work in Python 3 without
> converting to 'U' is a big mess of whack-a-mole.
>
> Yes, it's possible to change my libraries to use bytestring
> literals everywhere, but the Python 3 user experience becomes
> horrible because to interact with the data all downstream
> applications need to use bytestring literals everywhere. E.g.
> doing a simple filter like `string_array == 'foo'` doesn't work,
> and this will break all existing code when trying to run in Python
> 3. And every time you try to print something it has this horrible
> "b" in front. Ugly, and it just won't work well in the end.
[clip]

Ok, I see your point.

Having additional Unicode data types with smaller widths could be
useful. On Python 2, they would then be Unicode strings, right? Thanks
to Py2's automatic Unicode encoding/decoding, they might also be usable
in interactive use etc. on Py2.

Adding new data types in Numpy codebase takes some work, but it's
possible to do.

There's also an issue (as noted in the Github ticket) that
array([u'foo'], dtype=bytes) encodes silently via the ASCII codec.
This is probably not how it should be.
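
For example, on py3 (the exact error text may vary by version):

>>> np.array([u'foo'], dtype=bytes)    # silently encodes via ascii
array([b'foo'], dtype='|S3')
>>> np.array([u'Õscar'], dtype=bytes)  # non-ascii input raises instead
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' ...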

--
Pauli Virtanen
Pauli Virtanen
2014-01-17 12:17:28 UTC
Permalink
Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
[clip]
> For backward compatibility we *cannot* change S.
> Maybe we could change the meaning of 'a' but it would be safer
> to add a new dtype, possibly 'S' can be deprecated in favor
> of 'B' when we have a specific encoding dtype.

Note that the rename 'S' -> 'B' was not done in the Python 3 port,
because 'B' already denotes uint8,

>>> np.array([1], dtype='B')
array([1], dtype=uint8)

--
Pauli Virtanen
Chris Barker
2014-01-17 20:02:52 UTC
Permalink
On Fri, Jan 17, 2014 at 1:38 AM, Julian Taylor <
***@googlemail.com> wrote:

>
> This thread is getting a little out of hand which is my fault for
> initially mixing different topics in one mail,
>

still a bit mixed ;-) -- but I think the loadtxt issue requires a lot
less discussion, so we're OK there.

There have been a lot of notes here since I last commented, so I'm going
to stick with the loadtxt issues in this note:

> - no possibility to specify the encoding of a file in loadtxt
> this is a missing feature, currently it uses the system default which is
> good and should stay that way.
>

I disagree -- I think using the "system encoding" is a bad idea for a
default -- I certainly am far more likely to get data files from some other
system than my own -- and really unlikely to use the "system encoding" for
any data files I write, either.

And I'm not being english-centered here -- my data files commonly do have
non-ascii content in there, though frankly, they are either a mess or I know
the encoding.

What should be the default?

latin-1

Why? Despite our desire to be non-english-focused, most of what loadtxt
does is parse files for numbers, maybe with a bit of text. Numbers are
virtually always ascii-compatible (am I wrong about that? -- if so you'd
damn well better know your encoding!). So it should be an ascii-compatible
encoding.

Why not ascii? -- because then it would barf on non-ascii text in the file
-- really bad idea there.

Why not utf-8? -- that is being *nix-centric -- utf-8 will work fine on
ascii, but corrupt non-ascii, non-utf-8 data (i.e. any other encoding) and
may barf on some of it too (not sure about that).

latin-1 will never barf on any binary data, will successfully parse any
numeric data (plus spaces, commas, etc.), and will preserve the bytes of any
non-ascii content in the file.
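
Since loadtxt itself has no encoding option, a sketch of that approach in
plain python (the file contents here are made up):

import numpy as np

raw = b'1.0 2.5 caf\xe9\n3.0 4.5 na\xefve\n'  # numbers plus 1-byte text
rows = [line.split() for line in raw.decode('latin-1').splitlines()]
nums = np.array([r[:2] for r in rows], dtype=float)  # parsing still works
labels = [r[2] for r in rows]  # non-ascii survives, even if it prints oddly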

If you can set the encoding it's not a huge deal what the default is, but I
will recommend that everyone always either sets it to a known encoding or
uses latin-1 -- never the system encoding.

One more point: on my system right now:

In [15]: sys.getdefaultencoding()
Out[15]: 'ascii'

please don't make loadtxt start barfing on files I've been reading just
fine for years....

> It is only missing an option to tell it to treat it differently.
> There should be little debate about changing the default, especially not
> using latin1. The system default exists for a good reason.
>

Maybe, maybe not, but I submit that whatever that "good reason" is, it does
not apply here! This is kind of like datetime64 using the locale timezone
-- it makes it useless!


> Note on linux it is UTF-8 which is a good choice. I'm not familiar with
> windows but all programs should at least have the option to use UTF-8 as
> output too.
>

should, yes, so, maybe, but:

a) not all text data files are written recently or by recently updated
software.

b) This is kind of like saying we should have loadtxt default to utf-8,
which wouldn't be the worst idea -- better than system default, but still
not as good as latin-1

This is a simple question: Should the exact same file read fine with the
exact same code on one machine, but not another? I don't think so.

> This has nothing to do with indexing or any kind of processing of the numpy
> arrays.
>

agreed.

-Chris


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Oscar Benjamin
2014-01-16 00:06:22 UTC
Permalink
On 15 January 2014 12:38, Julian Taylor <***@googlemail.com> wrote:
> On 01/15/2014 11:25 AM, Daπid wrote:
>> On 15 January 2014 11:12, Hedieh Ebrahimi <***@amphos21.com
>> <mailto:***@amphos21.com>> wrote:
>>
>> I try to print my fileContent array after I read it and it looks
>> like this :
>>
>> ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
>> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
>> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>>
>> Why is this happening and how can I prevent it ?
>> Also if I have a line that starts like this in my file, python will
>> crash on me. how can i fix this ?
>>
>>
>> What is wrong with this case? If you are concerned about the multiple
>> backslashes, they are there because they are special symbols, and so
>> they have to be escaped (you actually want a backslash, not whatever
>> else they could mean).
>>
>
> you have the bytes representation and a duplicate slash in it.
> Its due to unicode strings in python3.

So why does the array store the repr of a bytes string?

Surely that's just a loadtxt bug and no one is actually depending on
that behaviour.


Oscar
Chris Barker
2014-01-15 23:42:35 UTC
Permalink
bump back to the OP:
On Wed, Jan 15, 2014 at 2:12 AM, Hedieh Ebrahimi <
***@amphos21.com> wrote:

> fileContent=loadtxt(filePath,dtype=str)
>

do either of these work for you?

fileContent=loadtxt(filePath,dtype='S')

or

fileContent=loadtxt(filePath,dtype=np.unicode)

-Chris


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Julian Taylor
2014-01-15 23:58:25 UTC
Permalink
On 16.01.2014 00:42, Chris Barker wrote:
> bump back to the OP:
> On Wed, Jan 15, 2014 at 2:12 AM, Hedieh Ebrahimi
> <***@amphos21.com <mailto:***@amphos21.com>> wrote:
>
> fileContent=loadtxt(filePath,dtype=str)
>
>
> do either of these work for you?
>
> fileContent=loadtxt(filePath,dtype='S')

this gives you bytes, not a string; this can only be fixed by adding new
dtypes -- see the other thread about that.

>
> or
>
> fileContent=loadtxt(filePath,dtype=np.unicode)
>

Same as using python str: you get the output originally posted, a bytes
representation with duplicated slashes.
This is a bug in loadtxt we need to fix, independent of adding new dtypes.
It is also independent of the encoding of the text file; loadtxt doesn't
seem to be able to open encodings other than ascii/utf8 at all and has
no option to tell it what the file is.

as mentioned in my earlier mail this works for ascii:

np.loadtxt('test.txt',dtype=bytes).astype(str)

or of course looping and decoding explicitly.
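
A sketch of the explicit loop, assuming an ascii/utf8 file:

import numpy as np

with open('test.txt', 'rb') as f:
    lines = [line.strip().decode('utf-8') for line in f]
arr = np.array(lines)   # 'U' dtype on py3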
Chris Barker
2014-01-16 01:10:07 UTC
Permalink
On Wed, Jan 15, 2014 at 3:58 PM, Julian Taylor <
***@googlemail.com> wrote:

> > fileContent=loadtxt(filePath,dtype='S')
>
> this gives you bytes not a string, this can only be fixed by adding new
> dtypes,


or changing the behavior of dtype 'S' -- but yes, the other thread.

But the OP's problem was not that they got bytes, but that the content was
wrong -- they got the repr of bytes in a py3 string.


> Same as using python str: you get the output originally posted, a bytes
> representation with duplicated slashes.
> This is a bug in loadtxt we need to fix, independent of adding new dtypes.
>

yup.


> It is also independent of the encoding of the text file; loadtxt doesn't
> seem to be able to open encodings other than ascii/utf8 at all and has
> no option to tell it what the file is.
>

a key missing feature -- and I doubt it does utf-8 right, either.

as mentioned in my earlier mail this works for ascii:
>
> np.loadtxt('test.txt',dtype=bytes).astype(str)
>

thanks -- I wasn't sure what astype would do for that. And what are you
getting then, unicode or ascii?

Thanks,
-Chris



--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov