Discussion:
[Numpy-discussion] A one-byte string dtype?
Chris Barker
2014-01-17 22:30:19 UTC
Folks,

I've been blathering away on the related threads a lot -- sorry if it's too
much. It's gotten a bit tangled up, so I thought I'd start a new one to
address this one question (i.e. don't bring up genfromtxt here):

Would it be a good thing for numpy to have a one-byte-per-character string
type?

We did have that with the 'S' type in py2, but the changes in py3 have made
it not quite the right thing. And it appears that enough people use 'S' in
py3 to mean 'bytes', so that we can't change that now.

The only difference may be that 'S' currently auto translates to a bytes
object, resulting in things like:

np.array(['some text',], dtype='S')[0] == 'some text'

yielding False on Py3. And you can't do all the usual text stuff with the
resulting bytes object, either. (and it probably used the default encoding
to generate the bytes, so will barf on some inputs, though that may be
unavoidable.) So you need to decode the bytes that are given back, and now
that I think about it, I have no idea what encoding you'd need to use in
the general case.
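
For concreteness, here's roughly what that looks like today (an illustrative
session; exact reprs vary a bit between numpy versions):

>>> import numpy as np
>>> a = np.array(['some text'], dtype='S')
>>> a[0]                      # comes back as bytes on py3
b'some text'
>>> a[0] == 'some text'       # bytes never compare equal to str
False
>>> a[0].decode('ascii')      # you have to pick an encoding yourself
'some text'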

So the correct solution is (particularly on py3) to use the 'U' (unicode)
dtype for text in numpy arrays.

However, the 'U' dtype is 4 bytes per character, and that may be "too big"
for some use-cases. And there is a lot of text in scientific data sets that
is pure ascii, or at least in some 1-byte-per-character encoding.
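
To put rough numbers on it (three 5-character strings, sizes as numpy
reports them):

>>> import numpy as np
>>> np.array(['alpha', 'gamma', 'delta']).nbytes     # dtype '<U5': 3 * 5 * 4 bytes
60
>>> np.array([b'alpha', b'gamma', b'delta']).nbytes  # dtype '|S5': 3 * 5 * 1 bytes
15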

So, in the spirit of having multiple numeric types that use different
amounts of memory, and can hold different ranges of values, a
one-byte-per-character dtype would be nice:

(note, this opens the door for a 2-byte-per-character (UCS-2) dtype too; I
personally don't think that's worth it, but maybe that's because I'm an
English speaker...)

It could use the 's' (lower-case s) type identifier.

For passing to/from python built-in objects, it would

* Allow either Python bytes objects or Python unicode objects as input
a) bytes objects would be passed through as-is
b) unicode objects would be encoded as latin-1

[note: I'm not entirely sure that bytes objects should be allowed, but it
would provide a nice efficiency gain in a fairly common case]

* It would create python unicode text objects, decoded as latin-1.

Could we have a way to specify another encoding? I'm not sure how that
would fit into the dtype system.

I've explained the latin-1 thing on other threads, but the short version is:

- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you a UnicodeDecodeError, regardless of what
arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through an encoding/decoding
round trip.

(it still wouldn't allow you to store arbitrary unicode -- but that's the
limitation of one-byte per character...)
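
To see why latin-1 is the safe choice for that, here's the round-trip
property in plain Python (nothing numpy-specific):

>>> raw = bytes(range(256))      # every possible byte value
>>> raw.decode('latin-1').encode('latin-1') == raw
True
>>> raw.decode('ascii')          # any stricter 1-byte assumption can blow up
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 128: ordinal not in range(128)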

So:

Bad idea all around: shut up already!

or

Fine idea, but who's going to write the code? not me!

or

We really should do this.

(of course, with the options of amending the above not-very-fleshed out
proposal)

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Aldcroft, Thomas
2014-01-17 23:05:16 UTC
Post by Chris Barker
Would it be a good thing for numpy to have a one-byte-per-character
string type?
<snip>
Bad idea all around: shut up already!
or
Fine idea, but who's going to write the code? not me!
or
We really should do this.
As evident from what I said in the previous thread, YES, this should really
be done!

One important feature would be changing the dtype from 'S' to 's' without
any memory copies, so that conversion would be very cheap. Maybe this
would essentially come for free with something like astype('s', copy=False).
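
Something like this sketch, using today's .view() as a stand-in for the
hypothetical 's' dtype (same itemsize, same buffer, nothing copied):

>>> import numpy as np
>>> a = np.array([b'foo', b'barbaz'], dtype='S6')
>>> b = a.view('S6')       # imagine a.view('s6') or a.astype('s6', copy=False)
>>> b.base is a            # b shares a's memory; no data was copied
True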

- Tom
Oscar Benjamin
2014-01-20 10:11:15 UTC
Post by Chris Barker
Would it be a good thing for numpy to have a one-byte-per-character string
type?
If you mean a string type that can only hold latin-1 characters then I think
that this is a step backwards.

If you mean a dtype that holds bytes in a known, specifiable encoding and
automatically decodes them to unicode strings when you call .item() and has a
friendly repr() then that may be a good idea.

So for example you could have dtype='S:utf-8' which would store strings
as utf-8 encoded bytes:

>>> text = array(['foo', 'bar'], dtype='S:utf-8')
>>> text
array(['foo', 'bar'], dtype='|S3:utf-8')
>>> print(text)
['foo', 'bar']
>>> text[0]
'foo'
>>> text.nbytes
6
Post by Chris Barker
We did have that with the 'S' type in py2, but the changes in py3 have made
it not quite the right thing. And it appears that enough people use 'S' in
py3 to mean 'bytes', so that we can't change that now.
It wasn't really the right thing before either. That's why Python 3 has
changed all of this.
Post by Chris Barker
The only difference may be that 'S' currently auto translates to a bytes
np.array(['some text',], dtype='S')[0] == 'some text'
yielding False on Py3. And you can't do all the usual text stuff with the
resulting bytes object, either. (and it probably used the default encoding
to generate the bytes, so will barf on some inputs, though that may be
unavoidable.) So you need to decode the bytes that are given back, and now
that I think about it, I have no idea what encoding you'd need to use in
the general case.
You should let the user specify the encoding or otherwise require them to use
the 'U' dtype.
Post by Chris Barker
So the correct solution is (particularly on py3) to use the 'U' (unicode)
dtype for text in numpy arrays.
Absolutely. Embrace the Python 3 text model. Once you understand the how, what
and why of it you'll see that it really is a good thing!
Post by Chris Barker
However, the 'U' dtype is 4 bytes per character, and that may be "too big"
for some use-cases. And there is a lot of text in scientific data sets that
are pure ascii, or at least some 1-byte-per-character encoding.
So, in the spirit of having multiple numeric types that use different
amounts of memory, and can hold different ranges of values, a one-byte-per-character dtype would be nice:
(note, this opens the door for a 2-byte per (UCS-2) dtype too, I personally
don't think that's worth it, but maybe that's because I'm an english
speaker...)
You could just use a 2-byte encoding with the S dtype e.g.
dtype='S:utf-16-le'.
Post by Chris Barker
It could use the 's' (lower-case s) type identifier.
For passing to/from python built-in objects, it would
* Allow either Python bytes objects or Python unicode objects as input
a) bytes objects would be passed through as-is
b) unicode objects would be encoded as latin-1
[note: I'm not entirely sure that bytes objects should be allowed, but it
would provide an nice efficiency in a fairly common case]
I think it would be a bad idea to accept bytes here. There are good reasons
that Python 3 creates a barrier between the two worlds of text and bytes.
Allowing implicit mixing of bytes and text is a recipe for mojibake. The
TypeErrors in Python 3 are used to guard against conceptual errors that lead
to data corruption. Attempting to undermine that barrier in numpy would be a
backward step.

I apologise if this is misplaced but there seems to be an attitude that
scientific programming isn't really affected by the issues that have led to
the Python 3 text model. I think that's ridiculous; data corruption is a
problem in scientific programming just as it is anywhere else.
Post by Chris Barker
* It would create python unicode text objects, decoded as latin-1.
Don't try to bless a particular encoding and stop trying to pretend that it's
possible to write a sensible system where end users don't need to worry about
and specify the encoding of their data.
Post by Chris Barker
Could we have a way to specify another encoding? I'm not sure how that
would fit into the dtype system.
If the encoding cannot be specified then the whole idea is misguided.
Post by Chris Barker
- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you an UnicodeEncodeError regardless of what
arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through a encoding/decoding
operation.
... text = numpy.fromfile(fin, dtype='s')
... text[0]  # Decodes as latin-1 leading to mojibake.

... text = numpy.fromfile(fin, dtype='s:utf-8')

There's really no way to get around the fact that users need to specify the
encoding of their text files.
Post by Chris Barker
(it still wouldn't allow you to store arbitrary unicode -- but that's the
limitation of one-byte per character...)
You could if you use 'utf-8'. It would be one-byte-per-char for text that only
contains ascii characters. However it would still support every character that
the unicode consortium can dream up.
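
For example (plain Python, just counting encoded bytes):

>>> len('numpy'.encode('utf-8'))    # pure ascii: 1 byte per character
5
>>> len('naïve'.encode('utf-8'))    # 'ï' takes 2 bytes
6
>>> len('δ'.encode('utf-8'))        # so does a Greek delta
2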

The only possible advantage here is as a memory optimisation (potentially
having a speed impact too although it could equally be a speed regression).
Otherwise it just adds needless complexity to numpy and to the code that uses
the new dtype as well as limiting its ability to handle unicode.

How significant are the performance issues? Does anyone really use numpy for
this kind of text handling? If you really are operating on gigantic text
arrays of ascii characters then is it so bad to just use the bytes dtype and
handle decoding/encoding at the boundaries? If you're not operating on
gigantic text arrays is there really a noticeable problem just using the 'U'
dtype?


Oscar
Aldcroft, Thomas
2014-01-20 15:00:55 UTC
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
Post by Oscar Benjamin
<snip>
How significant are the performance issues? Does anyone really use numpy for
this kind of text handling? If you really are operating on gigantic text
arrays of ascii characters then is it so bad to just use the bytes dtype and
handle decoding/encoding at the boundaries? If you're not operating on
gigantic text arrays is there really a noticeable problem just using the 'U'
dtype?
I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.

As discussed in the previous parent thread, using the bytes dtype is really
a problem because users of a text array want to do things like filtering
(`match_rows = text_array == 'match'`), printing, or other manipulations in
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere. I tried converting a few packages while
leaving the arrays as bytestrings and it just ended up as a very big mess.
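
A small illustration of the kind of friction I mean (the exact result and
warning for the mismatched comparison vary by numpy version, which is part
of the problem):

>>> import numpy as np
>>> names = np.array([b'alpha', b'beta', b'gamma'])    # dtype '|S5' on py3
>>> names == 'beta'        # the natural spelling finds nothing: bytes != str
>>> names == b'beta'       # works, but forces b'' literals all over the code
array([False,  True, False])
>>> np.char.decode(names, 'ascii') == 'beta'   # or decode everything (copies)
array([False,  True, False])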
Oscar Benjamin
2014-01-20 15:40:42 UTC
Post by Aldcroft, Thomas
Post by Oscar Benjamin
How significant are the performance issues? Does anyone really use numpy for
this kind of text handling?
<snip>
I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really
a problem because users of a text array want to do things like filtering
(`match_rows = text_array == 'match'`), printing, or other manipulations in
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere. I tried converting a few packages while
leaving the arrays as bytestrings and it just ended up as a very big mess.
And why are you needing to write .decode('ascii') everywhere?
Aldcroft, Thomas
2014-01-20 17:12:06 UTC
Post by Oscar Benjamin
Post by Aldcroft, Thomas
<snip>
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere.
And why are you needing to write .decode('ascii') everywhere?

print("The first value is {}".format(bytestring_array[0]))

On Python 2 this gives "The first value is string_value", while on
Python 3 this gives "The first value is b'string_value'".
Charles R Harris
2014-01-20 17:21:27 UTC
Post by Aldcroft, Thomas
<snip>
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on
Python 3 this gives "The first value is b'string_value'".

As Nathaniel has mentioned, this is a known problem with Python 3 and the
developers are trying to come up with a solution. Python 3.4 solves some
existing problems, but this one remains. It's not just numpy here, it's
that python itself needs to provide some help.
Oscar Benjamin
2014-01-20 18:40:32 UTC
Post by Charles R Harris
<snip>
As Nathaniel has mentioned, this is a known problem with Python 3 and the
developers are trying to come up with a solution. Python 3.4 solves some
existing problems, but this one remains. It's not just numpy here, it's
that python itself needs to provide some help.

If you think that anything in core Python will change so that you can mix
text and bytes as above then I think you are very much mistaken. If you're
referring to PEP 460/461 then you have misunderstood the purpose of those
PEPs. The authors and reviewers will carefully ensure that nothing changes
to make the above work the way that it did in 2.x.

Oscar
Charles R Harris
2014-01-20 20:34:56 UTC
Post by Oscar Benjamin
<snip>
If you think that anything in core Python will change so that you can mix
text and bytes as above then I think you are very much mistaken. If you're
referring to PEP 460/461 then you have misunderstood the purpose of those
PEPs. The authors and reviewers will carefully ensure that nothing changes
to make the above work the way that it did in 2.x.
I think we may want something like PEP 393
(http://www.python.org/dev/peps/pep-0393/).
The S datatype may be the wrong place to look, we might want a modification
of U instead so as to transparently get the benefit of python strings.

Chuck
Oscar Benjamin
2014-01-20 21:27:48 UTC
Post by Charles R Harris
I think we may want something like PEP 393. The S datatype may be the
wrong place to look, we might want a modification of U instead so as to
transparently get the benefit of python strings.

The approach taken in PEP 393 (the FSR) makes more sense for str than it
does for numpy arrays for two reasons: str is immutable and opaque.

Since str is immutable the maximum code point in the string can be
determined once when the string is created before anything else can get a
pointer to the string buffer.

Since it is opaque no one can rightly expect it to expose a particular
binary format so it is free to choose without compromising any expected
semantics.

If someone can call buffer on an array then the FSR is a semantic change.

If a numpy 'U' array used the FSR and consisted only of ASCII characters
then it would have a one byte per char buffer. What then happens if you put
a higher code point in? The buffer needs to be resized and the data copied
over. But then what happens to any buffer objects or array views? They
would be pointing at the old buffer from before the resize. Subsequent
modifications to the resized array would not show up in other views and
vice versa.
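
To make the aliasing concrete with today's fixed 4-bytes-per-code-point 'U'
buffer (view semantics assumed to stay as they are):

>>> import numpy as np
>>> a = np.array(['abc', 'xyz'], dtype='U3')
>>> v = a.view(np.uint32)    # another view onto exactly the same buffer
>>> a[0] = 'zzz'
>>> v[:3]                    # the view sees the change because memory is shared
array([122, 122, 122], dtype=uint32)

Under a PEP-393-style layout the buffer width would have to change as soon as
a higher code point was stored, leaving views like v pointing at stale memory.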

I don't think that this can be done transparently since users of a numpy
array need to know about the binary representation. That's why I suggest a
dtype that has an encoding. Only in that way can it consistently have both
a binary and a text interface.

Oscar
Charles R Harris
2014-01-20 22:28:09 UTC
Post by Oscar Benjamin
<snip>
I don't think that this can be done transparently since users of a numpy
array need to know about the binary representation. That's why I suggest a
dtype that has an encoding. Only in that way can it consistently have both
a binary and a text interface.
I didn't say we should change the S type, but that we should have
something, say 's', that appeared to python as a string. I think if we want
transparent string interoperability with python together with a compressed
representation, and I think we need both, we are going to have to deal with
the difficulties of utf-8. That means raising errors if the string doesn't
fit in the allotted size, etc. Mind, this is a workaround for the mass of
ascii data that is already out there, not a substitute for 'U'.
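
The awkward part in one line (plain Python; the fixed 3-byte utf-8 field is
of course hypothetical):

>>> len('abc'.encode('utf-8'))    # 3 characters -> 3 bytes, fits a 3-byte slot
3
>>> len('été'.encode('utf-8'))    # 3 characters -> 5 bytes, would have to raise
5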

Chuck
Nathaniel Smith
2014-01-20 22:35:12 UTC
Post by Charles R Harris
<snip>
I didn't say we should change the S type, but that we should have something,
say 's', that appeared to python as a string. I think if we want transparent
string interoperability with python together with a compressed
representation, and I think we need both, we are going to have to deal with
the difficulties of utf-8. That means raising errors if the string doesn't
fit in the allotted size, etc. Mind, this is a workaround for the mass of
ascii data that is already out there, not a substitute for 'U'.
If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...
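
For anyone who wants the workaround now, the object-array version looks like
this (each element is an ordinary Python str: variable length and full
unicode, at the cost of one heap object per element):

>>> import numpy as np
>>> names = np.array(['alpha', 'β', 'a much longer string'], dtype=object)
>>> names == 'β'       # comparisons go through Python str, so they just work
array([False,  True, False])
>>> names.nbytes       # counts only the 8-byte pointers, not the strings
24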

Though, IMO any new dtype here would need a cleanup of the dtype code
first so that it doesn't require yet more massive special cases all
over umath.so.

-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
Charles R Harris
2014-01-20 22:58:26 UTC
Post by Nathaniel Smith
<snip>
If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...
Though, IMO any new dtype here would need a cleanup of the dtype code
first so that it doesn't require yet more massive special cases all
over umath.so.
Worth thinking about. As another alternative, what is the minimum we need
to make a restricted encoding, say latin-1, appear transparently as a
unicode string to python? I know the python folks don't like this much, but
I suspect something along that line will eventually be required for the
http folks.

Chuck
Charles R Harris
2014-01-20 23:12:20 UTC
Post by Nathaniel Smith
<snip>
If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...
The more I think about it, the more I think we may need to do that. Note
that dynd has ragged arrays and I think they are implemented as pointers to
buffers. The easy way for us to do that would be a specialization of object
arrays to string types only as you suggest.

<snip>

Chuck
Oscar Benjamin
2014-01-21 11:13:36 UTC
Post by Charles R Harris
<snip>
The more I think about it, the more I think we may need to do that. Note
that dynd has ragged arrays and I think they are implemented as pointers to
buffers. The easy way for us to do that would be a specialization of object
arrays to string types only as you suggest.
This wouldn't necessarily help for the gigarows of short text strings use case
(depending on what "short" means). Also even if it technically saves memory
you may have a greater overhead from fragmenting your array all over the heap.

On my 64 bit Linux system the size of a Python 3.3 str containing only ASCII
characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory
saving over dtype='U' only if the strings are 17 characters or more. To get a
50% saving over dtype='U' you'd need strings of at least 49 characters.
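
Easy to check on a 64-bit CPython (the 49-byte constant is as of Python 3.3;
it has shrunk slightly in later releases):

>>> import sys, numpy as np
>>> sys.getsizeof('x' * 20)                     # 49 + 20
69
>>> np.array(['x' * 20], dtype='U20').nbytes    # 4 * 20
80

An object array additionally pays an 8-byte pointer per element on top of the
per-str overhead.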

If the Numpy array would manage the buffers itself then that per string memory
overhead would be eliminated in exchange for an 8 byte pointer and at least 1
byte to represent the length of the string (assuming you can somehow use
Pascal strings when short enough - null bytes cannot be used). This gives an
overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory
if the strings are more than 3 characters long and you get at least a 50%
saving for strings longer than 9 characters.

Using utf-8 in the buffers eliminates the need to go around checking maximum
code points etc. so I would guess that would be simpler to implement (CPython
has now had to triple all of its code paths that actually access the string
buffer).


Oscar
Nathaniel Smith
2014-01-21 11:41:30 UTC
Post by Oscar Benjamin
If the Numpy array would manage the buffers itself then that per string memory
overhead would be eliminated in exchange for an 8 byte pointer and at least 1
byte to represent the length of the string (assuming you can somehow use
Pascal strings when short enough - null bytes cannot be used). This gives an
overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory
if the strings are more than 3 characters long and you get at least a 50%
saving for strings longer than 9 characters.
There are various optimisations possible as well.

For ASCII strings of up to length 8, one could also use tagged pointers to
eliminate the lookaside buffer entirely. (Alignment rules mean that
pointers to allocated buffers always have the low bits zero; so you can
make a rule that if the low bit is set to one, then this means the
"pointer" itself should be interpreted as containing the string data; use
the spare bit in the other bytes to encode the length.)
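
A rough sketch of the packing trick in Python, purely for illustration -- in
numpy this would be a few lines of C working directly on the 8-byte element;
this simplified variant inlines up to 7 ASCII bytes and keeps the length in
the tag byte:

def pack_short_ascii(s):
    data = s.encode('ascii')
    assert len(data) <= 7
    word = 1 | (len(data) << 1)         # low byte: tag bit set + length
    for i, b in enumerate(data):
        word |= b << (8 * (i + 1))      # string bytes live in the upper 7 bytes
    return word                         # fits in one 64-bit "pointer" slot

def unpack(word):
    if word & 1:                        # tag bit set: the string is stored inline
        n = (word >> 1) & 0x7f
        return bytes((word >> (8 * (i + 1))) & 0xff for i in range(n)).decode('ascii')
    # tag bit clear: an ordinary aligned pointer into the lookaside buffer
    raise NotImplementedError("dereference the lookaside buffer here")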

In some cases it may also make sense to let identical strings share
buffers, though this adds some overhead for reference counting and
interning.

-n
Oscar Benjamin
2014-01-21 12:30:08 UTC
Post by Nathaniel Smith
There are various optimisations possible as well.
For ASCII strings of up to length 8, one could also use tagged pointers to
eliminate the lookaside buffer entirely. (Alignment rules mean that
pointers to allocated buffers always have the low bits zero; so you can
make a rule that if the low bit is set to one, then this means the
"pointer" itself should be interpreted as containing the string data; use
the spare bit in the other bytes to encode the length.)
In some cases it may also make sense to let identical strings share
buffers, though this adds some overhead for reference counting and
interning.
Would this new dtype have an opaque memory representation? What would happen
with something like:

... a = numpy.array(['CGA', 'GAT'], dtype='s')
... memoryview(a)
... a.tofile(fout)
... a = numpy.fromfile(fin, dtype='s')

Should there be a different function for creating such an array from reading a
text file:

... a = numpy.fromiter(fin, dtype='s')
... fout.writelines(line + '\n' for line in a)

(Note that the above would not be reversible if the strings contain newlines)

Would it be less confusing to use dtype='u' rather than dtype='s', in order to
signify that it is an optimised form of the 'U' dtype as far as access from
Python code is concerned? Calling it 's' only really makes sense if there is a
plan to deprecate dtype='S'.

How would it behave in Python 2? Would it return unicode strings there as
well?


Oscar
Aldcroft, Thomas
2014-01-21 12:54:21 UTC
Post by Charles R Harris
<snip>
The more I think about it, the more I think we may need to do that. Note
that dynd has ragged arrays and I think they are implemented as pointers to
buffers. The easy way for us to do that would be a specialization of object
arrays to string types only as you suggest.
Is this approach intended to be in *addition to* the latin-1 "s" type
originally proposed by Chris, or *instead of* that?

- Tom
Charles R Harris
2014-01-21 13:55:29 UTC
Post by Aldcroft, Thomas
<snip>
Is this approach intended to be in *addition to* the latin-1 "s" type
originally proposed by Chris, or *instead of* that?
Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.

Chuck
Aldcroft, Thomas
2014-01-21 14:37:11 UTC
Post by Charles R Harris
<snip>
Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.
Since it's open for discussion, I'll put in my vote for implementing the
easier latin-1 version in the short term to facilitate Python 2 / 3
interoperability. This would solve my use-case (giga-rows of short fixed
length strings), and presumably allow things like memory mapping of large
data files (like for FITS files in astropy.io.fits).
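
To put rough numbers on that use case (sizes here are illustrative only):

    import numpy as np

    n = 10**6                        # scale up to giga-rows as needed
    s8 = np.zeros(n, dtype='S8')     # 1 byte per character
    u8 = np.zeros(n, dtype='U8')     # 4 bytes per character (UCS-4)
    print(s8.nbytes)                 #  8000000
    print(u8.nbytes)                 # 32000000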

I don't have a clue how the current 'U' dtype works under the hood, but
from my user perspective it seems to work just fine in terms of interacting
with Python 3 strings. Is there a technical problem with doing basically
the same thing for an 's' dtype, but using latin-1 instead of UCS-4?

Thanks,
Tom
Post by Charles R Harris
Chuck
Charles R Harris
2014-01-21 14:48:11 UTC
Permalink
On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas <
On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris <
<snip>
Since it's open for discussion, I'll put in my vote for implementing the
easier latin-1 version in the short term to facilitate Python 2 / 3
interoperability. This would solve my use-case (giga-rows of short fixed
length strings), and presumably allow things like memory mapping of large
data files (like for FITS files in astropy.io.fits).
I don't have a clue how the current 'U' dtype works under the hood, but
from my user perspective it seems to work just fine in terms of interacting
with Python 3 strings. Is there a technical problem with doing basically
the same thing for an 's' dtype, but using latin-1 instead of UCS-4?
I think there is a technical problem. We may be able to masquerade latin-1
as utf-8 for some subset of characters, or fool python 3 in some other way.
But in any case, I think it needs some research to see what the
possibilities are.
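
One quick way to see where a masquerade runs into trouble: latin-1 and
utf-8 only agree on the ASCII range, so any latin-1 byte >= 0x80 is not
valid utf-8 on its own:

    print(b'ascii'.decode('latin-1') == b'ascii'.decode('utf-8'))  # True

    data = u'\xd5scar'.encode('latin-1')   # b'\xd5scar'
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        print('not valid utf-8: %s' % exc)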

Chuck
Sebastian Berg
2014-01-21 15:10:01 UTC
Permalink
Post by Charles R Harris
<snip>
I think there is a technical problem. We may be able to masquerade latin-1
as utf-8 for some subset of characters, or fool python 3 in some other way.
But in any case, I think it needs some research to see what the
possibilities are.
I am not quite sure, but shouldn't it even be possible to tag an encoding
onto the metadata of the string dtype, and allow it to be set to any
1-byte-wide encoding that python understands? If the metadata is not None,
all entry points to and from the array (Object->string, string->Object
conversions) would then decode or encode using the usual python string
machinery.

Of course it would still be a lot of work, since the string comparisons
would need to know about comparing different encodings, dtype equivalence
would be wrong, and all the conversions would need to be carefully checked.
Most string tools probably don't care about the encoding as long as it is a
fixed 1-byte width, though one would have to check that they don't lose the
encoding information by creating a new "S" array...
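
A sketch of that idea using the (little-known) dtype metadata argument --
nothing in numpy acts on this metadata today, so this only shows where the
information could live; whether it survives every operation is exactly the
kind of thing that would need auditing:

    import numpy as np

    # numpy itself ignores this metadata; it is just carried on the dtype.
    dt = np.dtype('S16', metadata={'encoding': 'latin-1'})
    print(dt.metadata)                    # {'encoding': 'latin-1'}

    a = np.array([b'\xd5scar'], dtype=dt)
    enc = (a.dtype.metadata or {}).get('encoding', 'ascii')
    print(a[0].decode(enc))               # decode at the boundary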

- Sebastian
Post by Charles R Harris
Chuck
Oscar Benjamin
2014-01-21 14:43:31 UTC
Permalink
Post by Charles R Harris
Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4
On Python 2, unicode strings can operate transparently with byte strings:

$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array([u'\xd5scar'], dtype='U')
>>> a
array([u'\xd5scar'],
      dtype='<U5')
>>> a[0]
u'\xd5scar'
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print(a[0]) # Encodes as 'utf-8'
Õscar
>>> 'My name is %s' % a[0] # Decodes as ASCII
u'My name is \xd5scar'
>>> print('My name is %s' % a[0]) # Encodes as UTF-8
My name is Õscar

This is no better or worse than the rest of the Py2 text model. So if the new
dtype always returns a unicode string under Py2 it should work (as well as the
Py2 text model ever does).
Post by Charles R Harris
and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.
What do you mean by this? PEP 393 uses UCS-1/2/4 not utf-8/16/32 i.e. it
always uses a fixed-width encoding.

You can just use the CPython C-API to create the unicode strings. The
simplest way is probably to use utf-8 internally and then call
PyUnicode_DecodeUTF8 and PyUnicode_EncodeUTF8 at the boundaries. This
should work fine on Python 2.x and 3.x. It obviates any need to think about
pre-3.3 narrow and wide builds and post-3.3 FSR formats.
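
In Python-level terms the boundary convention would look something like the
sketch below (here the existing 'S' dtype stands in for the proposed utf-8
storage; a real implementation would do the equivalent with the C calls
above):

    import numpy as np

    # Store utf-8 bytes in a fixed-width field; decode only when an element
    # crosses the boundary back into Python.
    raw = np.array([u'\xd5scar'.encode('utf-8'), b'foo'], dtype='S8')
    print(raw[0].decode('utf-8'))          # decode on the way out
    raw[1] = u'na\xefve'.encode('utf-8')   # encode on the way in (6 bytes, fits)
    print(raw[1].decode('utf-8'))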

Unlike Python's str there isn't much need to be able to efficiently slice or
index within the string array element. Indexing into the array to get the
string requires creating a new object, so you may as well just decode from
utf-8 at that point [it's big-O(num chars) either way]. There's no need to
constrain it to fixed-width encodings like the FSR in which case utf-8 is
clearly the best choice as:

1) It covers the whole unicode spectrum.
2) It uses 1 byte-per-char for ASCII.
3) UTF-8 is a big optimisation target for CPython (so it's fast).


Oscar
j***@gmail.com
2014-01-20 20:13:08 UTC
Permalink
On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin
Post by Aldcroft, Thomas
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
Post by Oscar Benjamin
How significant are the performance issues? Does anyone really use
numpy
for
this kind of text handling? If you really are operating on gigantic text
arrays of ascii characters then is it so bad to just use the bytes
dtype
and
handle decoding/encoding at the boundaries? If you're not operating on
gigantic text arrays is there really a noticeable problem just using
the
'U'
dtype?
I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really
a problem because users of a text array want to do things like filtering
(`match_rows = text_array == 'match'`), printing, or other manipulations in
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere. I tried converting a few packages while
leaving the arrays as bytestrings and it just ended up as a very big mess.
Charles R Harris
2014-01-20 17:17:21 UTC
Permalink
On Mon, Jan 20, 2014 at 8:00 AM, Aldcroft, Thomas <
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <
Post by Chris Barker
Post by Chris Barker
Folks,
I've been blathering away on the related threads a lot -- sorry if it's
too
Post by Chris Barker
much. It's gotten a bit tangled up, so I thought I'd start a new one to
Would it be a good thing for numpy to have a one-byte--per-character
string
Post by Chris Barker
type?
If you mean a string type that can only hold latin-1 characters then I think
that this is a step backwards.
If you mean a dtype that holds bytes in a known, specifiable encoding and
automatically decodes them to unicode strings when you call .item() and has a
friendly repr() then that may be a good idea.
So for example you could have dtype='S:utf-8' which would store strings
like:

>>> text = array(['foo', 'bar'], dtype='S:utf-8')
>>> text
array(['foo', 'bar'], dtype='|S3:utf-8')
>>> print(text)
['foo', 'bar']
>>> text[0]
'foo'
>>> text.nbytes
6
Post by Chris Barker
We did have that with the 'S' type in py2, but the changes in py3 have
made
Post by Chris Barker
it not quite the right thing. And it appears that enough people use 'S'
in
Post by Chris Barker
py3 to mean 'bytes', so that we can't change that now.
It wasn't really the right thing before either. That's why Python 3 has
changed all of this.
Post by Chris Barker
The only difference may be that 'S' currently auto translates to a bytes
np.array(['some text',], dtype='S')[0] == 'some text'
yielding False on Py3. And you can't do all the usual text stuff with
the
Post by Chris Barker
resulting bytes object, either. (and it probably used the default
encoding
Post by Chris Barker
to generate the bytes, so will barf on some inputs, though that may be
unavoidable.) So you need to decode the bytes that are given back, and
now
Post by Chris Barker
that I think about it, I have no idea what encoding you'd need to use in
the general case.
You should let the user specify the encoding or otherwise require them to use
the 'U' dtype.
Post by Chris Barker
So the correct solution is (particularly on py3) to use the 'U'
(unicode)
Post by Chris Barker
dtype for text in numpy arrays.
Absolutely. Embrace the Python 3 text model. Once you understand the how, what
and why of it you'll see that it really is a good thing!
Post by Chris Barker
However, the 'U' dtype is 4 bytes per character, and that may be "too
big"
Post by Chris Barker
for some use-cases. And there is a lot of text in scientific data sets
that
Post by Chris Barker
are pure ascii, or at least some 1-byte-per-character encoding.
So, in the spirit of having multiple numeric types that use different
amounts of memory, and can hold different ranges of values, a
one-byte-per
Post by Chris Barker
(note, this opens the door for a 2-byte per (UCS-2) dtype too, I
personally
Post by Chris Barker
don't think that's worth it, but maybe that's because I'm an english
speaker...)
You could just use a 2-byte encoding with the S dtype e.g.
dtype='S:utf-16-le'.
Post by Chris Barker
It could use the 's' (lower-case s) type identifier.
For passing to/from python built-in objects, it would
* Allow either Python bytes objects or Python unicode objects as input
a) bytes objects would be passed through as-is
b) unicode objects would be encoded as latin-1
[note: I'm not entirely sure that bytes objects should be allowed, but
it
Post by Chris Barker
would provide an nice efficiency in a fairly common case]
I think it would be a bad idea to accept bytes here. There are good reasons
that Python 3 creates a barrier between the two worlds of text and bytes.
Allowing implicit mixing of bytes and text is a recipe for mojibake. The
TypeErrors in Python 3 are used to guard against conceptual errors that lead
to data corruption. Attempting to undermine that barrier in numpy would be a
backward step.
I apologise if this is misplaced but there seems to be an attitude that
scientific programming isn't really affected by the issues that have led to
the Python 3 text model. I think that's ridiculous; data corruption is a
problem in scientific programming just as it is anywhere else.
Post by Chris Barker
* It would create python unicode text objects, decoded as latin-1.
Don't try to bless a particular encoding and stop trying to pretend that it's
possible to write a sensible system where end users don't need to worry about
and specify the encoding of their data.
Post by Chris Barker
Could we have a way to specify another encoding? I'm not sure how that
would fit into the dtype system.
If the encoding cannot be specified then the whole idea is misguided.
Post by Chris Barker
I've explained the latin-1 thing on other threads, but the short
- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you an UnicodeEncodeError regardless of what
arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through a encoding/decoding
operation.
... text = numpy.fromfile(fin, dtype='s')
>>> text[0] # Decodes as latin-1 leading to mojibake.
... text = numpy.fromfile(fin, dtype='s:utf-8')
There's really no way to get around the fact that users need to specify the
encoding of their text files.
Post by Chris Barker
(it still wouldn't allow you to store arbitrary unicode -- but that's
the
Post by Chris Barker
limitation of one-byte per character...)
You could if you use 'utf-8'. It would be one-byte-per-char for text that only
contains ascii characters. However it would still support every character that
the unicode consortium can dream up.
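
For instance, comparing character counts with utf-8 byte counts:

    for s in [u'plain ascii', u'\xd5scar', u'\u0394x = 0.1', u'\U0001f40d']:
        print(repr(s), len(s), len(s.encode('utf-8')))
    # ascii stays at 1 byte per char; other characters cost 2-4 bytes each,
    # but every unicode code point is representable.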
The only possible advantage here is as a memory optimisation (potentially
having a speed impact too although it could equally be a speed regression).
Otherwise it just adds needless complexity to numpy and to the code that uses
the new dtype as well as limiting its ability to handle unicode.
How significant are the performance issues? Does anyone really use numpy for
this kind of text handling? If you really are operating on gigantic text
arrays of ascii characters then is it so bad to just use the bytes dtype and
handle decoding/encoding at the boundaries? If you're not operating on
gigantic text arrays is there really a noticeable problem just using the 'U'
dtype?
I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is
really a problem because users of a text array want to do things like
filtering (`match_rows = text_array == 'match'`), printing, or other
manipulations in a natural way without having to continually use bytestring
literals or `.decode('ascii')` everywhere. I tried converting a few
packages while leaving the arrays as bytestrings and it just ended up as a
very big mess.
David Goldsmith
2014-01-21 17:28:19 UTC
Permalink
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant its
own home on the Wiki?

DG
Nathaniel Smith
2014-01-21 17:35:26 UTC
Permalink
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?

Sounds plausible, perhaps you could write up such a page?

-n
Chris Barker
2014-01-21 17:46:41 UTC
Permalink
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?
Or maybe a NEP?

https://github.com/numpy/numpy/tree/master/doc/neps

sorry -- really swamped this week, so I won't be writing it...

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Chris Barker
2014-01-21 18:00:19 UTC
Permalink
A lot of good discussion here -- too much to comment on individually, but it
seems we can boil it down to a couple of somewhat distinct proposals:

1) a one-byte-per-char dtype:

This would provide compact, high-efficiency storage for common text
for scientific computing. It is analogous to a lower-precision numeric type
-- i.e. it could not store arbitrary unicode strings, only the subset that
is compatible with the suggested encoding.
Suggested encoding: latin-1
Other options:
  - ascii only.
  - settable to any one-byte-per-char encoding supported by python.
    I like this IFF it's pretty easy, but it may add significant
    complications (and overhead) for comparisons, etc....

NOTE: This is NOT a way to conflate bytes and text, and not a way to "go
back to the py2 mojibake hell" -- the goal here is to very clearly have
this be text data, with a clearly defined encoding. Which is why we
can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way
to conveniently and efficiently use numpy for text that is ANSI compatible.
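
(A one-line reminder of why latin-1 is attractive as the default here:
every one of the 256 byte values maps to a code point, so decoding can
never fail and arbitrary bytes round-trip unchanged:

    raw = bytes(bytearray(range(256)))          # all 256 byte values
    assert raw.decode('latin-1').encode('latin-1') == raw
)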

2) a utf-8 dtype:
NOTE: this CAN NOT be used in place of (1) above. It is not a
one-byte-per-char encoding, so it would not fit snugly into the numpy data
model. It would give compact memory use for mostly-ascii data, so that
would be nice.

3) a fully python-3-like (PEP 393) flexible unicode dtype.
This would get us the advantages of the new py3 unicode model -- compact
and efficient when it can be, but also supporting all of unicode. Honestly,
this seems like more work than it's worth to me, at least given the current
numpy dtype model -- maybe a nice addition to dynd. You can, after
all, simply use an object array with py3 strings in it. Though perhaps
using the py3 unicode type, but with a dtype that specifically links to
it rather than to a generic python object, would be a good compromise.


Hmm -- I guess despite what I said, I just wrote the starting point for a
NEP...

(or two, actually...)

-Chris
Post by Chris Barker
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?
Or maybe a NEP?
https://github.com/numpy/numpy/tree/master/doc/neps
sorry -- really swamped this week, so I won't be writing it...
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Charles R Harris
2014-01-21 18:14:28 UTC
Permalink
Post by Chris Barker
<snip>
Hmm -- I guess despite what I said, I just wrote the starting point for a
NEP...
Should also mention the reasons for adding a new data type.

<snip>

Chuck
David Goldsmith
2014-01-21 17:53:25 UTC
Permalink
Date: Tue, 21 Jan 2014 17:35:26 +0000
Subject: Re: [Numpy-discussion] A one-byte string dtype?
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?
Sounds plausible, perhaps you could write up such a page?
-n
I can certainly get one started (but I don't think I can faithfully
summarize all this thread's current content, so I apologize in advance for
leaving that undone).

DG
David Goldsmith
2014-01-21 18:34:38 UTC
Permalink
Date: Tue, 21 Jan 2014 09:53:25 -0800
Subject: Re: [Numpy-discussion] A one-byte string dtype?
Date: Tue, 21 Jan 2014 17:35:26 +0000
Subject: Re: [Numpy-discussion] A one-byte string dtype?
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being
sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?
Sounds plausible, perhaps you could write up such a page?
-n
I can certainly get one started (but I don't think I can faithfully
summarize all this thread's current content, so I apologize in advance for
leaving that undone).
DG
OK, I'm "lost" already: is there general agreement that this should "jump"
straight to one or more NEP's? If not (or if there should be a Wiki page
for it additionally), should such become part of the NumPy Wiki @
Sourceforge or the SciPy Wiki at the scipy.org site? If the latter, is
one's SciPy Wiki login the same as one's mailing list subscriber
maintenance login? I guess starting such a page is not as trivial as I had
assumed.

DG
Robert Kern
2014-01-21 19:20:12 UTC
Permalink
Post by David Goldsmith
Post by David Goldsmith
I can certainly get one started (but I don't think I can faithfully
summarize all this thread's current content, so I apologize in advance for
leaving that undone).
DG
OK, I'm "lost" already: is there general agreement that this should
"jump" straight to one or more NEP's? If not (or if there should be a Wiki
page for it additionally), should such become part of the NumPy Wiki @
Sourceforge or the SciPy Wiki at the scipy.org site? If the latter, is
one's SciPy Wiki login the same as one's mailing list subscriber
maintenance login? I guess starting such a page is not as trivial as I had
assumed.

The wiki is frozen. Please do not add anything to it. It plays no role in
our current development workflow. Drafting a NEP or two and iterating on
them would be the next step.

--
Robert Kern
David Goldsmith
2014-01-22 00:58:30 UTC
Permalink
Date: Tue, 21 Jan 2014 19:20:12 +0000
Subject: Re: [Numpy-discussion] A one-byte string dtype?
The wiki is frozen. Please do not add anything to it. It plays no role in
our current development workflow. Drafting a NEP or two and iterating on
them would be the next step.
--
Robert Kern
OK, well that's definitely beyond my level of expertise.

DG
Chris Barker - NOAA Federal
2014-01-22 01:46:42 UTC
Permalink
Post by David Goldsmith
OK, well that's definitely beyond my level of expertise.
Well, it's in github--now's as good a time as any to learn github
collaboration...

-Fork the numpy source.

-Create a new file in:
numpy/doc/neps

Point folks to it here so they can comment, etc.

At some point, issue a pull request, and it can get merged into the
main source for final polishing...

-Chris
Post by David Goldsmith
DG