Discussion:
[Numpy-discussion] A one-byte string dtype?
Chris Barker
2014-01-17 22:30:19 UTC
Folks,

I've been blathering away on the related threads a lot -- sorry if it's too
much. It's gotten a bit tangled up, so I thought I'd start a new one to
address this one question (i.e. don't bring up genfromtxt here):

Would it be a good thing for numpy to have a one-byte-per-character string
type?

We did have that with the 'S' type in py2, but the changes in py3 have made
it not quite the right thing. And it appears that enough people use 'S' in
py3 to mean 'bytes', so that we can't change that now.

The only difference may be that 'S' currently auto translates to a bytes
object, resulting in things like:

np.array(['some text',], dtype='S')[0] == 'some text'

yielding False on Py3. And you can't do all the usual text stuff with the
resulting bytes object, either. (and it probably used the default encoding
to generate the bytes, so will barf on some inputs, though that may be
unavoidable.) So you need to decode the bytes that are given back, and now
that I think about it, I have no idea what encoding you'd need to use in
the general case.
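
For concreteness, here's roughly what that looks like today (an illustrative
session; exact reprs vary a bit between numpy versions):

>>> import numpy as np
>>> a = np.array(['some text'], dtype='S')
>>> a[0]                      # comes back as bytes on py3
b'some text'
>>> a[0] == 'some text'       # bytes never compare equal to str
False
>>> a[0].decode('ascii')      # you have to pick an encoding yourself
'some text'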

So the correct solution is (particularly on py3) to use the 'U' (unicode)
dtype for text in numpy arrays.

However, the 'U' dtype is 4 bytes per character, and that may be "too big"
for some use-cases. And there is a lot of text in scientific data sets that
is pure ascii, or at least in some 1-byte-per-character encoding.
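
To put rough numbers on it (three 5-character strings, sizes as numpy
reports them):

>>> import numpy as np
>>> np.array(['alpha', 'gamma', 'delta']).nbytes     # dtype '<U5': 3 * 5 * 4 bytes
60
>>> np.array([b'alpha', b'gamma', b'delta']).nbytes  # dtype '|S5': 3 * 5 * 1 bytes
15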

So, in the spirit of having multiple numeric types that use different
amounts of memory, and can hold different ranges of values, a
one-byte-per-character dtype would be nice:

(note, this opens the door for a 2-byte-per-character (UCS-2) dtype too; I
personally don't think that's worth it, but maybe that's because I'm an
English speaker...)

It could use the 's' (lower-case s) type identifier.

For passing to/from python built-in objects, it would

* Allow either Python bytes objects or Python unicode objects as input
a) bytes objects would be passed through as-is
b) unicode objects would be encoded as latin-1

[note: I'm not entirely sure that bytes objects should be allowed, but it
would provide a nice efficiency gain in a fairly common case]

* It would create python unicode text objects, decoded as latin-1.

Could we have a way to specify another encoding? I'm not sure how that
would fit into the dtype system.

I've explained the latin-1 thing on other threads, but the short version is:

- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you a UnicodeDecodeError, regardless of what
arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through an encoding/decoding
round trip.

(it still wouldn't allow you to store arbitrary unicode -- but that's the
limitation of one-byte per character...)
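
To see why latin-1 is the safe choice for that, here's the round-trip
property in plain Python (nothing numpy-specific):

>>> raw = bytes(range(256))      # every possible byte value
>>> raw.decode('latin-1').encode('latin-1') == raw
True
>>> raw.decode('ascii')          # any stricter 1-byte assumption can blow up
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 128: ordinal not in range(128)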

So:

Bad idea all around: shut up already!

or

Fine idea, but who's going to write the code? not me!

or

We really should do this.

(of course, with the options of amending the above not-very-fleshed out
proposal)

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Aldcroft, Thomas
2014-01-17 23:05:16 UTC
Post by Chris Barker
Would it be a good thing for numpy to have a one-byte-per-character
string type?
<snip>
Bad idea all around: shut up already!
or
Fine idea, but who's going to write the code? not me!
or
We really should do this.
As evident from what I said in the previous thread, YES, this should really
be done!

One important feature would be changing the dtype from 'S' to 's' without
any memory copies, so that conversion would be very cheap. Maybe this
would essentially come for free with something like astype('s', copy=False).
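
Something like this sketch, using today's .view() as a stand-in for the
hypothetical 's' dtype (same itemsize, same buffer, nothing copied):

>>> import numpy as np
>>> a = np.array([b'foo', b'barbaz'], dtype='S6')
>>> b = a.view('S6')       # imagine a.view('s6') or a.astype('s6', copy=False)
>>> b.base is a            # b shares a's memory; no data was copied
True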

- Tom
Oscar Benjamin
2014-01-20 10:11:15 UTC
Post by Chris Barker
Would it be a good thing for numpy to have a one-byte-per-character string
type?
If you mean a string type that can only hold latin-1 characters then I think
that this is a step backwards.

If you mean a dtype that holds bytes in a known, specifiable encoding and
automatically decodes them to unicode strings when you call .item() and has a
friendly repr() then that may be a good idea.

So for example you could have dtype='S:utf-8' which would store strings
as utf-8 encoded bytes:

>>> text = array(['foo', 'bar'], dtype='S:utf-8')
>>> text
array(['foo', 'bar'], dtype='|S3:utf-8')
>>> print(text)
['foo', 'bar']
>>> text[0]
'foo'
>>> text.nbytes
6
Post by Chris Barker
We did have that with the 'S' type in py2, but the changes in py3 have made
it not quite the right thing. And it appears that enough people use 'S' in
py3 to mean 'bytes', so that we can't change that now.
It wasn't really the right thing before either. That's why Python 3 has
changed all of this.
Post by Chris Barker
The only difference may be that 'S' currently auto translates to a bytes
np.array(['some text',], dtype='S')[0] == 'some text'
yielding False on Py3. And you can't do all the usual text stuff with the
resulting bytes object, either. (and it probably used the default encoding
to generate the bytes, so will barf on some inputs, though that may be
unavoidable.) So you need to decode the bytes that are given back, and now
that I think about it, I have no idea what encoding you'd need to use in
the general case.
You should let the user specify the encoding or otherwise require them to use
the 'U' dtype.
Post by Chris Barker
So the correct solution is (particularly on py3) to use the 'U' (unicode)
dtype for text in numpy arrays.
Absolutely. Embrace the Python 3 text model. Once you understand the how, what
and why of it you'll see that it really is a good thing!
Post by Chris Barker
However, the 'U' dtype is 4 bytes per character, and that may be "too big"
for some use-cases. And there is a lot of text in scientific data sets that
are pure ascii, or at least some 1-byte-per-character encoding.
So, in the spirit of having multiple numeric types that use different
amounts of memory, and can hold different ranges of values, a one-byte-per-character dtype would be nice:
(note, this opens the door for a 2-byte per (UCS-2) dtype too, I personally
don't think that's worth it, but maybe that's because I'm an english
speaker...)
You could just use a 2-byte encoding with the S dtype e.g.
dtype='S:utf-16-le'.
Post by Chris Barker
It could use the 's' (lower-case s) type identifier.
For passing to/from python built-in objects, it would
* Allow either Python bytes objects or Python unicode objects as input
a) bytes objects would be passed through as-is
b) unicode objects would be encoded as latin-1
[note: I'm not entirely sure that bytes objects should be allowed, but it
would provide an nice efficiency in a fairly common case]
I think it would be a bad idea to accept bytes here. There are good reasons
that Python 3 creates a barrier between the two worlds of text and bytes.
Allowing implicit mixing of bytes and text is a recipe for mojibake. The
TypeErrors in Python 3 are used to guard against conceptual errors that lead
to data corruption. Attempting to undermine that barrier in numpy would be a
backward step.

I apologise if this is misplaced but there seems to be an attitude that
scientific programming isn't really affected by the issues that have led to
the Python 3 text model. I think that's ridiculous; data corruption is a
problem in scientific programming just as it is anywhere else.
Post by Chris Barker
* It would create python unicode text objects, decoded as latin-1.
Don't try to bless a particular encoding and stop trying to pretend that it's
possible to write a sensible system where end users don't need to worry about
and specify the encoding of their data.
Post by Chris Barker
Could we have a way to specify another encoding? I'm not sure how that
would fit into the dtype system.
If the encoding cannot be specified then the whole idea is misguided.
Post by Chris Barker
- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you an UnicodeEncodeError regardless of what
arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through a encoding/decoding
operation.
... text = numpy.fromfile(fin, dtype='s')
... text[0]  # Decodes as latin-1 leading to mojibake.

... text = numpy.fromfile(fin, dtype='s:utf-8')

There's really no way to get around the fact that users need to specify the
encoding of their text files.
Post by Chris Barker
(it still wouldn't allow you to store arbitrary unicode -- but that's the
limitation of one-byte per character...)
You could if you use 'utf-8'. It would be one-byte-per-char for text that only
contains ascii characters. However it would still support every character that
the unicode consortium can dream up.
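
For example (plain Python, just counting encoded bytes):

>>> len('numpy'.encode('utf-8'))    # pure ascii: 1 byte per character
5
>>> len('naïve'.encode('utf-8'))    # 'ï' takes 2 bytes
6
>>> len('δ'.encode('utf-8'))        # so does a Greek delta
2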

The only possible advantage here is as a memory optimisation (potentially
having a speed impact too although it could equally be a speed regression).
Otherwise it just adds needless complexity to numpy and to the code that uses
the new dtype as well as limiting its ability to handle unicode.

How significant are the performance issues? Does anyone really use numpy for
this kind of text handling? If you really are operating on gigantic text
arrays of ascii characters then is it so bad to just use the bytes dtype and
handle decoding/encoding at the boundaries? If you're not operating on
gigantic text arrays is there really a noticeable problem just using the 'U'
dtype?


Oscar
Aldcroft, Thomas
2014-01-20 15:00:55 UTC
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
Post by Oscar Benjamin
<snip>
How significant are the performance issues? Does anyone really use numpy for
this kind of text handling? If you really are operating on gigantic text
arrays of ascii characters then is it so bad to just use the bytes dtype and
handle decoding/encoding at the boundaries? If you're not operating on
gigantic text arrays is there really a noticeable problem just using the 'U'
dtype?
I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.

As discussed in the previous parent thread, using the bytes dtype is really
a problem because users of a text array want to do things like filtering
(`match_rows = text_array == 'match'`), printing, or other manipulations in
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere. I tried converting a few packages while
leaving the arrays as bytestrings and it just ended up as a very big mess.
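
A small illustration of the kind of friction I mean (the exact result and
warning for the mismatched comparison vary by numpy version, which is part
of the problem):

>>> import numpy as np
>>> names = np.array([b'alpha', b'beta', b'gamma'])    # dtype '|S5' on py3
>>> names == 'beta'        # the natural spelling finds nothing: bytes != str
>>> names == b'beta'       # works, but forces b'' literals all over the code
array([False,  True, False])
>>> np.char.decode(names, 'ascii') == 'beta'   # or decode everything (copies)
array([False,  True, False])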
Oscar Benjamin
2014-01-20 15:40:42 UTC
Post by Aldcroft, Thomas
Post by Oscar Benjamin
How significant are the performance issues? Does anyone really use numpy for
this kind of text handling?
<snip>
I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really
a problem because users of a text array want to do things like filtering
(`match_rows = text_array == 'match'`), printing, or other manipulations in
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere. I tried converting a few packages while
leaving the arrays as bytestrings and it just ended up as a very big mess.
And why are you needing to write .decode('ascii') everywhere?
Aldcroft, Thomas
2014-01-20 17:12:06 UTC
Post by Oscar Benjamin
Post by Aldcroft, Thomas
<snip>
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere.
And why are you needing to write .decode('ascii') everywhere?

print("The first value is {}".format(bytestring_array[0]))

On Python 2 this gives "The first value is string_value", while on
Python 3 this gives "The first value is b'string_value'".
Charles R Harris
2014-01-20 17:21:27 UTC
Post by Aldcroft, Thomas
<snip>
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on
Python 3 this gives "The first value is b'string_value'".

As Nathaniel has mentioned, this is a known problem with Python 3 and the
developers are trying to come up with a solution. Python 3.4 solves some
existing problems, but this one remains. It's not just numpy here, it's
that python itself needs to provide some help.
Oscar Benjamin
2014-01-20 18:40:32 UTC
Post by Charles R Harris
<snip>
As Nathaniel has mentioned, this is a known problem with Python 3 and the
developers are trying to come up with a solution. Python 3.4 solves some
existing problems, but this one remains. It's not just numpy here, it's
that python itself needs to provide some help.

If you think that anything in core Python will change so that you can mix
text and bytes as above then I think you are very much mistaken. If you're
referring to PEP 460/461 then you have misunderstood the purpose of those
PEPs. The authors and reviewers will carefully ensure that nothing changes
to make the above work the way that it did in 2.x.

Oscar
Charles R Harris
2014-01-20 20:34:56 UTC
Post by Oscar Benjamin
<snip>
If you think that anything in core Python will change so that you can mix
text and bytes as above then I think you are very much mistaken. If you're
referring to PEP 460/461 then you have misunderstood the purpose of those
PEPs. The authors and reviewers will carefully ensure that nothing changes
to make the above work the way that it did in 2.x.
I think we may want something like PEP 393
(http://www.python.org/dev/peps/pep-0393/).
The S datatype may be the wrong place to look, we might want a modification
of U instead so as to transparently get the benefit of python strings.

Chuck
Oscar Benjamin
2014-01-20 21:27:48 UTC
Post by Charles R Harris
I think we may want something like PEP 393. The S datatype may be the
wrong place to look, we might want a modification of U instead so as to
transparently get the benefit of python strings.

The approach taken in PEP 393 (the FSR) makes more sense for str than it
does for numpy arrays for two reasons: str is immutable and opaque.

Since str is immutable the maximum code point in the string can be
determined once when the string is created before anything else can get a
pointer to the string buffer.

Since it is opaque no one can rightly expect it to expose a particular
binary format so it is free to choose without compromising any expected
semantics.

If someone can call buffer on an array then the FSR is a semantic change.

If a numpy 'U' array used the FSR and consisted only of ASCII characters
then it would have a one byte per char buffer. What then happens if you put
a higher code point in? The buffer needs to be resized and the data copied
over. But then what happens to any buffer objects or array views? They
would be pointing at the old buffer from before the resize. Subsequent
modifications to the resized array would not show up in other views and
vice versa.
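
To make the aliasing concrete with today's fixed 4-bytes-per-code-point 'U'
buffer (view semantics assumed to stay as they are):

>>> import numpy as np
>>> a = np.array(['abc', 'xyz'], dtype='U3')
>>> v = a.view(np.uint32)    # another view onto exactly the same buffer
>>> a[0] = 'zzz'
>>> v[:3]                    # the view sees the change because memory is shared
array([122, 122, 122], dtype=uint32)

Under a PEP-393-style layout the buffer width would have to change as soon as
a higher code point was stored, leaving views like v pointing at stale memory.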

I don't think that this can be done transparently since users of a numpy
array need to know about the binary representation. That's why I suggest a
dtype that has an encoding. Only in that way can it consistently have both
a binary and a text interface.

Oscar
Charles R Harris
2014-01-20 22:28:09 UTC
Post by Oscar Benjamin
<snip>
I don't think that this can be done transparently since users of a numpy
array need to know about the binary representation. That's why I suggest a
dtype that has an encoding. Only in that way can it consistently have both
a binary and a text interface.
I didn't say we should change the S type, but that we should have
something, say 's', that appeared to python as a string. I think if we want
transparent string interoperability with python together with a compressed
representation, and I think we need both, we are going to have to deal with
the difficulties of utf-8. That means raising errors if the string doesn't
fit in the allotted size, etc. Mind, this is a workaround for the mass of
ascii data that is already out there, not a substitute for 'U'.
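
The awkward part in one line (plain Python; the fixed 3-byte utf-8 field is
of course hypothetical):

>>> len('abc'.encode('utf-8'))    # 3 characters -> 3 bytes, fits a 3-byte slot
3
>>> len('été'.encode('utf-8'))    # 3 characters -> 5 bytes, would have to raise
5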

Chuck
Nathaniel Smith
2014-01-20 22:35:12 UTC
Post by Charles R Harris
<snip>
I didn't say we should change the S type, but that we should have something,
say 's', that appeared to python as a string. I think if we want transparent
string interoperability with python together with a compressed
representation, and I think we need both, we are going to have to deal with
the difficulties of utf-8. That means raising errors if the string doesn't
fit in the allotted size, etc. Mind, this is a workaround for the mass of
ascii data that is already out there, not a substitute for 'U'.
If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...
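
For anyone who wants the workaround now, the object-array version looks like
this (each element is an ordinary Python str: variable length and full
unicode, at the cost of one heap object per element):

>>> import numpy as np
>>> names = np.array(['alpha', 'β', 'a much longer string'], dtype=object)
>>> names == 'β'       # comparisons go through Python str, so they just work
array([False,  True, False])
>>> names.nbytes       # counts only the 8-byte pointers, not the strings
24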

Though, IMO any new dtype here would need a cleanup of the dtype code
first so that it doesn't require yet more massive special cases all
over umath.so.

-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
Charles R Harris
2014-01-20 22:58:26 UTC
Post by Nathaniel Smith
<snip>
If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...
Though, IMO any new dtype here would need a cleanup of the dtype code
first so that it doesn't require yet more massive special cases all
over umath.so.
Worth thinking about. As another alternative, what is the minimum we need
to make a restricted encoding, say latin-1, appear transparently as a
unicode string to python? I know the python folks don't like this much, but
I suspect something along that line will eventually be required for the
http folks.

Chuck
Charles R Harris
2014-01-20 23:12:20 UTC
Post by Nathaniel Smith
<snip>
If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...
The more I think about it, the more I think we may need to do that. Note
that dynd has ragged arrays and I think they are implemented as pointers to
buffers. The easy way for us to do that would be a specialization of object
arrays to string types only as you suggest.

<snip>

Chuck
Oscar Benjamin
2014-01-21 11:13:36 UTC
Post by Charles R Harris
<snip>
The more I think about it, the more I think we may need to do that. Note
that dynd has ragged arrays and I think they are implemented as pointers to
buffers. The easy way for us to do that would be a specialization of object
arrays to string types only as you suggest.
This wouldn't necessarily help for the gigarows of short text strings use case
(depending on what "short" means). Also even if it technically saves memory
you may have a greater overhead from fragmenting your array all over the heap.

On my 64 bit Linux system the size of a Python 3.3 str containing only ASCII
characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory
saving over dtype='U' only if the strings are 17 characters or more. To get a
50% saving over dtype='U' you'd need strings of at least 49 characters.
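
Easy to check on a 64-bit CPython (the 49-byte constant is as of Python 3.3;
it has shrunk slightly in later releases):

>>> import sys, numpy as np
>>> sys.getsizeof('x' * 20)                     # 49 + 20
69
>>> np.array(['x' * 20], dtype='U20').nbytes    # 4 * 20
80

An object array additionally pays an 8-byte pointer per element on top of the
per-str overhead.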

If the Numpy array would manage the buffers itself then that per string memory
overhead would be eliminated in exchange for an 8 byte pointer and at least 1
byte to represent the length of the string (assuming you can somehow use
Pascal strings when short enough - null bytes cannot be used). This gives an
overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory
if the strings are more than 3 characters long and you get at least a 50%
saving for strings longer than 9 characters.

Using utf-8 in the buffers eliminates the need to go around checking maximum
code points etc. so I would guess that would be simpler to implement (CPython
has now had to triple all of its code paths that actually access the string
buffer).


Oscar
Nathaniel Smith
2014-01-21 11:41:30 UTC
Post by Oscar Benjamin
If the Numpy array would manage the buffers itself then that per string memory
overhead would be eliminated in exchange for an 8 byte pointer and at least 1
byte to represent the length of the string (assuming you can somehow use
Pascal strings when short enough - null bytes cannot be used). This gives an
overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory
if the strings are more than 3 characters long and you get at least a 50%
saving for strings longer than 9 characters.
There are various optimisations possible as well.

For ASCII strings of up to length 8, one could also use tagged pointers to
eliminate the lookaside buffer entirely. (Alignment rules mean that
pointers to allocated buffers always have the low bits zero; so you can
make a rule that if the low bit is set to one, then this means the
"pointer" itself should be interpreted as containing the string data; use
the spare bit in the other bytes to encode the length.)
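
A rough sketch of the packing trick in Python, purely for illustration -- in
numpy this would be a few lines of C working directly on the 8-byte element;
this simplified variant inlines up to 7 ASCII bytes and keeps the length in
the tag byte:

def pack_short_ascii(s):
    data = s.encode('ascii')
    assert len(data) <= 7
    word = 1 | (len(data) << 1)         # low byte: tag bit set + length
    for i, b in enumerate(data):
        word |= b << (8 * (i + 1))      # string bytes live in the upper 7 bytes
    return word                         # fits in one 64-bit "pointer" slot

def unpack(word):
    if word & 1:                        # tag bit set: the string is stored inline
        n = (word >> 1) & 0x7f
        return bytes((word >> (8 * (i + 1))) & 0xff for i in range(n)).decode('ascii')
    # tag bit clear: an ordinary aligned pointer into the lookaside buffer
    raise NotImplementedError("dereference the lookaside buffer here")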

In some cases it may also make sense to let identical strings share
buffers, though this adds some overhead for reference counting and
interning.

-n
Oscar Benjamin
2014-01-21 12:30:08 UTC
Post by Nathaniel Smith
There are various optimisations possible as well.
For ASCII strings of up to length 8, one could also use tagged pointers to
eliminate the lookaside buffer entirely. (Alignment rules mean that
pointers to allocated buffers always have the low bits zero; so you can
make a rule that if the low bit is set to one, then this means the
"pointer" itself should be interpreted as containing the string data; use
the spare bit in the other bytes to encode the length.)
In some cases it may also make sense to let identical strings share
buffers, though this adds some overhead for reference counting and
interning.
Would this new dtype have an opaque memory representation? What would happen
with something like:

... a = numpy.array(['CGA', 'GAT'], dtype='s')
... memoryview(a)
... a.tofile(fout)
... a = numpy.fromfile(fin, dtype='s')

Should there be a different function for creating such an array from reading a
text file:

... a = numpy.fromiter(fin, dtype='s')
... fout.writelines(line + '\n' for line in a)

(Note that the above would not be reversible if the strings contain newlines)

Would it be less confusing to use dtype='u' rather than dtype='s', in order to
signify that it is an optimised form of the 'U' dtype as far as access from
Python code is concerned? Calling it 's' only really makes sense if there is a
plan to deprecate dtype='S'.

How would it behave in Python 2? Would it return unicode strings there as
well?


Oscar
Aldcroft, Thomas
2014-01-21 12:54:21 UTC
Post by Charles R Harris
<snip>
The more I think about it, the more I think we may need to do that. Note
that dynd has ragged arrays and I think they are implemented as pointers to
buffers. The easy way for us to do that would be a specialization of object
arrays to string types only as you suggest.
Is this approach intended to be in *addition to* the latin-1 "s" type
originally proposed by Chris, or *instead of* that?

- Tom
Charles R Harris
2014-01-21 13:55:29 UTC
Post by Aldcroft, Thomas
<snip>
Is this approach intended to be in *addition to* the latin-1 "s" type
originally proposed by Chris, or *instead of* that?
Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.

Chuck
Aldcroft, Thomas
2014-01-21 14:37:11 UTC
Post by Charles R Harris
<snip>
Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.
Since it's open for discussion, I'll put in my vote for implementing the
easier latin-1 version in the short term to facilitate Python 2 / 3
interoperability. This would solve my use-case (giga-rows of short fixed
length strings), and presumably allow things like memory mapping of large
data files (like for FITS files in astropy.io.fits).
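
To put rough numbers on that use case (sizes here are illustrative only):

    import numpy as np

    n = 10**6                        # scale up to giga-rows as needed
    s8 = np.zeros(n, dtype='S8')     # 1 byte per character
    u8 = np.zeros(n, dtype='U8')     # 4 bytes per character (UCS-4)
    print(s8.nbytes)                 #  8000000
    print(u8.nbytes)                 # 32000000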

I don't have a clue how the current 'U' dtype works under the hood, but
from my user perspective it seems to work just fine in terms of interacting
with Python 3 strings. Is there a technical problem with doing basically
the same thing for an 's' dtype, but using latin-1 instead of UCS-4?

Thanks,
Tom
Post by Charles R Harris
Chuck
Charles R Harris
2014-01-21 14:48:11 UTC
Permalink
On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas <
On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris <
<snip>
Since it's open for discussion, I'll put in my vote for implementing the
easier latin-1 version in the short term to facilitate Python 2 / 3
interoperability. This would solve my use-case (giga-rows of short fixed
length strings), and presumably allow things like memory mapping of large
data files (like for FITS files in astropy.io.fits).
I don't have a clue how the current 'U' dtype works under the hood, but
from my user perspective it seems to work just fine in terms of interacting
with Python 3 strings. Is there a technical problem with doing basically
the same thing for an 's' dtype, but using latin-1 instead of UCS-4?
I think there is a technical problem. We may be able to masquerade latin-1
as utf-8 for some subset of characters, or fool python 3 in some other way.
But in any case, I think it needs some research to see what the
possibilities are.
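
One quick way to see where a masquerade runs into trouble: latin-1 and
utf-8 only agree on the ASCII range, so any latin-1 byte >= 0x80 is not
valid utf-8 on its own:

    print(b'ascii'.decode('latin-1') == b'ascii'.decode('utf-8'))  # True

    data = u'\xd5scar'.encode('latin-1')   # b'\xd5scar'
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        print('not valid utf-8: %s' % exc)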

Chuck
Sebastian Berg
2014-01-21 15:10:01 UTC
Permalink
Post by Charles R Harris
<snip>
I think there is a technical problem. We may be able to masquerade latin-1
as utf-8 for some subset of characters, or fool python 3 in some other way.
But in any case, I think it needs some research to see what the
possibilities are.
I am not quite sure, but shouldn't it even be possible to tag an encoding
onto the metadata of the string dtype, and allow it to be set to any
1-byte-wide encoding that python understands? If the metadata is not None,
all entry points to and from the array (Object->string, string->Object
conversions) would then decode or encode using the usual python string
machinery.

Of course it would still be a lot of work, since the string comparisons
would need to know about comparing different encodings, dtype equivalence
would be wrong, and all the conversions would need to be carefully checked.
Most string tools probably don't care about the encoding as long as it is a
fixed 1-byte width, though one would have to check that they don't lose the
encoding information by creating a new "S" array...
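
A sketch of that idea using the (little-known) dtype metadata argument --
nothing in numpy acts on this metadata today, so this only shows where the
information could live; whether it survives every operation is exactly the
kind of thing that would need auditing:

    import numpy as np

    # numpy itself ignores this metadata; it is just carried on the dtype.
    dt = np.dtype('S16', metadata={'encoding': 'latin-1'})
    print(dt.metadata)                    # {'encoding': 'latin-1'}

    a = np.array([b'\xd5scar'], dtype=dt)
    enc = (a.dtype.metadata or {}).get('encoding', 'ascii')
    print(a[0].decode(enc))               # decode at the boundary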

- Sebastian
Post by Charles R Harris
Chuck
Oscar Benjamin
2014-01-21 14:43:31 UTC
Permalink
Post by Charles R Harris
Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4
On Python 2, unicode strings can operate transparently with byte strings:

$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array([u'\xd5scar'], dtype='U')
>>> a
array([u'\xd5scar'],
      dtype='<U5')
>>> a[0]
u'\xd5scar'
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print(a[0]) # Encodes as 'utf-8'
Õscar
>>> 'My name is %s' % a[0] # Decodes as ASCII
u'My name is \xd5scar'
>>> print('My name is %s' % a[0]) # Encodes as UTF-8
My name is Õscar

This is no better or worse than the rest of the Py2 text model. So if the new
dtype always returns a unicode string under Py2 it should work (as well as the
Py2 text model ever does).
Post by Charles R Harris
and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.
What do you mean by this? PEP 393 uses UCS-1/2/4 not utf-8/16/32 i.e. it
always uses a fixed-width encoding.

You can just use the CPython C-API to create the unicode strings. The
simplest way is probably to use utf-8 internally and then call
PyUnicode_DecodeUTF8 and PyUnicode_EncodeUTF8 at the boundaries. This
should work fine on Python 2.x and 3.x. It obviates any need to think about
pre-3.3 narrow and wide builds and post-3.3 FSR formats.
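
In Python-level terms the boundary convention would look something like the
sketch below (here the existing 'S' dtype stands in for the proposed utf-8
storage; a real implementation would do the equivalent with the C calls
above):

    import numpy as np

    # Store utf-8 bytes in a fixed-width field; decode only when an element
    # crosses the boundary back into Python.
    raw = np.array([u'\xd5scar'.encode('utf-8'), b'foo'], dtype='S8')
    print(raw[0].decode('utf-8'))          # decode on the way out
    raw[1] = u'na\xefve'.encode('utf-8')   # encode on the way in (6 bytes, fits)
    print(raw[1].decode('utf-8'))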

Unlike Python's str there isn't much need to be able to efficiently slice or
index within the string array element. Indexing into the array to get the
string requires creating a new object, so you may as well just decode from
utf-8 at that point [it's big-O(num chars) either way]. There's no need to
constrain it to fixed-width encodings like the FSR in which case utf-8 is
clearly the best choice as:

1) It covers the whole unicode spectrum.
2) It uses 1 byte-per-char for ASCII.
3) UTF-8 is a big optimisation target for CPython (so it's fast).


Oscar
j***@gmail.com
2014-01-20 20:13:08 UTC
Permalink
On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin
Post by Aldcroft, Thomas
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
Post by Oscar Benjamin
How significant are the performance issues? Does anyone really use
numpy
for
this kind of text handling? If you really are operating on gigantic text
arrays of ascii characters then is it so bad to just use the bytes
dtype
and
handle decoding/encoding at the boundaries? If you're not operating on
gigantic text arrays is there really a noticeable problem just using
the
'U'
dtype?
I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really
a problem because users of a text array want to do things like filtering
(`match_rows = text_array == 'match'`), printing, or other manipulations in
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere. I tried converting a few packages while
leaving the arrays as bytestrings and it just ended up as a very big mess.
Charles R Harris
2014-01-20 17:17:21 UTC
Permalink
On Mon, Jan 20, 2014 at 8:00 AM, Aldcroft, Thomas <
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <
Post by Chris Barker
Post by Chris Barker
Folks,
I've been blathering away on the related threads a lot -- sorry if it's
too
Post by Chris Barker
much. It's gotten a bit tangled up, so I thought I'd start a new one to
Would it be a good thing for numpy to have a one-byte--per-character
string
Post by Chris Barker
type?
If you mean a string type that can only hold latin-1 characters then I think
that this is a step backwards.
If you mean a dtype that holds bytes in a known, specifiable encoding and
automatically decodes them to unicode strings when you call .item() and has a
friendly repr() then that may be a good idea.
So for example you could have dtype='S:utf-8' which would store strings
like:

>>> text = array(['foo', 'bar'], dtype='S:utf-8')
>>> text
array(['foo', 'bar'], dtype='|S3:utf-8')
>>> print(text)
['foo', 'bar']
>>> text[0]
'foo'
>>> text.nbytes
6
Post by Chris Barker
We did have that with the 'S' type in py2, but the changes in py3 have
made
Post by Chris Barker
it not quite the right thing. And it appears that enough people use 'S'
in
Post by Chris Barker
py3 to mean 'bytes', so that we can't change that now.
It wasn't really the right thing before either. That's why Python 3 has
changed all of this.
Post by Chris Barker
The only difference may be that 'S' currently auto translates to a bytes
np.array(['some text',], dtype='S')[0] == 'some text'
yielding False on Py3. And you can't do all the usual text stuff with
the
Post by Chris Barker
resulting bytes object, either. (and it probably used the default
encoding
Post by Chris Barker
to generate the bytes, so will barf on some inputs, though that may be
unavoidable.) So you need to decode the bytes that are given back, and
now
Post by Chris Barker
that I think about it, I have no idea what encoding you'd need to use in
the general case.
You should let the user specify the encoding or otherwise require them to use
the 'U' dtype.
Post by Chris Barker
So the correct solution is (particularly on py3) to use the 'U'
(unicode)
Post by Chris Barker
dtype for text in numpy arrays.
Absolutely. Embrace the Python 3 text model. Once you understand the how, what
and why of it you'll see that it really is a good thing!
Post by Chris Barker
However, the 'U' dtype is 4 bytes per character, and that may be "too
big"
Post by Chris Barker
for some use-cases. And there is a lot of text in scientific data sets
that
Post by Chris Barker
are pure ascii, or at least some 1-byte-per-character encoding.
So, in the spirit of having multiple numeric types that use different
amounts of memory, and can hold different ranges of values, a
one-byte-per
Post by Chris Barker
(note, this opens the door for a 2-byte per (UCS-2) dtype too, I
personally
Post by Chris Barker
don't think that's worth it, but maybe that's because I'm an english
speaker...)
You could just use a 2-byte encoding with the S dtype e.g.
dtype='S:utf-16-le'.
Post by Chris Barker
It could use the 's' (lower-case s) type identifier.
For passing to/from python built-in objects, it would
* Allow either Python bytes objects or Python unicode objects as input
a) bytes objects would be passed through as-is
b) unicode objects would be encoded as latin-1
[note: I'm not entirely sure that bytes objects should be allowed, but
it
Post by Chris Barker
would provide an nice efficiency in a fairly common case]
I think it would be a bad idea to accept bytes here. There are good reasons
that Python 3 creates a barrier between the two worlds of text and bytes.
Allowing implicit mixing of bytes and text is a recipe for mojibake. The
TypeErrors in Python 3 are used to guard against conceptual errors that lead
to data corruption. Attempting to undermine that barrier in numpy would be a
backward step.
I apologise if this is misplaced but there seems to be an attitude that
scientific programming isn't really affected by the issues that have led to
the Python 3 text model. I think that's ridiculous; data corruption is a
problem in scientific programming just as it is anywhere else.
Post by Chris Barker
* It would create python unicode text objects, decoded as latin-1.
Don't try to bless a particular encoding and stop trying to pretend that it's
possible to write a sensible system where end users don't need to worry about
and specify the encoding of their data.
Post by Chris Barker
Could we have a way to specify another encoding? I'm not sure how that
would fit into the dtype system.
If the encoding cannot be specified then the whole idea is misguided.
Post by Chris Barker
I've explained the latin-1 thing on other threads, but the short
- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you an UnicodeEncodeError regardless of what
arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through a encoding/decoding
operation.
... text = numpy.fromfile(fin, dtype='s')
>>> text[0] # Decodes as latin-1 leading to mojibake.
... text = numpy.fromfile(fin, dtype='s:utf-8')
There's really no way to get around the fact that users need to specify the
encoding of their text files.
Post by Chris Barker
(it still wouldn't allow you to store arbitrary unicode -- but that's
the
Post by Chris Barker
limitation of one-byte per character...)
You could if you use 'utf-8'. It would be one-byte-per-char for text that only
contains ascii characters. However it would still support every character that
the unicode consortium can dream up.
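
For instance, comparing character counts with utf-8 byte counts:

    for s in [u'plain ascii', u'\xd5scar', u'\u0394x = 0.1', u'\U0001f40d']:
        print(repr(s), len(s), len(s.encode('utf-8')))
    # ascii stays at 1 byte per char; other characters cost 2-4 bytes each,
    # but every unicode code point is representable.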
The only possible advantage here is as a memory optimisation (potentially
having a speed impact too although it could equally be a speed regression).
Otherwise it just adds needless complexity to numpy and to the code that uses
the new dtype as well as limiting its ability to handle unicode.
How significant are the performance issues? Does anyone really use numpy for
this kind of text handling? If you really are operating on gigantic text
arrays of ascii characters then is it so bad to just use the bytes dtype and
handle decoding/encoding at the boundaries? If you're not operating on
gigantic text arrays is there really a noticeable problem just using the 'U'
dtype?
I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is
really a problem because users of a text array want to do things like
filtering (`match_rows = text_array == 'match'`), printing, or other
manipulations in a natural way without having to continually use bytestring
literals or `.decode('ascii')` everywhere. I tried converting a few
packages while leaving the arrays as bytestrings and it just ended up as a
very big mess.
David Goldsmith
2014-01-21 17:28:19 UTC
Permalink
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant its
own home on the Wiki?

DG
Nathaniel Smith
2014-01-21 17:35:26 UTC
Permalink
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?

Sounds plausible, perhaps you could write up such a page?

-n
Chris Barker
2014-01-21 17:46:41 UTC
Permalink
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?
Or maybe a NEP?

https://github.com/numpy/numpy/tree/master/doc/neps

sorry -- really swamped this week, so I won't be writing it...

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Chris Barker
2014-01-21 18:00:19 UTC
Permalink
A lot of good discussion here -- too much to comment on individually, but it
seems we can boil it down to a couple of somewhat distinct proposals:

1) a one-byte-per-char dtype:

This would provide compact, high-efficiency storage for common text
for scientific computing. It is analogous to a lower-precision numeric type
-- i.e. it could not store arbitrary unicode strings, only the subset that
is compatible with the suggested encoding.
Suggested encoding: latin-1
Other options:
  - ascii only.
  - settable to any one-byte-per-char encoding supported by python.
    I like this IFF it's pretty easy, but it may add significant
    complications (and overhead) for comparisons, etc....

NOTE: This is NOT a way to conflate bytes and text, and not a way to "go
back to the py2 mojibake hell" -- the goal here is to very clearly have
this be text data, with a clearly defined encoding. Which is why we
can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way
to conveniently and efficiently use numpy for text that is ANSI compatible.
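
(A one-line reminder of why latin-1 is attractive as the default here:
every one of the 256 byte values maps to a code point, so decoding can
never fail and arbitrary bytes round-trip unchanged:

    raw = bytes(bytearray(range(256)))          # all 256 byte values
    assert raw.decode('latin-1').encode('latin-1') == raw
)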

2) a utf-8 dtype:
NOTE: this CAN NOT be used in place of (1) above. It is not a
one-byte-per-char encoding, so it would not fit snugly into the numpy data
model. It would give compact memory use for mostly-ascii data, so that
would be nice.

3) a fully python-3-like (PEP 393) flexible unicode dtype.
This would get us the advantages of the new py3 unicode model -- compact
and efficient when it can be, but also supporting all of unicode. Honestly,
this seems like more work than it's worth to me, at least given the current
numpy dtype model -- maybe a nice addition to dynd. You can, after
all, simply use an object array with py3 strings in it. Though perhaps
using the py3 unicode type, but with a dtype that specifically links to
it rather than to a generic python object, would be a good compromise.


Hmm -- I guess despite what I said, I just wrote the starting point for a
NEP...

(or two, actually...)

-Chris
Post by Chris Barker
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?
Or maybe a NEP?
https://github.com/numpy/numpy/tree/master/doc/neps
sorry -- really swamped this week, so I won't be writing it...
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Charles R Harris
2014-01-21 18:14:28 UTC
Permalink
Post by Chris Barker
<snip>
Hmm -- I guess despite what I said, I just wrote the starting point for a
NEP...
Should also mention the reasons for adding a new data type.

<snip>

Chuck
David Goldsmith
2014-01-21 17:53:25 UTC
Permalink
Date: Tue, 21 Jan 2014 17:35:26 +0000
Subject: Re: [Numpy-discussion] A one-byte string dtype?
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?
Sounds plausible, perhaps you could write up such a page?
-n
I can certainly get one started (but I don't think I can faithfully
summarize all this thread's current content, so I apologize in advance for
leaving that undone).

DG
David Goldsmith
2014-01-21 18:34:38 UTC
Permalink
Date: Tue, 21 Jan 2014 09:53:25 -0800
Subject: Re: [Numpy-discussion] A one-byte string dtype?
Date: Tue, 21 Jan 2014 17:35:26 +0000
Subject: Re: [Numpy-discussion] A one-byte string dtype?
Post by David Goldsmith
Am I the only one who feels that this (very important--I'm being
sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?
Sounds plausible, perhaps you could write up such a page?
-n
I can certainly get one started (but I don't think I can faithfully
summarize all this thread's current content, so I apologize in advance for
leaving that undone).
DG
OK, I'm "lost" already: is there general agreement that this should "jump"
straight to one or more NEP's? If not (or if there should be a Wiki page
for it additionally), should such become part of the NumPy Wiki @
Sourceforge or the SciPy Wiki at the scipy.org site? If the latter, is
one's SciPy Wiki login the same as one's mailing list subscriber
maintenance login? I guess starting such a page is not as trivial as I had
assumed.

DG
Robert Kern
2014-01-21 19:20:12 UTC
Permalink
Post by David Goldsmith
Post by David Goldsmith
I can certainly get one started (but I don't think I can faithfully
summarize all this thread's current content, so I apologize in advance for
leaving that undone).
DG
OK, I'm "lost" already: is there general agreement that this should
"jump" straight to one or more NEP's? If not (or if there should be a Wiki
page for it additionally), should such become part of the NumPy Wiki @
Sourceforge or the SciPy Wiki at the scipy.org site? If the latter, is
one's SciPy Wiki login the same as one's mailing list subscriber
maintenance login? I guess starting such a page is not as trivial as I had
assumed.

The wiki is frozen. Please do not add anything to it. It plays no role in
our current development workflow. Drafting a NEP or two and iterating on
them would be the next step.

--
Robert Kern
David Goldsmith
2014-01-22 00:58:30 UTC
Permalink
Date: Tue, 21 Jan 2014 19:20:12 +0000
Subject: Re: [Numpy-discussion] A one-byte string dtype?
The wiki is frozen. Please do not add anything to it. It plays no role in
our current development workflow. Drafting a NEP or two and iterating on
them would be the next step.
--
Robert Kern
OK, well that's definitely beyond my level of expertise.

DG
Chris Barker - NOAA Federal
2014-01-22 01:46:42 UTC
Permalink
Post by David Goldsmith
OK, well that's definitely beyond my level of expertise.
Well, it's in github--now's as good a time as any to learn github
collaboration...

-Fork the numpy source.

-Create a new file in:
numpy/doc/neps

Point folks to it here so they can comment, etc.

At some point, issue a pull request, and it can get merged into the
main source for final polishing...

-Chris
Post by David Goldsmith
DG