Discussion:
[Numpy-discussion] String & unicode arrays vs text loading in python 3
Lluís Vilanova
2016-09-13 13:02:26 UTC
Hi! I'm giving a shot to issue #3184 [1], based on the observation that the
string dtype ('S') under python 3 uses byte arrays instead of unicode (the only
readable string type in python 3).

This brings two major problems:

* numpy code has to go through loops to open and read files as binary data to
load text into a bytes array, and does not play well with users providing
string (unicode) arguments

* the repr of these arrays shows strings as b'text' instead of 'text', which
breaks doctests of software built on numpy
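
The repr difference in the second point can be reproduced directly. This is a minimal sketch, assuming Python 3 and a reasonably recent NumPy; the variable names are illustrative only:

```python
import numpy as np

# On Python 3 the 'S' dtype stores raw bytes, so the repr carries a b'' prefix.
s_arr = np.array('foo', dtype='S')

# The 'U' dtype stores unicode code points, so the repr shows a plain string.
u_arr = np.array('foo', dtype='U')

print(repr(s_arr))  # the element is displayed as b'foo'
print(repr(u_arr))  # the element is displayed as 'foo'
```

A doctest written against the Python 2 output of the 'S' array therefore fails on Python 3 only because of the b'' prefix.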

What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and
NPY_UNICODE).

Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
implementation) will provide the best backwards compatibility, but is more
cumbersome to implement.

Is it acceptable to internally just translate all appearances of 'S'
(NPY_STRING) to 'U' (NPY_UNICODE) and get rid of one of the two when running in
python 3?

The main drawback I see is that dtype reprs would not always be as expected:

   # python 2
   >>> np.array('foo', dtype='S')
   array('foo',
         dtype='|S3')

   # python 3
   >>> np.array('foo', dtype='S')
   array('foo',
         dtype='<U3')


[1] https://github.com/numpy/numpy/issues/3184


Cheers,
Lluis
Sebastian Berg
2016-09-13 13:39:34 UTC
Post by Lluís Vilanova
Hi! I'm giving a shot to issue #3184 [1], based on the observation that the
string dtype ('S') under python 3 uses byte arrays instead of unicode (the only
readable string type in python 3).
* numpy code has to go through loops to open and read files as binary data to
  load text into a bytes array, and does not play well with users providing
  string (unicode) arguments
* the repr of these arrays shows strings as b'text' instead of 'text', which
  breaks doctests of software built on numpy
What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and
NPY_UNICODE).
Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
implementation) will provide the best backwards compatibility, but is more
cumbersome to implement.
I am not sure how that can be possible. Those types are fundamentally
different in how they store their data. String types use one byte per
character, unicode types will use 4 bytes per character. You can maybe
default to unicode in more cases in python 3, but you cannot make them
identical internally.
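
The storage difference can be checked from Python. This is a small sketch, assuming a recent NumPy; the explicit '<U3' byte order is chosen here only to make the raw bytes predictable:

```python
import numpy as np

s = np.array(['foo'], dtype='S3')    # one byte per character
u = np.array(['foo'], dtype='<U3')   # four bytes (UCS-4) per character

print(s.dtype.itemsize)  # 3
print(u.dtype.itemsize)  # 12

# The raw buffers differ accordingly.
print(s.tobytes())  # b'foo'
print(u.tobytes())  # b'f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00'
```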

What about giving `np.loadtxt` an encoding kwarg or something along
that line?
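
For reference, NumPy releases from 1.14 onward do accept an `encoding` keyword in `np.loadtxt`. The sketch below assumes such a version; the file name and contents are made up for illustration:

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file with non-ASCII text, written as UTF-8.
    path = os.path.join(tmp, "names.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("caf\u00e9\nna\u00efve\n")

    # Decode the file as UTF-8 and load it into a unicode ('U') array.
    names = np.loadtxt(path, dtype=str, encoding="utf-8")

print(names.dtype.kind)  # 'U'
print(names.tolist())
```

This loads text as 'U' without touching the 'S' dtype, so it addresses the file-loading problem but not the repr of existing 'S' arrays.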

- Sebastian
Post by Lluís Vilanova
Is it acceptable to internally just translate all appearances of 'S'
(NPY_STRING) to 'U' (NPY_UNICODE) and get rid of one of the two when running in
python 3?
   # python 2
   >>> np.array('foo', dtype='S')
   array('foo',
         dtype='|S3')
   # python 3
   >>> np.array('foo', dtype='S')
   array('foo',
         dtype='<U3')
[1] https://github.com/numpy/numpy/issues/3184
Cheers,
  Lluis
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Lluís Vilanova
2016-09-13 14:17:51 UTC
Post by Sebastian Berg
Post by Lluís Vilanova
Hi! I'm giving a shot to issue #3184 [1], based on the observation that the
string dtype ('S') under python 3 uses byte arrays instead of unicode (the only
readable string type in python 3).
* numpy code has to go through loops to open and read files as binary data to
  load text into a bytes array, and does not play well with users providing
  string (unicode) arguments
* the repr of these arrays shows strings as b'text' instead of 'text', which
  breaks doctests of software built on numpy
What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and
NPY_UNICODE).
Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
implementation) will provide the best backwards compatibility, but is more
cumbersome to implement.
I am not sure how that can be possible. Those types are fundamentally
different in how they store their data. String types use one byte per
character, unicode types will use 4 bytes per character. You can maybe
default to unicode in more cases in python 3, but you cannot make them
identical internally.
What about giving `np.loadtxt` an encoding kwarg or something along
that line?
np.loadtxt and np.genfromtxt are already quite complex in handling the implicit
conversion to byte-array imposed by numpy's port to python 3, and still fail in
some corner cases.

This conversion is also inherently surprising to users, since what I'd get in
python 2:

   >>> np.array('foo', dtype='S')
   array('foo', dtype='|S3')

differs from what I'd get in python 3:

   >>> np.array('foo', dtype='S')
   array(b'foo', dtype='|S3')

It's not only surprising, but also breaks absolutely all the doctests I have
with arrays that contain strings (it even breaks numpy's examples).

That's why adding an encoding kwarg (better than the current auto-magical
conversion to binary) won't solve my problems. The 'S' dtype will still be a
binary array, which shows up in the repr.


Since all strings in python 3 are unicode, I'm expecting "string" and "unicode"
arrays in numpy to be the same *and* show up as strings (e.g., 'foo' instead of
b'foo').

Yes, the difference between these types is in how they store their data. What
I'm proposing is to always use unicode in python 3.
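
With the dtypes as they currently stand, a user who wants unicode has to convert explicitly. A minimal sketch, assuming ASCII-only contents and a recent NumPy:

```python
import numpy as np

raw = np.array([b'foo', b'spam'], dtype='S')   # dtype 'S4', 4 bytes per item

# Casting 'S' to 'U' decodes each item and widens storage to 4 bytes per character.
text = raw.astype('U')

print(text.dtype.kind)                            # 'U'
print(raw.dtype.itemsize, text.dtype.itemsize)    # 4 16
print(text.tolist())                              # ['foo', 'spam']
```

The fourfold growth in item size is the storage change that an automatic 'S' to 'U' translation would impose on every such array.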

If necessary, we can add a new dtype that lets users store raw byte arrays. By
making them explicitly byte arrays, that shouldn't raise any new surprises.


I already started doing the changes I described (as a result from the discussion
in #3184 [1]), but wanted to double-check with the list before getting deeper
into it.

[1] https://github.com/numpy/numpy/issues/3184


Cheers,
Lluis
Lluís Vilanova
2016-09-13 14:21:57 UTC
Post by Sebastian Berg
Post by Lluís Vilanova
Hi! I'm giving a shot to issue #3184 [1], based on the observation that the
string dtype ('S') under python 3 uses byte arrays instead of unicode (the only
readable string type in python 3).
* numpy code has to go through loops to open and read files as binary data to
  load text into a bytes array, and does not play well with users providing
  string (unicode) arguments
* the repr of these arrays shows strings as b'text' instead of 'text', which
  breaks doctests of software built on numpy
What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and
NPY_UNICODE).
Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
implementation) will provide the best backwards compatibility, but is more
cumbersome to implement.
I am not sure how that can be possible. Those types are fundamentally
different in how they store their data. String types use one byte per
character, unicode types will use 4 bytes per character. You can maybe
default to unicode in more cases in python 3, but you cannot make them
identical internally.
BTW, by identical I mean having two externally visible types, but a common
implementation in python 3 (that of NPY_UNICODE).

The as-sane but not backwards-compatible option (I'm asking if this is
acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
implementation, and make 'U' (and np.unicode_) an alias for 'S' (and
np.string_).


Cheers,
Lluis
Chris Barker
2016-09-13 16:55:38 UTC
We had a big long discussion about this on this list a while back (maybe 2
yrs ago???) please search the archives to find it. Though I'm pretty sure
that we never did come to a conclusion. I think it started with wanting
better support for unicode in loadtxt and the like, and ended up delving
into other encodings for the 'U' dtype, and maybe a single byte string
dtype (latin-1), or maybe a variable-size unicode object like Py3's, or...

However, it is absolutely a non-starter to change the binary representation
of the 'S' type in any version of numpy. Due to the legacy of py2 (and,
indeed, most computing environments) 'S' is a single byte string
representation. And the binary representation is often really key to numpy
use.
Period, end of story.

And that maps to a py2 string and py3 bytes object.
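
The reliance on the binary layout can be illustrated as follows. This is a sketch with made-up data, assuming a recent NumPy; it shows that an 'S' array is a packed, fixed-width byte buffer that other code can read back without any decoding step:

```python
import numpy as np

codes = np.array([b'ab', b'cd', b'e'], dtype='S2')

# Items are stored back to back, NUL-padded to the fixed width of 2 bytes.
buf = codes.tobytes()
print(buf)  # b'abcde\x00'

# The same memory can be viewed as plain unsigned bytes ...
as_uint8 = codes.view(np.uint8).reshape(-1, 2)
print(as_uint8.tolist())  # [[97, 98], [99, 100], [101, 0]]

# ... and an external buffer with this layout maps straight back to 'S2'.
restored = np.frombuffer(buf, dtype='S2')
print(restored.tolist())  # [b'ab', b'cd', b'e']
```

Memory-mapped files and C extensions that assume one byte per character would break if 'S' silently switched to the 4-byte 'U' layout.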

py2 does, of course, have a Unicode object as well. If you want your code
(and doctests, and ...) to be compatible, then you should probably go to
Unicode strings everywhere. py3 now supports the u'string' no-op literal to
make this easier.

(though I guess the __repr__ won't tack on that 'u', which is going to be a
problem for docstrings).

Note also that py3 has added more and more "string-like" support to the
bytes object, so it's not too bad to go bytes-only.

-CHB
Post by Lluís Vilanova
Post by Sebastian Berg
Post by Lluís Vilanova
Hi! I'm giving a shot to issue #3184 [1], based on the observation that the
string dtype ('S') under python 3 uses byte arrays instead of unicode (the only
readable string type in python 3).
* numpy code has to go through loops to open and read files as binary data to
load text into a bytes array, and does not play well with users providing
string (unicode) arguments
* the repr of these arrays shows strings as b'text' instead of 'text', which
breaks doctests of software built on numpy
What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and
NPY_UNICODE).
Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
implementation) will provide the best backwards compatibility, but is more
cumbersome to implement.
I am not sure how that can be possible. Those types are fundamentally
different in how they store their data. String types use one byte per
character, unicode types will use 4 bytes per character. You can maybe
default to unicode in more cases in python 3, but you cannot make them
identical internally.
BTW, by identical I mean having two externally visible types, but a common
implementation in python 3 (that of NPY_UNICODE).
The as-sane but not backwards-compatible option (I'm asking if this is
acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
implementation, and make 'U' (and np.unicode_) an alias for 'S' (and
np.string_).
Cheers,
Lluis
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Lluís Vilanova
2016-09-13 18:05:51 UTC
Post by Chris Barker
We had a big long discussion about this on this list a while back (maybe 2 yrs
ago???) please search the archives to find it. Though I'm pretty sure that we
never did come to a conclusion. I think it started with wanting better support
for unicode in loadtxt and the like, and ended up delving into other encodings
for the 'U' dtype, and maybe a single byte string dtype (latin-1), or maybe a
variable-size unicode object like Py3's, or...
However, it is absolutely a non-starter to change the binary representation of
the 'S' type in any version of numpy. Due to the legacy of py2 (and, indeed,
most computing environments) 'S' is a single byte string representation. And the
binary representation is often really key to numpy use.
Period, end of story.

Great, that's the type of info I wanted to get before going forward. I guess
there's code relying on the binary representation of 'S' to do mmap's or access
the array's raw contents. Is that right?

Post by Chris Barker
And that maps to a py2 string and py3 bytes object.
py2 does, of course, have a Unicode object as well. If you want your code (and
doctests, and ...) to be compatible, then you should probably go to Unicode
strings everywhere. py3 now supports the u'string' no-op literal to make this
easier.
(though I guess the __repr__ won't tack on that 'u', which is going to be a
problem for docstrings).

That's exactly the problem. Doing all examples and doctests with 'U' instead of
'S' will break it for py2 instead of py3.

Post by Chris Barker
Note also that py3 has added more and more "string-like" support to the bytes
object, so it's not too bad to go bytes-only.

There is a fundamental semantic difference between a string and a byte array,
that's the core of the problem.


Here's an alternative that only handles the repr. Separate fixes would be needed
for loadtxt's and genfromtxt's problems (Sebastian Berg briefly pointed at that,
but I'd like to know more).

Whenever we repr an array using 'S', we can instead show a unicode in py3. That
keeps the binary representation, but will always show the expected result to
users, and it's only a handful of lines added to dump_data().
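
The effect of such a repr-only change can be approximated from Python without touching the stored bytes. The sketch below is not the proposed dump_data() patch; it is a stand-in that uses the `formatter` argument of `np.array2string`, and the choice of latin-1 for display is an assumption made here:

```python
import numpy as np

arr = np.array([b'foo', b'bar'], dtype='S3')

# Display-only conversion: decode each bytes item just for printing.
shown = np.array2string(
    arr,
    formatter={'numpystr': lambda item: repr(item.decode('latin-1'))},
)
print(shown)

# The underlying data and the scalars obtained by indexing are unchanged.
print(arr.tobytes())  # b'foobar'
```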

If needed, I could easily add a bytes array to make the alternative explicit
(where py3 would repr the contents as b'foo').

This would only leave the less-common paths inconsistent across python versions,
which should not be a problem for most examples/doctests:

* A 'U' array will show u'foo' in py2 and 'foo' in py3.
* The new binary array will show 'foo' in py2 and b'foo' in py3 (that could also
be patched on the repr code).
* A 'O' array will not be able to do any meaningful repr conversions.


A more complex alternative (and actually closer to what I'm proposing) is to
modify numpy in py3 to restrict 'S' to using 8-bit points in a unicode
string. It would have the binary compatibility, while being a unicode string in
practice.
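
Restricting a unicode string to 8-bit code points amounts to a latin-1 interpretation of the bytes, which is lossless in both directions. A short sketch of that property, assuming a recent NumPy for the array part; the sample data is illustrative:

```python
import numpy as np

# Every byte value 0..255 maps to the code point with the same number.
raw = bytes(range(256))
text = raw.decode('latin-1')
print([ord(ch) for ch in text] == list(raw))  # True
print(text.encode('latin-1') == raw)          # True

# The same idea applied to an 'S' array: the bytes stay as stored,
# and decoding them as latin-1 yields a unicode view of the contents.
arr = np.array([b'caf\xe9'], dtype='S4')
decoded = np.char.decode(arr, 'latin-1')
print(decoded.tolist())
```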


Cheers,
Lluis
Stephan Hoyer
2016-09-13 18:21:21 UTC
Post by Lluís Vilanova
Whenever we repr an array using 'S', we can instead show a unicode in py3. That
keeps the binary representation, but will always show the expected result to
users, and it's only a handful of lines added to dump_data().
If needed, I could easily add a bytes array to make the alternative explicit
(where py3 would repr the contents as b'foo').
This would only leave the less-common paths inconsistent across python versions,
* A 'U' array will show u'foo' in py2 and 'foo' in py3.
* The new binary array will show 'foo' in py2 and b'foo' in py3 (that could also
be patched on the repr code).
* A 'O' array will not be able to do any meaningful repr conversions.
A more complex alternative (and actually closer to what I'm proposing) is to
modify numpy in py3 to restrict 'S' to using 8-bit points in a unicode
string. It would have the binary compatibility, while being a unicode string in
practice.
I'm afraid these are both also non-starters at this point. NumPy's string
dtype corresponds to bytes on Python 3, and you can use it to store
arbitrary binary values. Would it really be an improvement to change the
repr, if the scalar value resulting from indexing is still bytes?
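
The mismatch raised here is easy to see: whatever the array repr shows, indexing an 'S' array on Python 3 returns a bytes scalar. A minimal sketch, assuming a recent NumPy:

```python
import numpy as np

arr = np.array(['foo', 'bar'], dtype='S')
item = arr[0]

print(isinstance(item, bytes))  # True: the scalar is a bytes subclass
print(isinstance(item, str))    # False

# Turning it into text is an explicit, separate step.
as_text = item.decode('ascii')
print(as_text == 'foo')         # True
```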

The sanest approach is probably a new dtype for one-byte strings. We talked
about this a few years ago, but nobody has implemented it yet:
http://numpy-discussion.scipy.narkive.com/3nqDu3Zk/a-one-byte-string-dtype

(normally I would link to the archives on scipy.org, but the certificate
for HTTPS has expired so you see a big error message right now...)
Lluís Vilanova
2016-09-14 14:36:48 UTC
Post by Lluís Vilanova
Whenever we repr an array using 'S', we can instead show a unicode in py3. That
keeps the binary representation, but will always show the expected result to
users, and it's only a handful of lines added to dump_data().
If needed, I could easily add a bytes array to make the alternative explicit
(where py3 would repr the contents as b'foo').
This would only leave the less-common paths inconsistent across python versions,
* A 'U' array will show u'foo' in py2 and 'foo' in py3.
* The new binary array will show 'foo' in py2 and b'foo' in py3 (that could also
be patched on the repr code).
* A 'O' array will not be able to do any meaningful repr conversions.
A more complex alternative (and actually closer to what I'm proposing) is to
modify numpy in py3 to restrict 'S' to using 8-bit points in a unicode
string. It would have the binary compatibility, while being a unicode string in
practice.
Post by Stephan Hoyer
I'm afraid these are both also non-starters at this point. NumPy's string dtype
corresponds to bytes on Python 3, and you can use it to store arbitrary binary
values. Would it really be an improvement to change the repr, if the scalar
value resulting from indexing is still bytes?
The sanest approach is probably a new dtype for one-byte strings. We talked
about this a few years ago, but nobody has implemented it yet:
http://numpy-discussion.scipy.narkive.com/3nqDu3Zk/a-one-byte-string-dtype

From the ref manual, 'S' is a "(byte-)string", which (to me) should never have
non-printable characters. That's why I'm advocating "S" to be your proposed
one-byte strings, while a new "B" dtype is needed for arbitrary binary arrays.
This has the added benefit of making docstrings correct on both py2 and py3.

But I won't keep pushing for this; I understand the backwards-compatibility
issues mentioned before. Maybe "S" should just be deprecated, "s" (as the
one-byte strings) and "B" added instead, and all docstrings and tests changed to
"s".

In any case, after reading the whole thread, it's not clear to me what's the
consensus on what the solution should be (Chris's summary is the closest thing
to that).

Cheers,
Lluis

Chris Barker
2016-09-13 20:44:53 UTC
Post by Lluís Vilanova
Great, that's the type of info I wanted to get before going forward. I guess
there's code relying on the binary representation of 'S' to do mmap's or access
the array's raw contents. Is that right?
yes, there is a LOT of code, most of it third party, that relies on
particular binary representations of the numpy dtypes.

Post by Lluís Vilanova
There is a fundamental semantic difference between a string and a byte array,
that's the core of the problem.
well yes. but they were mingled in py2, and the 'S' dtype is essentially a
py2 string. But in py3, it maps more closely with bytes than string --
though yes, not exactly either :-(

Post by Lluís Vilanova
Here's an alternative that only handles the repr.
Whenever we repr an array using 'S', we can instead show a unicode in py3. That
keeps the binary representation, but will always show the expected result to
users, and it's only a handful of lines added to dump_data().
This would probably be more confusing than helpful -- if an 'S' object
converts to a bytes object, then its repr should show that.

-CHB