Lluís Vilanova
2016-09-13 13:02:26 UTC
Hi! I'm taking a shot at issue #3184 [1], based on the observation that the
string dtype ('S') under python 3 uses byte arrays instead of unicode (the only
readable string type in python 3).
This brings two major problems:
* numpy code has to jump through hoops to open and read files as binary data in
order to load text into a bytes array, and it does not play well with users who
pass string (unicode) arguments
* the repr of these arrays shows strings as b'text' instead of 'text', which
breaks doctests of software built on numpy
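For reference, here is a minimal sketch of the current Python 3 behaviour behind both points. It assumes NumPy is installed and imported as np; the variable names are only illustrative and not taken from the issue.

```python
import numpy as np

# 'S' stores raw bytes, 'U' stores unicode code points.
a = np.array('foo', dtype='S')
u = np.array('foo', dtype='U')

# Under Python 3 the 'S' element comes back as bytes, the 'U' element as str.
print(repr(a.item()))  # b'foo'
print(repr(u.item()))  # 'foo'

# bytes and str never compare equal in Python 3, so the two cannot be mixed.
print(a.item() == u.item())  # False

# An explicit conversion is needed to get readable text out of an 'S' array.
converted = a.astype('U')
print(repr(converted.item()))  # 'foo'
```

The explicit astype('U') step is the kind of extra work that user code currently needs on Python 3 whenever it asked for dtype 'S'.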
What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and
NPY_UNICODE).
Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
implementation) will provide the best backwards compatibility, but is more
cumbersome to implement.
Is it acceptable to internally just translate all appearances of 'S'
(NPY_STRING) to 'U' (NPY_UNICODE) and get rid of one of the two when running in
python 3?
The main drawback I see is that dtype reprs would not always be as expected:
    # python 2
    >>> np.array('foo', dtype='S')
    array('foo', dtype='|S3')
    # python 3
    >>> np.array('foo', dtype='S')
    array('foo', dtype='<U3')
[1] https://github.com/numpy/numpy/issues/3184
Cheers,
Lluis