On Thu, Jan 23, 2014 at 10:41 AM, <***@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin
> <***@gmail.com> wrote:
>> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
>>> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <***@gmail.com> wrote:
>>>
>>> >
>>> > It's not safe to stop removing the null bytes. This is how numpy determines
>>> > the length of the strings in a dtype='S' array. The strings are not
>>> > "fixed-width" but rather have a maximum width.
>>>
>>> Exactly--but folks have told us on this list that they want (and are)
>>> using the 'S' style for arbitrary bytes, NOT for text. In which case
>>> you wouldn't want to remove null bytes. This is more evidence that 'S'
>>> was designed to handle c-style one-byte-per-char strings, and NOT
>>> arbitrary bytes, and thus not to map directly to the py2 string type
>>> (you can store null bytes in a py2 string).
>>
>> You can store null bytes in a Py2 string but you normally wouldn't if it was
>> supposed to be text.
>>
>>>
>>> Which brings me back to my original proposal: properly map the 'S'
>>> type to the py3 data model, and maybe add some kind of fixed width
>>> bytes style if there is a use case for that. I still have no idea what
>>> the use case might be.
>>>
>>
>> There would definitely be a use case for a fixed-byte-width
>> bytes-representing-text dtype in record arrays to read from a binary file:
>>
>> dt = np.dtype([
>>     ('name', '|b8:utf-8'),
>>     ('param1', '<i4'),
>>     ('param2', '<i4')
>>     ...
>> ])
>>
>> with open('binaryfile', 'rb') as fin:
>>     a = np.fromfile(fin, dtype=dt)
>>
>> You could also use this for ASCII if desired. I don't think it really matters
>> that utf-8 uses variable width as long as a too long byte string throws an
>> error (and does not truncate).
>>
>> For non 8-bit encodings there would have to be some way to handle endianness
>> without a BOM, but otherwise I think that it's always possible to pad with zero
>> *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip
>> null *characters* after decoding. i.e.:
>>
>> $ cat tmp.py
>> import encodings
>>
>> def test_encoding(s1, enc):
>>     b = s1.encode(enc).ljust(32, b'\0')
>>     s2 = b.decode(enc)
>>     index = s2.find('\0')
>>     if index != -1:
>>         s2 = s2[:index]
>>     assert s1 == s2, enc
>>
>> encodings_set = set(encodings.aliases.aliases.values())
>>
>> for N, enc in enumerate(encodings_set):
>>     try:
>>         test_encoding('qwe', enc)
>>     except LookupError:
>>         pass
>>
>> print('Tested %d encodings without error' % N)
>> $ python3 tmp.py
>> Tested 88 encodings without error
>>
>>> > If the trailing nulls are not removed then you would get:
>>> >
>>> >>>> a[0]
>>> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> >>>> len(a[0])
>>> > 9
>>> >
>>> > And I'm sure that someone would get upset about that.
>>>
>>> Only if they are using it for text, which you "should not" do with py3.
>>
>> But people definitely are using it for text on Python 3. It should be
>> deprecated in favour of something new but breaking it is just gratuitous.
>> Numpy doesn't have the option to make a clean break with Python 3 precisely
>> because it needs to straddle 2.x and 3.x while numpy-based applications are
>> ported to 3.x.
>>
>>> > Some more oddities:
>>> >
>>> >>>> a[0] = 1
>>> >>>> a
>>> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
>>> > dtype='|S9')
>>> >>>> a[0] = None
>>> >>>> a
>>> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
>>> > dtype='|S9')
>>>
>>> More evidence that this is a text type.....
>>
>> And the big one:
>>
>> $ python3
>> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
>> [GCC 4.6.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> import numpy as np
>>>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>>>>> a
>> array([b'asd', b'zxc'],
>> dtype='|S3')
>>>>> a[0] = 'qwer' # Unicode string again
>>>>> a
>> array([b'qwe', b'zxc'],
>> dtype='|S3')
>>>>> a[0] = 'Õscar'
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>
> This looks mostly like casting rules to me, and they appear to be
> ASCII-based rather than using an arbitrary encoding.
>
>>>> a = np.array(['asd', 'zxc'], dtype='S')
>>>> b = a.astype('U')
>>>> b[0] = 'Õscar'
>>>> a[0] = 'Õscar'
> Traceback (most recent call last):
> File "<pyshell#17>", line 1, in <module>
> a[0] = 'Õscar'
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
>>>> b
> array(['Õsc', 'zxc'],
> dtype='<U3')
>>>> b.astype('S')
> Traceback (most recent call last):
> File "<pyshell#19>", line 1, in <module>
> b.astype('S')
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
>>>> b.view('S4')
> array([b'\xd5', b's', b'c', b'z', b'x', b'c'],
> dtype='|S4')
>
>>>> a.astype('U').astype('S')
> array([b'asd', b'zxc'],
> dtype='|S3')
Another curious example: encoding to utf-8 versus latin-1 bytes
>>> b
array(['Õsc', 'zxc'],
dtype='<U3')
>>> b[0].encode('utf8')
b'\xc3\x95sc'
>>> b[0].encode('latin1')
b'\xd5sc'
>>> b.astype('S')
Traceback (most recent call last):
File "<pyshell#40>", line 1, in <module>
b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> c = b.view('S4').astype('S1').view('S3')
>>> c
array([b'\xd5sc', b'zxc'],
dtype='|S3')
>>> c[0].decode('latin1')
'Õsc'
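
For what it's worth, the view trick above can be mimicked without numpy. This is only a sketch, assuming the '<U3' elements are stored as 4-byte little-endian code points (as the dtype suggests) and that every character is below U+0100; the variable names are made up for illustration.

```python
# Pure-Python sketch of the view/astype trick; no numpy required.
text = 'Õsc'

# Assumption: a '<U3' element stores each character as a 4-byte
# little-endian code point, which is what 'utf-32-le' produces.
ucs4 = text.encode('utf-32-le')

# Keeping only the first byte of every 4-byte group mirrors
# view('S4').astype('S1').view('S3') for code points below 256.
low_bytes = ucs4[::4]

# latin-1 maps code points 0..255 to the identical byte values,
# so the result matches a direct latin-1 encode.
assert low_bytes == text.encode('latin1')

# Every possible byte value survives a latin-1 decode/encode round trip.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin1').encode('latin1') == all_bytes
```

So the bytes that come out of the view are latin-1 only because latin-1 happens to coincide with the low byte of each code point; characters at U+0100 and above would be silently mangled by this shortcut.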
--------
The original numpy py3 conversion used latin-1 as the default.
(It's still used in statsmodels, and I haven't looked at the structure
under the common py2-3 codebase.)
if sys.version_info[0] >= 3:
    import io
    bytes = bytes
    unicode = str
    asunicode = str
    def asbytes(s):
        if isinstance(s, bytes):
            return s
        return s.encode('latin1')
    def asstr(s):
        if isinstance(s, str):
            return s
        return s.decode('latin1')
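
A minimal usage sketch of those two helpers, Python 3 only. The function bodies are copied from the snippet above; the sample strings and the `failed` flag are just for illustration.

```python
def asbytes(s):
    # Bytes pass through untouched; str is encoded as latin-1.
    if isinstance(s, bytes):
        return s
    return s.encode('latin1')

def asstr(s):
    # str passes through untouched; bytes are decoded as latin-1.
    if isinstance(s, str):
        return s
    return s.decode('latin1')

# Characters below U+0100 round-trip, unlike with the ascii codec.
raw = asbytes('Õsc')
assert raw == b'\xd5sc'
assert asstr(raw) == 'Õsc'

# Anything outside latin-1 still raises, e.g. the euro sign (U+20AC).
try:
    asbytes('€')
    failed = False
except UnicodeEncodeError:
    failed = True
assert failed
```

So with latin-1 as the default, the 'Õscar' assignment would have gone through, but it is still not a general-purpose text encoding.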
--------------
Josef
>
> Josef
>
>>
>> The analogous behaviour was very deliberately removed from Python 3:
>>
>>>>> a[0] == 'qwe'
>> False
>>>>> a[0] == b'qwe'
>> True
>>
>>
>> Oscar
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-***@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion