Erik Bray
2015-10-09 17:06:45 UTC
Hi all,
This is a post about strings. For the purposes of discussion I'll
assume Python 2, where "string" means a non-unicode string. However,
the discussion applies equally to unicode strings.
For a long time Numpy has had the following behavior: When creating an
array with a zero-width string dtype like 'S0', Numpy automatically
increases the width of the dtype to support the longest string in the
list:

>>> np.array(['abc', 'de'], dtype='S0')  # or equivalently dtype=str
array(['abc', 'de'],
      dtype='|S3')

But it *always* converts to a one character string dtype, at a
minimum:

>>> np.array(['', '', ''], dtype='S0')
array(['', '', ''],
      dtype='|S1')

Or even:

>>> np.zeros(3, dtype='S0')
array(['', '', ''],
      dtype='|S1')
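
The widening rule described above can be summarized with a small pure-Python model. This is only a sketch of the observable behavior, not Numpy's implementation; `resolve_width` is a hypothetical helper name invented for illustration.

```python
def resolve_width(requested_width, values=None):
    """Model the item width that results from requesting a string dtype.

    Hypothetical helper, not part of Numpy. `requested_width` is the N in
    'SN'; `values` is the optional list of strings the array is built from.
    """
    if requested_width != 0:
        # A non-zero width is taken as given.
        return requested_width
    # Zero width: grow to fit the longest input string, if there is any...
    longest = max((len(v) for v in values), default=0) if values else 0
    # ...but never end up with less than one character.
    return max(longest, 1)


widths = [
    resolve_width(0, ['abc', 'de']),   # like np.array(['abc', 'de'], dtype='S0')
    resolve_width(0, ['', '', '']),    # like np.array(['', '', ''], dtype='S0')
    resolve_width(0),                  # like np.zeros(3, dtype='S0')
]
```

Under this model the three examples resolve to widths 3, 1 and 1, matching the dtypes shown above.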
This behavior is encoded in PyArray_NewFromDescr_int [1] and is very
old (since 2006) [2]. This made sense at the time, certainly, since
the logic for handling zero-sized strides was shaky, but most issues
with that have long since been worked out.
However, there's an oversight associated with this: it *is*
possible to make a structured dtype that has a zero-width string as
one of its fields. But since even PyArray_View goes through
PyArray_NewFromDescr, viewing such a field results in a non-empty view
that contains garbage and allows writing garbage into a structured
array. This is documented in several issues, such as #473 [3].
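
One way to picture the problem, as a sketch using only the standard library rather than Numpy itself: assume a record whose first field is a zero-width string and whose second field is a 4-byte integer. The layout and names below are hypothetical. Because the string field occupies no bytes, a view that is silently widened to one byte necessarily overlaps whatever is stored next.

```python
import struct

# Hypothetical record layout: field "a" is a zero-width string at offset 0,
# so field "b" (a little-endian int32) also starts at offset 0.
FIELD_A_OFFSET, FIELD_A_WIDTH = 0, 0
FIELD_B_OFFSET = FIELD_A_OFFSET + FIELD_A_WIDTH

record = bytearray(struct.pack("<i", 0x01020304))
buf = memoryview(record)

# A faithful view of the zero-width field covers no bytes at all.
correct_view = buf[FIELD_A_OFFSET:FIELD_A_OFFSET + FIELD_A_WIDTH]

# If the width is bumped to 1, the view covers the first byte of "b".
widened_view = buf[FIELD_A_OFFSET:FIELD_A_OFFSET + 1]

garbage = bytes(widened_view)   # data that really belongs to "b"
widened_view[0] = 0xFF          # writing through the view modifies "b"
(b_after,) = struct.unpack_from("<i", record, FIELD_B_OFFSET)
```

Here `correct_view` is empty, while `widened_view` both exposes a byte of `b` and lets a write change its value (from 0x01020304 to 0x010203FF in this layout).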
A fix I've proposed in #6430 [4] takes a conservative approach of
keeping all the existing behavior *except* in the case of structured
arrays, where views with a dtype of 'S0' would be allowed. However, a
simpler fix would be to just remove the restriction on creating arrays
of dtype 'S0' in general (with my first example above being one
exception--given a list of strings it will still convert 'S0' to a
dtype that can hold the longest string in the list).
I think I would prefer the general fix, but it would be a slight
change in behavior for any code using PyArray_NewFromDescr to create
string arrays. But would anyone actually be negatively impacted by
such a change? It seems to me that any code that actually relies on the
existing behavior would smell fishy anyway.
Thanks,
Erik
[1] https://github.com/numpy/numpy/blob/8cb3ec6ab804f594daf553e53e7cf7478656bebd/numpy/core/src/multiarray/ctors.c#L940-L956
[2] https://github.com/numpy/numpy/commit/b022765aa487070866663b1707e4a2a0d8ead2e8
[3] https://github.com/numpy/numpy/issues/473
[4] https://github.com/numpy/numpy/pull/6430