[Numpy-discussion] Proposal to support __format__

Gustav Larsson

2017-02-14 23:34:32 UTC

Hi everyone!

I want to discuss adding support for __format__ in ndarray, and I am willing to
contribute code once consensus has been reached. It was briefly
discussed on GitHub two years ago (https://github.com/numpy/numpy/issues/5543),
and I will reiterate some of the points made there and build on them. I
have been thinking about this a lot in the last few weeks, and my thoughts turned
into a fairly fleshed-out proposal. The discussion should probably start at a
higher level, so I apologize if the level of detail is inappropriate at this
point in time.

I decided on a gist, since the email got too long and clear formatting helps:

https://gist.github.com/gustavla/2783543be1204d2b5d368f6a1fb4d069

OK, those are my thoughts for now. What do you think?

Cheers,
Gustav

Stephan Hoyer

2017-02-14 23:59:49 UTC

Post by Gustav Larsson
Hi everyone!
I want to discuss adding support for __format__ in ndarray and I am willing to
contribute code-wise once consensus has been reached. It was briefly
discussed on GitHub two years ago (https://github.com/numpy/numpy/issues/5543)
and I will re-iterate some of the points made there and build off of that. I
have been thinking about this a lot in the last few weeks and my thoughts turned
into a fairly fleshed out proposal. The discussion should probably start more
high-level, so I apologize if the level of detail is inappropriate at this
point in time.
https://gist.github.com/gustavla/2783543be1204d2b5d368f6a1fb4d069

This is a lovely and clearly written document. Thanks for taking the time
to think through this!

I encourage you to submit it as a pull request to the NumPy repository as a
"NumPy Enhancement Proposal", either now or after we've discussed it:
https://docs.scipy.org/doc/numpy-dev/neps/index.html

Post by Gustav Larsson
OK, those are my thoughts for now. What do you think?

Two thoughts for now:
1. For object arrays, I would default to calling format on each element
(your "map principle") rather than raising an error.
2. It's absolutely OK to leave functionality unimplemented and not
immediately nail down every edge case. As a default, I would suggest
raising errors whenever non-empty type specifications are provided rather
than raising errors in every case.

Gustav Larsson

2017-02-15 01:35:23 UTC

Post by Stephan Hoyer
I encourage you to submit it as a pull request to the NumPy repository as a
"NumPy Enhancement Proposal", either now or after we've discussed it:
https://docs.scipy.org/doc/numpy-dev/neps/index.html

OK, I will let it go through one iteration of comments and then I'll submit
one. Thanks!

Post by Stephan Hoyer
1. For object arrays, I would default to calling format on each element
(your "map principle") rather than raising an error.

I'm glad you brought this up as a possibility. It might be possible, but
there are some issues that would need to be resolved. First of all, {} and
{:} always work and give the same result they currently do, so this only
affects the situation where the format spec is non-empty. I think there are
two main issues:

Heterogeneity: Let's say we have x = np.array([12.3, True, 'string',
Foo(10)], dtype=object). Then, presumably, {:.1f} should cause a
ValueError, since the string does not support format type 'f'. This could
create a lot of ValueError land mines for the user. For x[:2], however, it
should work and produce something like [12.3 1.0]. Note that the "map
principle" still can't be strictly true. Let's say we have an array with
type object and mostly string-like elements. Then {:5s} will still not
produce exactly {:5s} element-wise, because the string representations need
to be repr-based inside the array (otherwise it could break for newlines
and things like that and produce spaces that make the boundary between
elements ambiguous). This brings me to the next issue.
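
To make the heterogeneity concern concrete, here is a minimal sketch of the "map principle" in plain Python. It is not NumPy's implementation: `format_elements` is a hypothetical helper, `Foo` is a stand-in class, and a plain list stands in for the object array (iterating a 1-D object array yields the same Python objects).

```python
class Foo:
    """Stand-in for an arbitrary user class (hypothetical)."""

    def __init__(self, value):
        self.value = value


def format_elements(items, spec):
    """Apply one format spec to every element, with no validation of its own."""
    return [format(item, spec) for item in items]


# A plain list stands in for np.array([...], dtype=object).
x = [12.3, True, "string", Foo(10)]

# The first two elements accept '.1f' (bool is a subclass of int).
print(format_elements(x[:2], ".1f"))  # ['12.3', '1.0']

# The string element is a land mine: str rejects the 'f' format type.
try:
    format_elements(x, ".1f")
except ValueError as exc:
    print("ValueError:", exc)
```

Whether the same spec succeeds therefore depends on which elements happen to be in the slice, which is the "land mine" behavior described above.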

Str vs. repr: If we have a homogeneous object array whose elements are of
type Foo, and Foo implements __format__, it would be great if this worked.
However, one issue is that Foo.__format__ might return things like newlines
(or spaces), which would break (or confuse) the printed output (unless it is
made incredibly smart to support "vertical alignment"). This issue is
essentially the same as for strings in general, which is why they use repr
instead. I can think of two solutions: 1) Try to sanitize (or repr-ify) the
string returned by __format__ somehow; 2) Put the responsibility on the user
and simply let the rendering break if Foo.__format__ does not play well.
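
The two options can be compared side by side. This is only a sketch under stated assumptions: `Foo` is a hypothetical element type whose __format__ deliberately emits a newline, and `render` is a hypothetical helper that joins the pieces the way a one-row array print might.

```python
class Foo:
    """Hypothetical element type whose __format__ emits a newline."""

    def __init__(self, value):
        self.value = value

    def __format__(self, spec):
        return "Foo\n" + format(self.value, spec)


def render(items, spec, sanitize):
    """Join formatted elements into a bracketed, space-separated row."""
    parts = [format(item, spec) for item in items]
    if sanitize:
        # Option 1: repr-ify each piece so newlines show up escaped.
        parts = [repr(part) for part in parts]
    # Option 2 (sanitize=False): emit the pieces exactly as returned.
    return "[" + " ".join(parts) + "]"


items = [Foo(1.5), Foo(2.5)]
print(render(items, ".1f", sanitize=True))   # stays on one line
print(render(items, ".1f", sanitize=False))  # output spans several lines
```

With option 1 the row stays on one line at the cost of quotes and escapes; with option 2 the layout is whatever Foo.__format__ produces.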

Post by Stephan Hoyer
2. It's absolutely OK to leave functionality unimplemented and not
immediately nail down every edge case. As a default, I would suggest
raising errors whenever non-empty type specifications are provided rather
than raising errors in every case.

I agree.

Gustav

Stephan Hoyer

2017-02-15 01:55:21 UTC

Post by Stephan Hoyer
1. For object arrays, I would default to calling format on each element
(your "map principle") rather than raising an error.

Things will absolutely break if you try to do complex operations on
inhomogeneously typed arrays. I would put the onus on the user in such a
case.

Post by Gustav Larsson
For x[:2] however it should work and produce something like [12.3 1.0].
Note, the "map principle" still can't be strictly true. Let's say we have
an array with type object and mostly string-like elements. Then {:5s} will
still not produce exactly {:5s} element-wise, because the string
representations need to be repr-based inside the array (otherwise it could
break for newlines and things like that and produce spaces that make the
boundary between elements ambiguous). This brings me to the next issue.

Indeed, this will be a departure from the behavior without a format string,
which just uses repr. In my mind, this is the strongest argument against
using the map principle here, because there is a discontinuous shift
between providing and not providing a format string.
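
The discontinuity can be seen with plain strings. A sketch, assuming the empty spec keeps today's repr-based rendering while a non-empty spec is mapped onto each element; `render_row` is a hypothetical helper, not a NumPy function.

```python
def render_row(items, spec=""):
    """Render a row under the two regimes being discussed."""
    if spec == "":
        # No format spec: repr-based, as array printing does today.
        parts = [repr(item) for item in items]
    else:
        # Non-empty spec: the "map principle", which formats each element directly.
        parts = [format(item, spec) for item in items]
    return "[" + " ".join(parts) + "]"


words = ["a b", "c"]
print(render_row(words))        # ['a b' 'c']
print(render_row(words, "5s"))  # quotes vanish; element boundaries blur
```

Going from no spec to "5s" drops the quotes, so the space inside 'a b' can no longer be told apart from the separator between elements.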

Post by Gustav Larsson
Str vs. repr: If we have a homogeneous object array whose elements are of
type Foo, and Foo implements __format__, it would be great if this worked.
However, one issue is that Foo.__format__ might return things like newlines
(or spaces), which would break (or confuse) the printed output (unless it is
made incredibly smart to support "vertical alignment"). This issue is
essentially the same as for strings in general, which is why they use repr
instead. I can think of two solutions: 1) Try to sanitize (or repr-ify) the
string returned by __format__ somehow; 2) Put the responsibility on the user
and simply let the rendering break if Foo.__format__ does not play well.

I wouldn't do anything fancy here to worry about line breaks. It's
basically impossible to get this right for edge cases, so I would certainly
put the responsibility on the user.

On another note, about Python 2 vs 3: I would definitely take the approach
of copying the Python 3 behavior on all versions of NumPy (when feasible)
and not being too concerned about compatibility with format on Python 2.
The future is Python 3.

Marten van Kerkwijk

2017-02-15 16:03:51 UTC

Hi Gustav,

This is great! A few quick comments (mostly echoing Stephan's).

1. You basically have a NEP already! Making a PR from it would allow
line-by-line comments, which would help!

2. Don't worry about supporting Python 2 specifics; just try to ensure
it doesn't break. I would not say more about it!

3. On `set_printoptions` -- ideally, it will become possible to use
this as a context manager (i.e., `with set_printoptions(...)`). It might make
sense to have an `override_format` keyword argument to it.

4. Otherwise, my main suggestion is to start small with the more
obvious ones, and not worry too much about format validation, but
rather about getting the simple ones to work well (e.g., for an object
array, just apply the format given; if it doesn't work, it will error
out on its own, which is OK).

5. One bit of detail: the "g" one does confuse me.

All the best,

Marten

Gustav Larsson

2017-02-15 21:48:34 UTC

Post by Marten van Kerkwijk
This is great!

Thanks! Glad to be met by enthusiasm about this.

Post by Marten van Kerkwijk
1. You basically have a NEP already! Making a PR from it would allow
line-by-line comments, which would help!

I will do this soon.

Post by Marten van Kerkwijk
2. Don't worry about supporting Python 2 specifics; just try to ensure
it doesn't break. I would not say more about it!

Sounds good to me.

Post by Marten van Kerkwijk
3. On `set_printoptions` -- ideally, it will become possible to use
this as a context manager (i.e., `with set_printoptions(...)`). It might make
sense to have an `override_format` keyword argument to it.

Having a `with np.printoptions(...)` context manager is a great idea. It
does sound orthogonal to __format__ though, so it could be addressed
separately.
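
For reference, a context manager of this kind exists in later NumPy releases (1.15 and newer) as `np.printoptions`. A short sketch of how it scopes a precision change, independent of any __format__ support; the array values are made up for illustration.

```python
import numpy as np

a = np.array([1.23456, 2.5])

with np.printoptions(precision=2):
    inside = str(a)   # reduced precision applies only within the block

outside = str(a)      # the previous print options are restored on exit

print(inside)
print(outside)
```

Because the options are restored when the block exits, this avoids the global side effect of calling np.set_printoptions directly.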

Post by Marten van Kerkwijk
4. Otherwise, my main suggestion is to start small with the more
obvious ones, and not worry too much about format validation, but
rather about getting the simple ones to work well (e.g., for an object
array, just apply the format given; if it doesn't work, it will error
out on its own, which is OK).

Sounds good to me. I was thinking of approaching the implementation by
writing unit tests first and grouping them into different priority tiers. That
way, the unit tests can go through another review before implementation
gets going. I agree that __format__ doesn't have to validate the format spec
if a ValueError is going to be raised anyway by sub-calls.

Post by Marten van Kerkwijk
5. One bit of detail: the "g" one does confuse me.

I will rewrite this a bit to make it clearer. Basically, the 'g' behavior,
with its mix of 'e'/'f' depending on whether max/min > 1000, comes entirely
from the current numpy behavior, so it is not something I had much creative
input on, although as it is written right now it may seem so. That is, the
goal is to have {:} == {:g} for float arrays, analogous to how {:} == {:g}
for built-in floats. Then, if the user departs a bit, like {:.2g}, it will
simply be identical to calling np.set_printoptions(precision=2) first.
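
The analogy with built-in floats can be checked directly. A quick sketch using only the standard library; `compare` is a hypothetical helper. Note that for built-in floats the empty spec is close to, but not identical with, 'g': it keeps at least one digit after the decimal point and uses as much precision as needed.

```python
def compare(value):
    """Return the empty-spec, 'g', and '.2g' renderings of a float."""
    return format(value, ""), format(value, "g"), format(value, ".2g")


for v in [0.5, 1.0, 1234567.0]:
    print(compare(v))
# ('0.5', '0.5', '0.5')
# ('1.0', '1', '1')
# ('1234567.0', '1.23457e+06', '1.2e+06')
```

So the equality holds for values like 0.5 but is only approximate in general, which any array-level definition of {:} versus {:g} would need to spell out.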

Gustav

Ilhan Polat

2017-02-15 22:05:00 UTC

On the last item, do we really have to follow those strange `d`, `g`, and so
on formatting conventions? With all respect to the humongous historical
baggage, I think that notation is pretty archaic and terminal-like. If
being pythonic is a concern here, maybe it is better to use a more
verbose syntax. Just throwing out an idea after 15 seconds of thought (so
by no means an alternative suggestion):

eng:6i5d -> engineering notation (powers of ten whose exponents are always
multiples of 3), 6 integral digits and 5 decimal digits.
float (whatever the default is)
float:4i2d (you get the idea)

etc.
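
As a rough illustration of the engineering-notation idea, here is a sketch in plain Python. `eng_format` and its `decimals` parameter are hypothetical; it does not parse the `eng:6i5d` syntax floated above and only shows the "exponent is a multiple of 3" rule.

```python
import math


def eng_format(value, decimals=3):
    """Format value with a power-of-ten exponent that is a multiple of 3."""
    if value == 0:
        return f"{0:.{decimals}f}e+00"
    # Round the decimal exponent down to the nearest multiple of 3.
    exponent = math.floor(math.log10(abs(value)) / 3) * 3
    mantissa = value / 10 ** exponent
    # Note: rounding can push the mantissa up to 1000.0; this sketch
    # does not renormalize in that case.
    return f"{mantissa:.{decimals}f}e{exponent:+03d}"


print(eng_format(12345.678, 2))  # 12.35e+03
print(eng_format(0.00047, 1))    # 470.0e-06
```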

FULL DISCLOSURE: I am a very displeased customer of MATLAB's `fprintf`
(and others) and this archaic formatting. I never got the hang of it, so it
might be the case that I don't quite get the rationale behind it, and I
almost always get it wrong. Maybe at least the rationale can be clarified.

Lastly, repeating what others mentioned: thank you for this well-prepared
initiative.

Nathan Goldbaum

2017-02-15 22:14:42 UTC

Post by Ilhan Polat
On the last item, do we really have to follow those strange `d`, `g`, and so
on formatting conventions? With all respect to the humongous historical
baggage, I think that notation is pretty archaic and terminal-like. If
being pythonic is a concern here, maybe it is better to use a more
verbose syntax. Just throwing out an idea after 15 seconds of thought (so
by no means an alternative suggestion):
eng:6i5d -> engineering notation (powers of ten whose exponents are always
multiples of 3), 6 integral digits and 5 decimal digits.
float (whatever the default is)
float:4i2d (you get the idea)
etc.

While I agree with you that printf format codes are arcane, unfortunately
they need to be used here since they are supported by Python:

https://docs.python.org/3.1/library/string.html#formatspec
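
The reason the existing codes carry over is that str.format, f-strings, and the built-in format() all hand the text after the colon to the object's __format__ method. A small standard-library sketch with a made-up value:

```python
value = 1234.5678

# All three spellings reach float.__format__ with the same spec string.
assert "{:.2f}".format(value) == format(value, ".2f") == value.__format__(".2f")

print(format(42, "d"))       # 42
print(format(value, ".2f"))  # 1234.57
print(format(value, ".3e"))  # 1.235e+03
print(format(value, ".3g"))  # 1.23e+03
```

An ndarray.__format__ would receive the same spec string, so reusing the established type codes keeps arrays consistent with how scalars already behave.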
