Discussion:
[Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Nathaniel Smith
2015-08-25 10:03:41 UTC
Hi all,

These are the notes from the NumPy dev meeting held July 7, 2015, at
the SciPy conference in Austin, presented here so the list can keep up
with what happens, and so you can give feedback. Please do give
feedback, none of this is final!

(Also, if anyone who was there notices anything I left out or
mischaracterized, please speak up -- these are a lot of notes I'm
trying to gather together, so I could easily have missed something!)

Thanks to Jill Cowan and the rest of the SciPy organizers for donating
space and organizing logistics for us, and to the Berkeley Institute
for Data Science for funding travel for Jaime, Nathaniel, and
Sebastian.


Attendees
=========

Present in the room for all or part: Daniel Allan, Chris Barker,
Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
pretty sure this list is incomplete)

Joining remotely for all or part: Stephan Hoyer, Julian Taylor.


Formalizing our governance/decision making
==========================================

This was a major focus of discussion. At a high level, the consensus
was to steal IPython's governance document ("IPEP 29") and modify it
to remove its use of a BDFL as a "backstop" to normal community
consensus-based decision making, and replace it with a new "backstop" based
on Apache-project-style consensus voting amongst the core team.

I'll send out a proper draft of this shortly for further discussion.


Development roadmap
===================

General consensus:

Let's assume NumPy is going to remain important indefinitely, and
try to make it better, instead of waiting for something better to
come along. (This is unlikely to be wasted effort even if something
better does come along, and it's hardly a sure thing that that will
happen anyway.)

Let's focus on evolving numpy as far as we can without major
break-the-world changes (no "numpy 2.0", at least in the foreseeable
future).

And, as a target for that evolution, let's shift our framing from
"NumPy is the library that gives you the np.ndarray object (plus
some attached infrastructure)" to "NumPy provides the standard
framework for working with arrays and array-like objects in
Python".

This means creating defined interfaces between array-like objects /
ufunc objects / dtype objects, so that it becomes possible for third
parties to add their own and mix-and-match. Right now ufuncs are
pretty good at this, but if you want a new array class or dtype then
in most cases you pretty much have to modify numpy itself.

Vision: instead of everyone who wants a new container type having to
reimplement all of numpy, Alice can implement an array class using
(sparse / distributed / compressed / tiled / gpu / out-of-core /
delayed / ...) storage, pass it to code that was written using
direct calls to np.* functions, and it just works. (Instead of
np.sin being "the way you calculate the sine of an ndarray", it's
"the way you calculate the sine of any array-like container
object".)

Vision: Darryl can implement a new dtype for (categorical data /
astronomical dates / integers-with-missing-values / ...) without
having to touch the numpy core.

Vision: Chandni can then come along and combine them by doing

a = alice_array([...], dtype=darryl_dtype)

and it just works.

Vision: no-one is tempted to subclass ndarray, because anything you
can do with an ndarray subclass you can also easily do by defining
your own new class that implements the "array protocol".


Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sub-goals:
- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
API right there. (A sketch of the kind of dispatch this enables is
given at the end of this subsection.)
- Go through the rest of the stuff in numpy, and figure out some
story for how to let it handle third-party array classes:
- ufunc ALL the things: Some things can be converted directly into
(g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
things could be converted into (g)ufuncs if we extended the
(g)ufunc interface a bit (e.g. np.sort, np.matmul).
- Some things probably need their own __numpy_ufunc__-like
extensions (__numpy_concatenate__?)
- Provide tools to make it easier to implement the more complicated
parts of an array object (e.g. the bazillion different methods,
many of which are ufuncs in disguise, or indexing)
- Longer-run interesting research project: __numpy_ufunc__ requires
that one or the other object have explicit knowledge of how to
handle the other, so to handle binary ufuncs with N array types
you need something like N**2 __numpy_ufunc__ code paths. As an
alternative, if there were some interface that an object could
export that provided the operations nditer needs to efficiently
iterate over (chunks of) it, then you would only need N
implementations of this interface to handle all N**2 operations.

This would solve a lot of problems for projects like:
- blosc
- dask
- distarray
- numpy.ma
- pandas
- scipy.sparse
- xray
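
As a rough illustration of what __numpy_ufunc__ dispatch buys us (the
exact signature was still in flux at the time of the meeting, so
treat the argument list below as an approximation rather than the
final protocol):

    import numpy as np

    class LoggedArray(object):
        """Toy third-party container that intercepts ufunc calls."""
        def __init__(self, data):
            self.data = np.asarray(data)

        def __numpy_ufunc__(self, ufunc, method, i, inputs, **kwargs):
            # Unwrap any LoggedArray inputs, let the ufunc do the real
            # work, then re-wrap the result in our own container.
            args = [x.data if isinstance(x, LoggedArray) else x
                    for x in inputs]
            print("intercepted %s.%s" % (ufunc.__name__, method))
            return LoggedArray(getattr(ufunc, method)(*args, **kwargs))

    # Once numpy grows this hook, np.sin(LoggedArray([0.0, 1.0])) would
    # call the method above instead of trying to coerce the object to an
    # ndarray.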


Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We already have something like a C level "dtype
protocol". Conceptually, the way you define a new dtype is by
defining a new class whose instances have data attributes defining
the parameters of the dtype (what fields are in *this* record dtype,
how many characters are in *this* string dtype, what units are used
for *this* datetime64, etc.), and you define a bunch of methods to
do things like convert an object from a Python object to your dtype
or vice-versa, to copy an array of your dtype from one place to
another, to cast to and from your new dtype, etc. This part is
great.

The problem is, in the current implementation, we don't actually use
the Python object system to define these classes / attributes /
methods. Instead, all possible dtypes are jammed into a single
Python-level class, whose struct has fields for the union of all
possible dtypes' attributes, and instead of Python-style method
slots there's just a big table of function pointers attached to each
object.

So the main proposal is that we keep the basic design, but switch it
so that the float64 dtype, the int64 dtype, etc. actually literally
are subclasses of np.dtype, each implementing their own fields and
Python-style methods.
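
For a feel of what "dtypes as real classes" might look like from
Python, here is a deliberately hypothetical sketch -- none of these
names or hooks exist in numpy; it only shows the shape of the idea
(per-instance parameters plus ordinary method slots):

    class CategoricalDtype(object):  # stand-in for a subclass of np.dtype
        def __init__(self, categories):
            # Per-instance parameters, like a record dtype's field list
            # or a datetime64's unit.
            self.categories = list(categories)

        def from_pyobject(self, obj):
            # Method slot: convert a Python object to this dtype's storage.
            return self.categories.index(obj)

        def to_pyobject(self, code):
            # Method slot: convert a stored value back to a Python object.
            return self.categories[code]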

Some of the pieces involved in doing this:

- The current dtype methods should be cleaned up -- e.g. 'dot' and
'less_than' are both dtype methods, when conceptually they're much
more like ufuncs.

- The ufunc inner-loop interface currently does not get a reference
to the dtype object, so inner loops can't see its attributes, and this is
a big obstacle to many interesting dtypes (e.g., it's hard to
implement np.equal for categoricals if you don't know what
categories each has). So we need to add new arguments to the core
ufunc loop signature. (Fortunately this can be done in a
backwards-compatible way.)

- We need to figure out what exactly the dtype methods should be,
and add them to the dtype class (possibly with backwards
compatibility shims for anyone who is accessing PyArray_ArrFuncs
directly).

- Casting will possibly be the trickiest thing to work out, though
the basic idea of using dunder-dispatch-like __cast__ and
__rcast__ methods seems workable (see the sketch at the end of
this list). (Encouragingly, this is also exactly what dynd does,
though unfortunately dynd does not yet support user-defined
dtypes even to the extent that numpy does, so there isn't much
else we can steal from them.)
- We may also want to rethink the casting rules while we're at it,
since they have some very weird corners right now (e.g. see
[https://github.com/numpy/numpy/issues/6240])

- We need to migrate the current dtypes over to the new system,
which can be done in stages:

- First stick them all in a single "legacy dtype" class whose
methods just dispatch to the PyArray_ArrFuncs per-object "method
table"

- Then move each of them into their own classes

- We should provide a Python-level wrapper for the protocol, so that
you can call dtype methods from Python

- And vice-versa, it should be possible to subclass dtype at the
Python level

- etc.
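
As a sketch of the dunder-dispatch casting idea mentioned above (the
__cast__/__rcast__ names come from the discussion; nothing like this
exists in numpy yet):

    def find_cast(from_dtype, to_dtype):
        # Mirror Python's __add__/__radd__ protocol: ask the source
        # dtype first, then give the destination dtype a chance.
        result = from_dtype.__cast__(to_dtype)
        if result is NotImplemented:
            result = to_dtype.__rcast__(from_dtype)
        if result is NotImplemented:
            raise TypeError("no cast from %r to %r"
                            % (from_dtype, to_dtype))
        return result  # e.g. an inner-loop function that performs the cast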

Fortunately, AFAICT pretty much all of this can be done while
maintaining backwards compatibility (though we may want to break
some obscure cases to avoid expending *too* much effort with weird
backcompat contortions that will only help a vanishingly small
proportion of the userbase), and a lot of the above changes can be
done as semi-independent mini-projects, so there's no need for some
branch to go off and spend a year rewriting the world.

Obviously there are still a lot of details to work out, though. But
overall, there was widespread agreement that this is one of the biggest
pain points for our users (e.g. it's the single main request from
pandas), and fixing it is very high priority.

Some features that would become straightforward to implement
(e.g. even in third-party libraries) if this were fixed:
- missing value support
- physical unit tracking (meters / seconds -> array of velocity;
meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes
with attached timezones, or using funky geophysical or
astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that
computes f(x) where x is an array of float64; pass that function
an array with a special dtype and get out both f(x) and f'(x));
see the toy sketch after this list
- probably others I'm forgetting right now
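
To illustrate the forward-mode autodiff item above: the trick is to
carry (value, derivative) pairs through ordinary arithmetic. Here is
a toy stand-alone version (Dual and f are made up for the example);
under the proposal the pair would live inside a custom dtype rather
than a wrapper class:

    import numpy as np

    class Dual(object):
        def __init__(self, value, deriv):
            self.value = np.asarray(value, dtype=np.float64)
            self.deriv = np.asarray(deriv, dtype=np.float64)

        def __add__(self, other):
            return Dual(self.value + other.value,
                        self.deriv + other.deriv)

        def __mul__(self, other):
            # Product rule.
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

    def f(x):
        return x * x + x  # written once, with no knowledge of Dual

    x = Dual(np.array([1.0, 2.0]), np.ones(2))  # seed dx/dx = 1
    y = f(x)
    print(y.value)  # f(x)  -> [ 2.  6.]
    print(y.deriv)  # f'(x) -> [ 3.  5.]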

I should also note that there was one substantial objection to this
plan, from Travis Oliphant (in discussions later in the
conference). I'm not confident I understand his objections well
enough to reproduce them here, though -- perhaps he'll elaborate.


Money
=====

There was an extensive discussion on the topic of: "if we had money,
what would we do with it?"

This is partially motivated by the realization that there are a
number of sources that we could probably get money from, if we had a
good story for what we wanted to do, so it's not just an idle
question.

Points of general agreement:

- Doing the in-person meeting was a good thing. We should plan to do
that again, at least once a year. So one thing to spend money on
is travel subsidies to make sure that happens and is productive.

- While it's tempting to imagine hiring junior people for the more
frustrating/boring work like maintaining buildbots, release
infrastructure, updating docs, etc., this seems difficult to do
realistically with our current resources -- how do we hire for
this, who would manage them, etc.?

- On the other hand, the general feeling was that if we found the
money to hire a few more senior people who could take care of
themselves more, then that would be good and we could
realistically absorb that extra work without totally unbalancing
the project.

- A major open question is how we would recruit someone for a
position like this, since apparently all the obvious candidates
who are active on the NumPy team already have other
things going on. [For calibration on how hard this can be: NYU
has apparently had an open position for a year with the job
description of "come work at NYU full-time with a
private-industry-competitive-salary on whatever your personal
open-source scientific project is" (!) and still is having an
extremely difficult time filling it:
[http://cds.nyu.edu/research-engineer/]]

- General consensus, though, was that there isn't much to be done
about this except to try it and see.

- (By the way, if you're someone who's reading this and is
potentially interested in something like a postdoc (or better)
working on numpy, then let's talk...)


More specific changes to numpy that had general consensus, but don't
really fit into a high-level roadmap
=========================================================================================================

- Resolved: we should merge multiarray.so and umath.so into a single
extension module, so that they can share utility code without the
current awkward contortions.

- Resolved: we should start hiding new fields in the ufunc and dtype
structs as soon as possible going forward. (I.e. they would not be
present in the version of the structs that are exposed through the
C API, but internally we would use a more detailed struct.)
- Mayyyyyybe we should even go ahead and hide the subset of the
existing fields that are really internal details that no-one
should be using. If we did this without changing anything else
then it would preserve ABI (the fields would still be where
existing compiled extensions expect them to be, if any such
extensions exist) while breaking API (trying to compile such
extensions would give a clear error), so would be a smoother
ramp if we think we need to eventually break those fields for
real. (As discussed above, there are a bunch of fields in the
dtype base class that only make sense for specific dtype
subclasses, e.g. only record dtypes need a list of field names,
but right now all dtypes have one anyway. So it would be nice to
remove these from the base class entirely, but that is
potentially ABI-breaking.)

- Resolved: np.array should never return an object array unless
explicitly requested (e.g. with dtype=object); it just causes too
many surprising problems.
- First step: add a deprecation warning
- Eventually: make it an error.
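
(For concreteness, the kind of silent surprise this targets, with
numpy's current behaviour:)

    import numpy as np

    a = np.array([[1, 2, 3], [1, 2]])  # ragged input, no dtype requested
    print(a.dtype)   # object -- silently, instead of an error
    print(a.shape)   # (2,)   -- almost certainly not what the caller wanted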

- The matrix class
- Resolved: We won't add warnings yet, but we will prominently
document that it is deprecated and should be avoided wherever
possible.
- Stéfan van der Walt volunteers to do this.
- We'd all like to deprecate it properly, but the feeling was that
the precondition for this is for scipy.sparse to provide sparse
"arrays" that don't return np.matrix objects on ordinary
operations. Until that happens we can't reasonably tell people
that using np.matrix is a bug.

- Resolved: we should add a similar prominent note to the
"subclassing ndarray" documentation, warning people that this is
painful and barely works and please don't do it if you have any
alternatives.

- Resolved: we want more, smaller releases -- every 6 months at
least, aiming to go even faster (every 4 months?)

- On the question of using Cython inside numpy core:
- Everyone agrees that there are places where this would be an
improvement (e.g., Python<->C interfaces, and places "when you
want to do computer science", e.g. complicated algorithmic stuff
like graph traversals)
- Chuck wanted it to be clear though that he doesn't think it
would be a good goal to try and rewrite all of numpy in Cython
-- there also exist places where Cython ends up being "an uglier
version of C". No-one disagreed.

- Our text reader is apparently not very functional on Python 3, and
generally slow and hard to work with.
- Resolved: We should extract pandas's awesome text reader/parser
and convert it into its own package, which could then become a
new backend for both pandas and numpy.loadtxt.
- Jeff thinks this is a great idea
- Thomas Caswell volunteers to do the extraction.

- We should work on improving our tools for evolving the ABI, so
that we will eventually be less constrained by decisions made
decades ago.
- One idea that had a lot of support was to switch from our
current append-only C-API to a "sliding window" API based on
explicit versions. So a downstream package might say

#define NUMPY_API_VERSION 4

and they'd get the functions and behaviour provided in "version
4" of the numpy C api. If they wanted to get access to new stuff
that was added in version 5, then they'd need to switch that
#define, and at the same time clean up any usage of stuff that
was removed or changed in version 5. And to provide a smooth
migration path, one version of numpy would support multiple
versions at once, gradually deprecating and dropping old
versions.

- If anyone wants to help bring pip up to scratch WRT tracking ABI
dependencies (e.g., 'pip install numpy==<version with new ABI>'
-> triggers rebuild of scipy against the new ABI), then that
would be an extremely useful thing.


Policies that should be documented
==================================

...together with some notes about what the contents of the document
should be:


How we manage bugs in the bug tracker.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Github "milestones" should *only* be assigned to release-blocker
bugs (which mostly means "regression from the last release").

In particular, if you're tempted to push a bug forward to the next
release... then it's clearly not a blocker, so don't set it to the
next release's milestone, just remove the milestone entirely.

(Obvious exception to this: deprecation followup bugs where we
decide that we want to keep the deprecation around a bit longer
are a case where a bug actually does switch from being a blocker
for release 1.x to being a blocker for release 1.(x+1).)

- Don't hesitate to close an issue if there's no way forward --
e.g. a PR where the author has disappeared. Just post a link to
this policy and close, with a polite note that we need to keep our
tracker useful as a todo list, but they're welcome to re-open if
things change.


Deprecations and breakage policy:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- How long do we need to keep DeprecationWarnings around before we
break things? This is tricky because on the one hand an aggressive
(short) deprecation period lets us deliver new features and
important cleanups more quickly, but on the other hand a
too-aggressive deprecation period is difficult for our more
conservative downstream users.

- Idea that had the most support: pick a somewhat-aggressive
warning period as our default, and make a rule that if someone
asks for an extension during the beta cycle for the release that
removes it, then we put it back for another release or two worth
of grace period. (While also possibly upgrading the warning to
be more visible during the grace period.) This gives us
deprecation periods that are more adaptive on a case-by-case
basis.

- Lament: it would be really nice if we could get more people to
test our beta releases, because in practice right now 1.x.0 ends
up being where we actually discover all the bugs, and 1.x.1 is
where it actually becomes usable. Which sucks, and makes it
difficult to have a solid policy about what counts as a
regression, etc. Is there anything we can do about this?

- ABI breakage: we distinguish between an ABI break that breaks
everything (e.g., "import scipy" segfaults), versus an ABI break
that breaks an occasional rare case (e.g., only apps that poke
around in some obscure corner of some struct are affected).

- The "break-the-world" type remains off-limit for now: the pain
is still too large (conda helps, but there are lots of people
who don't use conda!), and there aren't really any compelling
improvements that this would enable anyway.

- For the "break-0.1%-of-users" type, it is *not* ruled out by
fiat, though we remain conservative: we should treat it like
other API breaks in principle, and do a careful case-by-case
analysis of the details of the situation, taking into account
what kind of code would be broken, how common these cases are,
how important the benefits are, whether there are any specific
mitigation strategies we can use, etc. -- with this process of
course taking into account that a segfault is nastier than a
Python exception.


Other points that were discussed
================================

- There was inconclusive discussion of what we should do with dot()
in the places where it disagrees with the PEP 465 matmul semantics
(specifically this is when both arguments have ndim >= 3, or one
argument has ndim == 0).
- The concern is that the current behavior is not very useful, and
as far as we can tell no-one is using it; but, as people get
used to the more-useful PEP 465 behavior, they will increasingly
try to use it on the assumption that np.dot will work the same
way, and this will create pain for lots of people. So Nathaniel
argued that we should start at least issuing a visible warning
when people invoke the corner-case behavior.
- But OTOH, np.dot is such a core piece of infrastructure, and
there's such a large landscape of code out there using numpy
that we can't see, that others were reasonably wary of making
any change.
- For now: document prominently, but no change in behavior.
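
(For reference, here is the corner case in question, assuming a numpy
new enough to provide np.matmul as the reference for the PEP 465
semantics: matmul broadcasts the leading "stack" dimensions, while
dot does a sum-product over the last axis of the first argument and
the second-to-last axis of the second, producing a much larger
result.)

    import numpy as np

    a = np.ones((2, 3, 4))
    b = np.ones((2, 4, 5))

    print(np.dot(a, b).shape)     # (2, 3, 2, 5) -- current np.dot behaviour
    print(np.matmul(a, b).shape)  # (2, 3, 5)    -- PEP 465 semantics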


Links to raw notes
==================

Main page:
[https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]

Notes from the meeting proper:
[https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing]

Slides from the followup BoF:
[https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp]

Notes from the followup BoF:
[https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit]

-n
--
Nathaniel J. Smith -- http://vorpus.org
Charles R Harris
2015-08-25 16:43:19 UTC
Post by Nathaniel Smith
[full quote of the meeting notes snipped]
Hi Nathaniel. Thanks for putting this together.

Chuck
Nathan Goldbaum
2015-08-25 16:52:42 UTC
Post by Nathaniel Smith
Hi all,
These are the notes from the NumPy dev meeting held July 7, 2015, at
the SciPy conference in Austin, presented here so the list can keep up
with what happens, and so you can give feedback. Please do give
feedback, none of this is final!
(Also, if anyone who was there notices anything I left out or
mischaracterized, please speak up -- these are a lot of notes I'm
trying to gather together, so I could easily have missed something!)
Thanks to Jill Cowan and the rest of the SciPy organizers for donating
space and organizing logistics for us, and to the Berkeley Institute
for Data Science for funding travel for Jaime, Nathaniel, and
Sebastian.
Attendees
=========
Present in the room for all or part: Daniel Allan, Chris Barker,
Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
pretty sure this list is incomplete)
Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
Formalizing our governance/decision making
==========================================
This was a major focus of discussion. At a high level, the consensus
was to steal IPython's governance document ("IPEP 29") and modify it
to remove its use of a BDFL as a "backstop" to normal community
consensus-based decision, and replace it with a new "backstop" based
on Apache-project-style consensus voting amongst the core team.
I'll send out a proper draft of this shortly for further discussion.
Development roadmap
===================
Let's assume NumPy is going to remain important indefinitely, and
try to make it better, instead of waiting for something better to
come along. (This is unlikely to be wasted effort even if something
better does come along, and it's hardly a sure thing that that will
happen anyway.)
Let's focus on evolving numpy as far as we can without major
break-the-world changes (no "numpy 2.0", at least in the foreseeable
future).
And, as a target for that evolution, let's change our focus from
numpy as "NumPy is the library that gives you the np.ndarray object
(plus some attached infrastructure)", to "NumPy provides the
standard framework for working with arrays and array-like objects in
Python"
This means, creating defined interfaces between array-like objects /
ufunc objects / dtype objects, so that it becomes possible for third
parties to add their own and mix-and-match. Right now ufuncs are
pretty good at this, but if you want a new array class or dtype then
in most cases you pretty much have to modify numpy itself.
Vision: instead of everyone who wants a new container type having to
reimplement all of numpy, Alice can implement an array class using
(sparse / distributed / compressed / tiled / gpu / out-of-core /
delayed / ...) storage, pass it to code that was written using
direct calls to np.* functions, and it just works. (Instead of
np.sin being "the way you calculate the sine of an ndarray", it's
"the way you calculate the sine of any array-like container
object".)
Vision: Darryl can implement a new dtype for (categorical data /
astronomical dates / integers-with-missing-values / ...) without
having to touch the numpy core.
Vision: Chandni can then come along and combine them by doing
a = alice_array([...], dtype=darryl_dtype)
and it just works.
Vision: no-one is tempted to subclass ndarray, because anything you
can do with an ndarray subclass you can also easily do by defining
your own new class that implements the "array protocol".
Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
API right there.
- Go through the rest of the stuff in numpy, and figure out some
- ufunc ALL the things: Some things can be converted directly into
(g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
things could be converted into (g)ufuncs if we extended the
(g)ufunc interface a bit (e.g. np.sort, np.matmul).
- Some things probably need their own __numpy_ufunc__-like
extensions (__numpy_concatenate__?)
- Provide tools to make it easier to implement the more complicated
parts of an array object (e.g. the bazillion different methods,
many of which are ufuncs in disguise, or indexing)
- Longer-run interesting research project: __numpy_ufunc__ requires
that one or the other object have explicit knowledge of how to
handle the other, so to handle binary ufuncs with N array types
you need something like N**2 __numpy_ufunc__ code paths. As an
alternative, if there were some interface that an object could
export that provided the operations nditer needs to efficiently
iterate over (chunks of) it, then you would only need N
implementations of this interface to handle all N**2 operations.
- blosc
- dask
- distarray
- numpy.ma
- pandas
- scipy.sparse
- xray
Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We already have something like a C level "dtype
protocol". Conceptually, the way you define a new dtype is by
defining a new class whose instances have data attributes defining
the parameters of the dtype (what fields are in *this* record dtype,
how many characters are in *this* string dtype, what units are used
for *this* datetime64, etc.), and you define a bunch of methods to
do things like convert an object from a Python object to your dtype
or vice-versa, to copy an array of your dtype from one place to
another, to cast to and from your new dtype, etc. This part is
great.
The problem is, in the current implementation, we don't actually use
the Python object system to define these classes / attributes /
methods. Instead, all possible dtypes are jammed into a single
Python-level class, whose struct has fields for the union of all
possible dtype's attributes, and instead of Python-style method
slots there's just a big table of function pointers attached to each
object.
So the main proposal is that we keep the basic design, but switch it
so that the float64 dtype, the int64 dtype, etc. actually literally
are subclasses of np.dtype, each implementing their own fields and
Python-style methods.
- The current dtype methods should be cleaned up -- e.g. 'dot' and
'less_than' are both dtype methods, when conceptually they're much
more like ufuncs.
- The ufunc inner-loop interface currently does not get a reference
to the dtype object, so they can't see its attributes and this is
a big obstacle to many interesting dtypes (e.g., it's hard to
implement np.equal for categoricals if you don't know what
categories each has). So we need to add new arguments to the core
ufunc loop signature. (Fortunately this can be done in a
backwards-compatible way.)
- We need to figure out what exactly the dtype methods should be,
and add them to the dtype class (possibly with backwards
compatibility shims for anyone who is accessing PyArray_ArrFuncs
directly).
- Casting will be possibly the trickiest thing to work out, though
the basic idea of using dunder-dispatch-like __cast__ and
__rcast__ methods seems workable. (Encouragingly, this is also
exactly what dynd also does, though unfortunately dynd does not
yet support user-defined dtypes even to the extent that numpy
does, so there isn't much else we can steal from them.)
- We may also want to rethink the casting rules while we're at it,
since they have some very weird corners right now (e.g. see
[https://github.com/numpy/numpy/issues/6240])
- We need to migrate the current dtypes over to the new system,
- First stick them all in a single "legacy dtype" class whose
methods just dispatch to the PyArray_ArrFuncs per-object "method
table"
- Then move each of them into their own classes
- We should provide a Python-level wrapper for the protocol, so that
you can call dtype methods from Python
- And vice-versa, it should be possible to subclass dtype at the
Python level
- etc.
Fortunately, AFAICT pretty much all of this can be done while
maintaining backwards compatibility (though we may want to break
some obscure cases to avoid expending *too* much effort with weird
backcompat contortions that will only help a vanishingly small
proportion of the userbase), and a lot of the above changes can be
done as semi-independent mini-projects, so there's no need for some
branch to go off and spend a year rewriting the world.
Obviously there are still a lot of details to work out, though. But
overall, there was widespread agreement that this is one of the #1
pain points for our users (e.g. it's the single main request from
pandas), and fixing it is very high priority.
Some features that would become straightforward to implement
- missing value support
- physical unit tracking (meters / seconds -> array of velocity;
meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes
with attached timezones, or using funky geophysical or
astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that
computes f(x) where x is an array of float64; pass that function
an array with a special dtype and get out both f(x) and f'(x))
- probably others I'm forgetting right now
I should also note that there was one substantial objection to this
plan, from Travis Oliphant (in discussions later in the
conference). I'm not confident I understand his objections well
enough to reproduce them here, though -- perhaps he'll elaborate.
Money
=====
There was an extensive discussion on the topic of: "if we had money,
what would we do with it?"
This is partially motivated by the realization that there are a
number of sources that we could probably get money from, if we had a
good story for what we wanted to do, so it's not just an idle
question.
- Doing the in-person meeting was a good thing. We should plan do
that again, at least once a year. So one thing to spend money on
is travel subsidies to make sure that happens and is productive.
- While it's tempting to imagine hiring junior people for the more
frustrating/boring work like maintaining buildbots, release
infrastructure, updating docs, etc., this seems difficult to do
realistically with our current resources -- how do we hire for
this, who would manage them, etc.?
- On the other hand, the general feeling was that if we found the
money to hire a few more senior people who could take care of
themselves more, then that would be good and we could
realistically absorb that extra work without totally unbalancing
the project.
- A major open question is how we would recruit someone for a
position like this, since apparently all the obvious candidates
who are already active on the NumPy team already have other
things going on. [For calibration on how hard this can be: NYU
has apparently had an open position for a year with the job
description of "come work at NYU full-time with a
private-industry-competitive-salary on whatever your personal
open-source scientific project is" (!) and still is having an
[http://cds.nyu.edu/research-engineer/]]
- General consensus though was that there isn't much to be done
about this though, except try it and see.
- (By the way, if you're someone who's reading this and
potentially interested in like a postdoc or better working on
numpy, then let's talk...)
More specific changes to numpy that had general consensus, but don't
really fit into a high-level roadmap
=========================================================================================================
- Resolved: we should merge multiarray.so and umath.so into a single
extension module, so that they can share utility code without the
current awkward contortions.
- Resolved: we should start hiding new fields in the ufunc and dtype
structs as soon as possible going forward. (I.e. they would not be
present in the version of the structs that are exposed through the
C API, but internally we would use a more detailed struct.)
- Mayyyyyybe we should even go ahead and hide the subset of the
existing fields that are really internal details that no-one
should be using. If we did this without changing anything else
then it would preserve ABI (the fields would still be where
existing compiled extensions expect them to be, if any such
extensions exist) while breaking API (trying to compile such
extensions would give a clear error), so would be a smoother
ramp if we think we need to eventually break those fields for
real. (As discussed above, there are a bunch of fields in the
dtype base class that only make sense for specific dtype
subclasses, e.g. only record dtypes need a list of field names,
but right now all dtypes have one anyway. So it would be nice to
remove these from the base class entirely, but that is
potentially ABI-breaking.)
- Resolved: np.array should never return an object array unless
explicitly requested (e.g. with dtype=object); it just causes too
many surprising problems.
- First step: add a deprecation warning
- Eventually: make it an error.
- The matrix class
- Resolved: We won't add warnings yet, but we will prominently
document that it is deprecated and should be avoided where-ever
possible.
- Stéfan van der Walt volunteers to do this.
- We'd all like to deprecate it properly, but the feeling was that
the precondition for this is for scipy.sparse to provide sparse
"arrays" that don't return np.matrix objects on ordinary
operatoins. Until that happens we can't reasonably tell people
that using np.matrix is a bug.
- Resolved: we should add a similar prominent note to the
"subclassing ndarray" documentation, warning people that this is
painful and barely works and please don't do it if you have any
alternatives.
- Resolved: we want more, smaller releases -- every 6 months at
least, aiming to go even faster (every 4 months?)
- Everyone agrees that there are places where this would be an
improvement (e.g., Python<->C interfaces, and places "when you
want to do computer science", e.g. complicated algorithmic stuff
like graph traversals)
- Chuck wanted it to be clear though that he doesn't think it
would be a good goal to try and rewrite all of numpy in Cython
-- there also exist places where Cython ends up being "an uglier
version of C". No-one disagreed.
- Our text reader is apparently not very functional on Python 3, and
generally slow and hard to work with.
- Resolved: We should extract Pandas's awesome text reader/parser
and convert it into its own package, that could then become a
new backend for both pandas and numpy.loadtxt.
- Jeff thinks this is a great idea
- Thomas Caswell volunteers to do the extraction.
- We should work on improving our tools for evolving the ABI, so
that we will eventually be less constrained by decisions made
decades ago.
- One idea that had a lot of support was to switch from our
current append-only C-API to a "sliding window" API based on
explicit versions. So a downstream package might say
#define NUMPY_API_VERSION 4
and they'd get the functions and behaviour provided in "version
4" of the numpy C api. If they wanted to get access to new stuff
that was added in version 5, then they'd need to switch that
#define, and at the same time clean up any usage of stuff that
was removed or changed in version 5. And to provide a smooth
migration path, one version of numpy would support multiple
versions at once, gradually deprecating and dropping old
versions.
- If anyone wants to help bring pip up to scratch WRT tracking ABI
dependencies (e.g., 'pip install numpy==<version with new ABI>'
-> triggers rebuild of scipy against the new ABI), then that
would be an extremely useful thing.
Policies that should be documented
==================================
...together with some notes about what the contents of those documents should be.
How we manage bugs in the bug tracker.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Github "milestones" should *only* be assigned to release-blocker
bugs (which mostly means "regression from the last release").
In particular, if you're tempted to push a bug forward to the next
release... then it's clearly not a blocker, so don't set it to the
next release's milestone, just remove the milestone entirely.
(Obvious exception to this: deprecation followup bugs where we
decide that we want to keep the deprecation around a bit longer
are a case where a bug actually does switch from being a blocker
for release 1.x to being a blocker for release 1.(x+1).)
- Don't hesitate to close an issue if there's no way forward --
e.g. a PR where the author has disappeared. Just post a link to
this policy and close, with a polite note that we need to keep our
tracker useful as a todo list, but they're welcome to re-open if
things change.
How we handle deprecations and breaking changes.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- How long do we need to keep DeprecationWarnings around before we
break things? This is tricky because on the one hand an aggressive
(short) deprecation period lets us deliver new features and
important cleanups more quickly, but on the other hand a
too-aggressive deprecation period is difficult for our more
conservative downstream users.
- Idea that had the most support: pick a somewhat-aggressive
warning period as our default, and make a rule that if someone
asks for an extension during the beta cycle for the release that
removes it, then we put it back for another release or two worth
of grace period. (While also possibly upgrading the warning to
be more visible during the grace period.) This gives us
deprecation periods that are more adaptive on a case-by-case
basis.
- Lament: it would be really nice if we could get more people to
test our beta releases, because in practice right now 1.x.0 ends
up being where we actually discover all the bugs, and 1.x.1 is
where it actually becomes usable. Which sucks, and makes it
difficult to have a solid policy about what counts as a
regression, etc. Is there anything we can do about this?
Just a note in here - have you all thought about running the test suites
for downstream projects as part of the numpy test suite?

Thanks so much for the summary - lots of interesting ideas in here!
Post by Nathaniel Smith
- ABI breakage: we distinguish between an ABI break that breaks
everything (e.g., "import scipy" segfaults), versus an ABI break
that breaks an occasional rare case (e.g., only apps that poke
around in some obscure corner of some struct are affected).
- The "break-the-world" type remains off-limits for now: the pain
is still too large (conda helps, but there are lots of people
who don't use conda!), and there aren't really any compelling
improvements that this would enable anyway.
- For the "break-0.1%-of-users" type, it is *not* ruled out by
fiat, though we remain conservative: we should treat it like
other API breaks in principle, and do a careful case-by-case
analysis of the details of the situation, taking into account
what kind of code would be broken, how common these cases are,
how important the benefits are, whether there are any specific
mitigation strategies we can use, etc. -- with this process of
course taking into account that a segfault is nastier than a
Python exception.
Other points that were discussed
================================
- There was inconclusive discussion of what we should do with dot()
in the places where it disagrees with the PEP 465 matmul semantics
(specifically this is when both arguments have ndim >= 3, or one
argument has ndim == 0).
- The concern is that the current behavior is not very useful, and
as far as we can tell no-one is using it; but, as people get
used to the more-useful PEP 465 behavior, they will increasingly
try to use it on the assumption that np.dot will work the same
way, and this will create pain for lots of people. So Nathaniel
argued that we should start at least issuing a visible warning
when people invoke the corner-case behavior.
- But OTOH, np.dot is such a core piece of infrastructure, and
there's such a large landscape of code out there using numpy
that we can't see, that others were reasonably wary of making
any change.
- For now: document prominently, but no change in behavior.
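
A minimal sketch of the divergence, assuming a numpy new enough to ship
np.matmul (1.10 or later):

    import numpy as np

    a = np.ones((2, 3, 4))
    b = np.ones((2, 4, 5))

    # PEP 465 semantics: treat the operands as stacks of matrices and
    # broadcast over the leading dimensions.
    print(np.matmul(a, b).shape)    # (2, 3, 5)

    # np.dot instead sum-products a's last axis against b's
    # second-to-last axis and takes an outer product over the rest.
    print(np.dot(a, b).shape)       # (2, 3, 2, 5)

    # ndim == 0 is the other corner: dot accepts scalars, matmul does not.
    print(np.dot(2.0, np.ones(3)))  # [2. 2. 2.]
    try:
        np.matmul(2.0, np.ones(3))
    except (ValueError, TypeError):
        print("matmul rejects 0-d operands")
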
Links to raw notes
==================
[https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
[
https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing
]
[
https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp
]
[
https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit
]
-n
--
Nathaniel J. Smith -- http://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Travis Oliphant
2015-08-25 19:00:54 UTC
Permalink
Thanks for the write-up Nathaniel. There is a lot of great detail and
interesting ideas here.

I am very eager to understand how to help NumPy and the wider community
move forward however I can (my passions on this have not changed since
1999, though what I myself spend time on has changed).

There are a lot of ways to think about approaching this, though. It's
hard to get all the ideas on the table, and it was unfortunate we couldn't
get everybody who is a core NumPy dev together in person to have this
discussion as there are still a lot of questions unanswered and a lot of
thought that has gone into other approaches that was not brought up or
represented in the meeting (how does Numba fit into this, what about
data-shape, dynd, memory-views and Python type system, etc.). If NumPy
becomes just an interface-specification, then why don't we just do that
*outside* NumPy itself in a way that doesn't jeopardize the stability of
NumPy today?   These are some of the real questions I have.   I will try
to write up my thoughts in more depth soon, but I won't be able to respond
in-depth right now. I just wanted to comment because Nathaniel said I
disagree, which is only partly true.

The three most important things for me are 1) let's make sure we have
representation from as wide of the community as possible (this is really
hard), 2) let's look around at the broader community and the prior art that
is happening in this space right now and 3) let's not pretend we are going
to be able to make all this happen without breaking ABI compatibility.
Let's just break ABI compatibility with NumPy 2.0 *and* have as much
fidelity with the API and semantics of current NumPy as possible (though
there will be some changes necessary long-term).

I don't think we should intentionally break ABI if we can avoid it, but I
also don't think we should spend inordinate amounts of time trying to
pretend that we won't break ABI (for at least some people), and most
importantly we should not pretend *not* to break the ABI when we actually
do. We did this once before with the roll-out of date-time, and it was
really unnecessary.   When I released NumPy 1.0, there were several
things that I knew should be fixed very soon (NumPy was never designed to
not break ABI).   Those problems are still there.   Now that we have
quite a bit better understanding of what NumPy *should* be (there have been
tremendous strides in understanding and community size over the past 10
years), let's actually make the infrastructure we think will last for the
next 20 years (instead of trying to shoe-horn new ideas into a 20-year old
code-base that wasn't designed for it).

NumPy is a hard code-base. It has been since Numeric days in 1995. I
could be wrong, but my guess is that we will be passed by as a community if
we don't seize the opportunity to build something better than we can build
if we are forced to use a 20 year old code-base.

It is more important to not break people's code and to be clear when a
re-compile is necessary for dependencies. Those to me are the most
important constraints. There are a lot of great ideas that we all have
about what we want NumPy to be able to do.   Some of these are pretty
transformational (and the more exciting they are, the harder I think they
are going to be to implement without breaking at least the ABI). There
is probably some CAP-like theorem around
Stability-Features-Speed-of-Development (pick 2) when it comes to Open
Source Software development and making feature-progress with NumPy *is
going* to create instability, which concerns me.

I would like to see a little bit of pain one time with a NumPy 2.0, rather
than the constant pain of the constant-churn-over-many-years approach
that Nathaniel seems to advocate.   To me NumPy 2.0 is an ABI-breaking
release that is as API-compatible as possible and whose semantics are not
dramatically different.

There are at least 3 areas of compatibility (ABI, API, and semantic).
ABI-compatibility is a non-feature in today's world. There are so many
distributions of the NumPy stack (and conda makes it trivial for anyone to
build their own or for you to build one yourself). Making less-optimal
software-engineering choices because of fear of breaking the ABI is not
something I'm supportive of at all. We should not break ABI every
release, but a release every 3 years that breaks ABI is not a problem.

API compatibility should be much more sacrosanct, but it is also something
that can be managed.   Any NumPy 2.0 should definitely support the
full NumPy API (though there could be deprecated swaths). I think the
community has done well in using deprecation and limiting the public API to
make this more manageable and I would love to see a NumPy 2.0 that
solidifies a future-oriented API along with a backward-compatible API that
is also available.

Semantic compatibility is the hardest. We have already broken this on
multiple occasions throughout the 1.x NumPy releases. Every time you
change the code, this can change.   This is what I fear will cause deep
instability over the course of many years. These are things like the
casting rule details, the effect of indexing changes, any change to the
calculation approaches.   It is and has been the most at risk during any
code-changes. My view is that a NumPy 2.0 (with a new low-level
architecture) minimizes these changes to a single release rather than
unavoidably spreading them out over many, many releases.

I think that summarizes my main concerns.   I will write up more forward-
thinking ideas for what else is possible in the coming weeks.   In the mean
time, thanks for keeping the discussion going. It is extremely exciting to
see the help people have continued to provide to maintain and improve
NumPy. It will be exciting to see what the next few years bring as well.


Best,

-Travis
Post by Nathaniel Smith
<snip>
--
*Travis Oliphant*
*Co-founder and CEO*


@teoliphant
512-222-5440
http://www.continuum.io
Charles R Harris
2015-08-25 20:58:46 UTC
Permalink
Post by Travis Oliphant
<snip>
I think the only thing that looks even a little bit like a numpy 2.0 at
this time is dynd. Rewriting numpy, let alone producing numpy 2.0 is a
major project. Dynd is 2.5+ years old, 3500+ commits in, and still in
progress. If there is a decision to pursue Dynd I could support that, but
I think we would want to think deeply about how to make the transition as
painless as possible. It would be good at this point to get some feedback
from people currently using dynd. IIRC, part of the reason for starting
dynd was the perception that it was not possible to evolve numpy without
running into compatibility road blocks. Travis, could you perhaps summarize
the thinking that went into the decision to make dynd a separate project?

<snip>

Chuck
David Cournapeau
2015-08-26 00:53:00 UTC
Permalink
Thanks for the good summary Nathaniel.

Regarding dtype machinery, I agree casting is the hardest part. Unless the
code has changed dramatically, this was the main reason why you could not
make most of the dtypes separate from the numpy codebase (I tried to move the
datetime dtype out of multiarray into a separate C extension some years
ago). Being able to separate the dtypes from the multiarray module would be
an obvious way to drive the internal API change.

Regarding the use of cython in numpy, was there any discussion about the
compilation/size cost of using cython, and talking to the cython team to
improve this? Or was that considered acceptable with current cython for
numpy? I am convinced cleanly separating the low level parts from the
python C API plumbing would be the single most important thing one could do
to make the codebase more amenable to change.

David
Post by Charles R Harris
<snip>
Nathaniel Smith
2015-08-26 07:05:41 UTC
Permalink
Post by David Cournapeau
Thanks for the good summary Nathaniel.
Regarding dtype machinery, I agree casting is the hardest part. Unless the
code has changed dramatically, this was the main reason why you could not
make most of the dtypes separate from numpy codebase (I tried to move the
datetime dtype out of multiarray into a separate C extension some years
ago). Being able to separate the dtypes from the multiarray module would be
an obvious way to drive the internal API change.
For practical reasons I don't imagine we'll ever want to actually move
the core dtypes out of multiarray -- if nothing else they will always
remain a little bit special, like np.array([1.0, 2.0]) will just
"know" that this should use the float64 dtype. But yeah, in general a
good heuristic would be that -- aside from a few limited cases like
that -- we want to make built-in dtypes and user-defined dtypes use
the same APIs.
Post by David Cournapeau
Regarding the use of cython in numpy, was there any discussion about the
compilation/size cost of using cython, and talking to the cython team to
improve this ? Or was that considered acceptable with current cython for
numpy. I am convinced cleanly separating the low level parts from the python
C API plumbing would be the single most important thing one could do to make
the codebase more amenable.
It's still a more blue-sky idea than that... the discussion was more
at the level of "is this something that is even worth trying to make
work and seeing where the problems are?"

The big immediate problem, before we got into code size issues, would
be that we would need to be able to compile a mix of .pyx files and .c
files into a single .so, while cython generated code currently makes
some strong assumptions about how each .pyx file will live in its own
.so. From playing around with it I suspect the first version of making
this work will be klugey indeed. But yeah, the thing to do would be
for someone to dig in and make the kluges and then decide how to clean
them up once you know where they are.
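
For reference, the easy direction already works when the .pyx file is the
module entry point -- a minimal sketch with made-up module and file names:

    # setup.py -- build one extension module from a .pyx file plus plain
    # C sources; cythonize translates the .pyx and the C compiler links
    # everything into a single .so.
    from setuptools import setup, Extension
    from Cython.Build import cythonize

    ext = Extension(
        "mixedmod",                             # hypothetical module name
        sources=["mixedmod.pyx", "helpers.c"],  # hypothetical sources
    )

    setup(name="mixedmod", ext_modules=cythonize([ext]))

The hard part described above is the opposite arrangement: the module init
for multiarray lives in existing C code, and the cython-generated code
assumes it owns the module it is compiled into.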

-n
--
Nathaniel J. Smith -- http://vorpus.org
Sebastian Berg
2015-08-26 08:57:57 UTC
Permalink
Post by Nathaniel Smith
<snip>
Well, casting is the conceptually hardest part. Marrying it to the rest
of numpy is probably just as hard ;).

With the chance of not having thought this through enough, maybe some
points about the general discussion. I think I would like some more
clarity of what we want and especially *need* [1].

From SciPy, there were two things I particularly remember:
1. the dtype/scalar issue
2. making an interface to make array-like interaction more sane (this I
think can go quite far, and we are already going part of the way)

The dtypes/scalars seem a particularly dark corner of numpy and if it is
feasible for us to replace it with something new, then I would be
willing to do some breaks for it (admittedly, given protest, I would
back down from that and another solution would be needed).

The point for me is, I currently think a dtype/scalar rework could take
numpy a long way, especially from the point of view of downstream packages. Of
course it would be harder to do in numpy than in something new, but it
should also be of much more immediate use.
Maybe I am going a bit too far with this right now, but I could imagine
that if we cannot clean up the dtype/scalars, numpy may indeed be doomed
or at least become a brick slowing down a lot of other people.

And if it is not possible to do this without a numpy 2, then likely that
is the way to go. But I am not convinced we should aim to fix all the
other stuff at the same time. I am afraid it would just accumulate to
grow over everyone's heads.
In other words, I think if we can muster the resources I would like to
see this problem attacked within numpy. If this proves impossible a new
dtype abstraction may well be reason for numpy 2, or used by a DyND or
similar? But I do believe we should not give up on Numpy here from the
start, at least I do not see a compelling reason to do so. Instead, giving
up on numpy seems like the last way out of misery.
And much of the difference in opinions to me seems to be about whether we
think this will clearly happen or not, or has already happened (or maybe
whether it is too costly to do in numpy).

Cleaning it up would open doors to many things. Note that I think it
would make the numpy source much less scary, because I think it is the
one big piece of code that is maybe not clearly a separate chunk [2].
After making it sane, I would argue that numpy does become much more
maintainable and extensible. From my current view, probably enough so
for a long time.
Also, I think it would give us abstraction to make different/new
projects work together better and if done well enough, some grand new
project set to replace numpy could reuse it.

Of course it is entirely possible that more things need to be changed in
numpy and that some others would be just as hard or even harder to do.
But if we can identify this as the "one big thing that gets us 90%" then
I refuse to give up hope of doing it in numpy just yet.

- Sebastian


[1] Travis has said quite a lot about it, but it is not yet clear to me
what is a priority/real pain point. Take "datashape" for example. By now
I think that the datashape is likely a good idea to make structured
arrays nicer, since it moves the "structured" part into the array object
and not the dtype, which makes sense to me. However, I am not convinced
that the datashape is something that would make numpy a compelling
amount better. In fact I could imagine that for many things it would
make it unnecessarily more complicated for users.


[2] Take indexing, I like to think I did not break that much when
redoing it (except on purpose, which I hope did not create much
trouble). In some sense indexing was simple to redo, because it does not
overlap at all with anything else directly. If we get dtypes/scalars
more separated, I think we are at a point where this is possible with
pretty much any part of numpy.
Travis Oliphant
2015-08-26 03:34:25 UTC
Permalink
Post by Charles R Harris
<snip>
Thanks Chuck. I'll do this in a separate email, but I just wanted to
point out that when I say NumPy 2.0, I'm actually only specifically talking
about a release of NumPy that breaks ABI compatibility --- not some
potential re-write. I'm not ruling that out, but I'm not necessarily
implying such a thing by saying NumPy 2.0.
Post by Travis Oliphant
<snip>
Chuck
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
*Travis Oliphant*
*Co-founder and CEO*


@teoliphant
512-222-5440
http://www.continuum.io
Travis Oliphant
2015-08-26 04:55:49 UTC
Permalink
Post by Travis Oliphant
Post by Travis Oliphant
Thanks for the write-up Nathaniel. There is a lot of great detail and
interesting ideas here.
<snip>
I think that summarizes my main concerns. I will write-up more forward
Post by Travis Oliphant
thinking ideas for what else is possible in the coming weeks. In the mean
time, thanks for keeping the discussion going. It is extremely exciting to
see the help people have continued to provide to maintain and improve
NumPy. It will be exciting to see what the next few years bring as well.
I think the only thing that looks even a little bit like a numpy 2.0 at
this time is dynd. Rewriting numpy, let alone producing numpy 2.0, is a
major project. Dynd is 2.5+ years old, 3500+ commits in, and still in
progress. If there is a decision to pursue Dynd I could support that, but
I think we would want to think deeply about how to make the transition as
painless as possible. It would be good at this point to get some feedback
from people currently using dynd. IIRC, part of the reason for starting
dynd was the perception that it was not possible to evolve numpy without
running into compatibility road blocks. Travis, could you perhaps summarize
the thinking that went into the decision to make dynd a separate project?
I think it would be best if Mark Wiebe speaks up here. I can explain why
Continuum supported DyND with some fraction of Mark's time for a few years
and give my perspective, but ultimately DyND is Mark's story to tell (and a
few talented people have now joined him in the effort). Mark Wiebe was a
productive NumPy developer. He was one of a few people that jumped in on
the code-base and made substantial and significant changes and came to
understand just how hard it can be to develop in the NumPy code-base.
He also is a C++ developer who really likes the beauty and power of that
language (which definitely biases his NumPy work, but he did put a lot of
effort into making NumPy better). Before Peter and I started Continuum,
Mark had begun the DyND project as an example of a general-purpose dynamic
array library that could be used by any dynamic language to make arrays.

In the early days of Continuum, we spent time from at least Mark W, Bryan
Van de Ven, Jay Bourque, and Francesc Alted looking at how to extend NumPy
to add 1) categorical data-types, 2) variable-length strings, and 3) better
date-time types.    Bryan, a good developer who has gone on to be a primary
developer of Bokeh, spent quite a bit of time and had a prototype of
categoricals *nearly* working. He did not like working on the NumPy
code-base "at all". He struggled with it and found it very difficult to
extend. He worked closely with Mark Wiebe who helped him the best he
could. What took him 4 weeks in NumPy took him 3 days in DyND to build.
I think that experience convinced him and Mark W both that working with the
NumPy code-base would take too long to make significant progress.

Also, during 2012 I was trying to help with release-management (though I
ended up just hiring Ondrej Certik to actually do the work and he did a
great job of getting a release of NumPy out the door --- thanks to much
help from many of you).      At that point I realized very clearly that
what I could best do was to try and get more resources for open source
and for the NumPy stack rather than work on the code directly.
We also did work with several clients that helped me realize just how
many disruptive changes had happened from 1.4 to 1.7 for extensive users of
NumPy (much more than would be justified from a "we don't break the ABI"
mantra that was the stated goal).

We also realized that the kind of experimentation we wanted to do in the
first 2 years of Continuum would just not be possible on the NumPy
code-base and the need for getting community buy-in on every decision would
slow us down too much --- as we had to iterate rapidly on so many things
and find our center as a startup. It also would not be fair to the NumPy
community. Our decision to do *all* of our exploration outside the
NumPy code base was basically 1) the kinds of changes we wanted ultimately
were potentially dramatic and disruptive, 2) it would be too difficult and
time-consuming to decide all things in public discussions with the NumPy
community --- especially when some things were experimental, 3) tying
ourselves to releases of NumPy would be difficult at that time, and 4) the
design of the NumPy code-base makes it difficult to contribute to --- both
Mark W and Bryan V felt they could make progress *much* faster in a new
code-base.

Continuum did not have enough start-up funding to devote significant time
on DyND in the early days. So Mark rallied what resources he could and
we supported him the best we could and he made progress. My only real
requirement with sponsoring his work when we did was that it must have a
Python interface that did not use Boost.    He stretched Cython and found a
lot of holes in it and that took a bit of his time as well.   I think he is
now a "just write your own wrapper" believer, but I shouldn't put words in
his mouth or digress. DyND became part of the Blaze effort once we
received DARPA money (the grant was primarily for Bokeh, but we also
received permission to use some of the funds for Numba and Blaze
development). Because of the other work around Numba and Blaze, DyND work
was delayed quite often.   For the Blaze project, DyND mostly became another
implementation of the data-shape data-description mechanism and a way to
prototype computed columns and remote arrays (now in Blaze server).


The Blaze team struggled for the first 18 months with the lack of a gelled
team and a concrete vision for what it should be exactly. Thanks to Andy
Terrel, Phillip Cloud, Mark Wiebe, and Matt Rocklin as well as others who
are currently on the project, Blaze is now much more clear in its goals as
a high-level array and table logical object for scientists,
data-scientists, and engineers that can be backed by larger-than-memory
(i.e. Dask) and cluster-based computational systems (i.e. Spark and
Impala). This clarity was not present as we looked for people to
collaborate with and explored the space of code-compilation, delayed
evaluation, and data-type-systems that are necessary and useful for
distributed array-systems generally. If you look today at Ibis and
the Bolt project you see other examples of what Blaze is.   I see massive
overlap between Blaze and these projects.     I think the description of
those projects can help you understand Blaze, which is why I mention them.

In that confusion, Mark continued to make progress on his C++-based
container-type (at one point we even called it "Blaze-local") that had the
advantage of not requiring a Python-runtime and could fully parse the
data-shape data-description system that is a generalization of NumPy dtypes
(some on Continuum time, some on his own time). Last year, he attracted
the attention of Irwin Zaid who added GPU-computation capability. Last
fall, Pandas was able to make DyND an optional dependency because DyND has
better support for some of the key things Pandas needs and does not require
the full NumPy API. In January, Mark W left Continuum to go back to work
in the digital effects industry on his old code-base though he continues to
take interest in DyND. A month ago, Continuum began to again sponsor Irwin
to work on DyND in order to continue its development at least sufficient to
support 1) Pandas and 2) processing of semi-structured data (like a
collection of JSON objects).

DyND is a bigger system than NumPy (as it doesn't rely on Python at all for
its core functionality). The Python-interface has not always been as up
to date as it could be and Irwin is currently working on that as well as
making it easier to install. I'm sure he would love the help if anyone
wants to join him.

At the same time in 2012, I became very enamored with Numba and the
potential for how Numba could make it possible to not even *have* to depend
on a single container library like NumPy.    I often say that if Numba and
Conda had existed 15 years ago, there would not even *be* a SciPy library.
Instead there would be a collection of numba-modules that do all the same
things.    We might not even have Julia either --- but that is a longer
and more controversial conversation.

With Numba you can write your own array-code as needed. We moved the basic
array-type into an llvm specification (llvm_array.py) in old llvm.py:
https://github.com/llvmpy/llvmpy/blob/master/llvm_array/array.py. (Note
that llvm.py is no longer maintained, though). At this point quite a bit
of the NumPy API is implemented outside of NumPy in Numba (there is still
much more to do, though). As Numba has developed, I have seen how *both*
DyND *and* Numba could independently be an architecture to underly a new
array abstraction that could effectively replace NumPy for people. A
combination of the two would be quite powerful -- especially when combined
now with Dask.
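
To make the "write your own array-code" point concrete, here is a minimal
sketch (assuming only that numpy and numba are installed; the kernel and
its cutoff are invented purely for illustration):

    import numpy as np
    from numba import vectorize, float64

    # A ufunc-like kernel compiled at run time -- no hand-written C loop
    # inside NumPy is needed.
    @vectorize([float64(float64, float64)])
    def clipped_add(a, b):
        s = a + b
        return s if s < 1.0 else 1.0

    x = np.linspace(0.0, 1.0, 5)
    print(clipped_add(x, x))   # elementwise dispatch and broadcasting for free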

Numba needs 2 things presently before I can confidently say that a numpy
module could be built that is fully backwards API compatible with current
NumPy in about 6 months (though not necessarily semantically in all corner
cases). These 2 things are currently on the near-term Numba road-map: 1)
the ability to ship a Python extension module that does not require numba
to be installed, and 2) jit-classes (so that you can build native-classes
and have that be part of the type-specification).
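
For illustration only, a rough sketch of the kind of jit-class being
described; the decorator name, import path, and spec format here are
assumptions for the example, not a settled Numba API:

    import numpy as np
    from numba import float64
    from numba.experimental import jitclass   # assumed spelling/location

    spec = [('data', float64[:])]              # the native type-specification

    @jitclass(spec)
    class RunningSum:
        def __init__(self, data):
            self.data = data

        def total(self):
            # compiled method operating on the typed member
            s = 0.0
            for i in range(self.data.shape[0]):
                s += self.data[i]
            return s

    print(RunningSum(np.arange(4.0)).total())   # 6.0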

So, basically you have 2 additional options for NumPy's future besides what
Nathaniel laid out:   1) DyND-based or 2) Numba-based.     A combination of
the two (DyND for a pre-compiled run-time library and Numba for JIT
extensions) is a natural corollary.

A third approach has even more potential to super-charge Python 3.X
for array-oriented programming. This approach could also be combined
with DyND and/or Numba as desired. This approach is to use the fact that
the buffer protocol in Python exists and therefore we *can* have more than
one array-type. In fact, the basic array-structure exists as the
memory-view object in Python (rescued from its unfinished form by Antoine
and now supported in Cython).   The main problems with it as an underlying
array-type for computation are that 1) its type-system is the low-level
struct-string syntax, which is hard to build on, and 2) there are no basic
computations on memory-views.    These are both easily remedied.    So, the
approach would be
to:

1) build a Python-type-to-struct-string syntax translator that would allow
you to create memory-views from a Python-based type-system that replaces
dtype
2) make a new gufunc sub-system that works with memory-views as
containers.  I think this would be an interesting project in its own right
and could borrow from current NumPy a great deal --- I think it would be
simpler than the re-factor of gufuncs that Nathaniel proposes to enable
dtype-information to be available to the low-level multi-methods.
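
To make idea 1) concrete, a toy standard-library-only sketch; the helper
name and the tiny type vocabulary are made up for illustration, and a real
system would need the full PEP 3118 grammar:

    import struct

    # invented mapping from a tiny Python-level vocabulary to PEP 3118 codes
    _PY_TO_STRUCT = {int: 'q', float: 'd', bool: '?'}

    def to_struct_string(spec):
        # a list of (name, type) pairs becomes a PEP 3118 record: T{fmt:name:...}
        if isinstance(spec, list):
            return 'T{' + ''.join('%s:%s:' % (_PY_TO_STRUCT[t], n)
                                  for n, t in spec) + '}'
        return _PY_TO_STRUCT[spec]

    print(to_struct_string([('x', float), ('y', int)]))   # T{d:x:q:y:}

    # a flat float64 buffer viewed through the translated scalar format
    buf = bytearray(struct.pack('3d', 1.0, 2.0, 3.0))
    mv = memoryview(buf).cast(to_struct_string(float))
    print(mv.tolist())                                     # [1.0, 2.0, 3.0]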

You can basically eliminate NumPy with something that provides those 2
things --- and that is potentially something you could rally PyPy and
Jython and any other Python implementation behind (rather than numpypy
and/or numpy4j). If anyone is interested in pursuing this last idea,
please let me know. It hit me like a brick at PyCon this year after
talking with Nathaniel about what he wanted to do with dtypes and watching
Guido's talk on type-hinting now in Python 3.

Finally, as I've been thinking more and more about *big* data and the needs
of scaling, I've toned-down my infatuation with "typed pointers" (which
NumPy basically is). The real value of "typed pointers" is that there is
so much low-level code out there that does interesting things that use
"typed pointers" for their basic shared abstraction. However, what we
really need shared abstractions around are "typed iterators" and a whole
lot of code that uses these "typed iterators" for all kinds of
calculations. The problem is that there is no C-ABI equivalent for typed
iterators. Where is the BLAS or LAPACK for typed-iterators that doesn't
rely on a particular C++ compiler to get the memory-layout?    Every
language stack implements iterators in their own way --- so you have silos
and not shared abstractions across run-times. The NumPy stack on
typed-iterators is now a *whole lot* harder to build. This is part of
why I want to see jit-classes on Numba -- I want to end up with a defined
ABI for abstractions.
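
As a rough picture of what a "typed iterator" abstraction could look like,
here is a small invented sketch (the protocol and names are illustrative,
not an existing API): the shared contract is "an element type plus a stream
of chunks", not a raw pointer.

    import numpy as np

    class TypedChunkIterator:
        """A producer: advertises its element type and yields typed chunks."""
        def __init__(self, chunks, dtype):
            self.dtype = np.dtype(dtype)
            self._chunks = chunks

        def __iter__(self):
            for c in self._chunks:
                yield np.asarray(c, dtype=self.dtype)

    def total(it):
        """A consumer written only against the typed-iterator contract."""
        acc = it.dtype.type(0)
        for chunk in it:
            acc += chunk.sum()
        return acc

    print(total(TypedChunkIterator([[1, 2], [3.5]], 'float64')))   # 6.5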

Abstractions are great. Shared abstractions can be *viral* and are
exponentially better. We need more of those! My plea to anyone reading
this is: Please make more shared abstractions ;-) Of course no one person
can make a shared abstraction --- they have to emerge! One person can
make abstractions though --- and that is the pre-requisite to getting them
adopted by others and therefore shared.

I know this is a dump of a lot of information. Some of it might even make
sense and perhaps a little bit might be useful to some of you.

Now for a blatant plea -- if you are interested in working on NumPy (with
ideas from whatever source --- not just mine), please talk to me --- we are
hiring and I can arrange for some of your time to be spent contributing to
any of these ideas (including what Nathaniel wrote about --- as long as we
plan for ABI breakage). Guido offered this for Python, and I will offer
it for NumPy --- if you are a woman with the right background I will
personally commit to training you to be able to work more on NumPy. But,
be warned, working on NumPy is not the path to riches and fame is fleeting
;-)

Best,

-Travis
Post by Travis Oliphant
<snip>
Chuck
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
*Travis Oliphant*
*Co-founder and CEO*


@teoliphant
512-222-5440
http://www.continuum.io
Nathaniel Smith
2015-08-26 06:41:16 UTC
Permalink
Hi Travis,

Thanks for taking the time to write up your thoughts!

I have many thoughts in return, but I will try to restrict myself to two
main ones :-).

1) On the question of whether work should be directed towards improving
NumPy-as-it-is or instead towards a compatibility-breaking replacement:
There's plenty of room for debate about whether it's better engineering
practice to try and evolve an existing system in place versus starting
over, and I guess we have some fundamental disagreements there, but I
actually think this debate is a distraction -- we can agree to disagree,
because in fact we have to try both.

At a practical level: NumPy *is* going to continue to evolve, because it
has users and people interested in evolving it; similarly, dynd and other
alternative libraries will also continue to evolve, because they also have
people interested in doing it. And at a normative level, this is a good
thing! If NumPy and dynd both get better, then that's awesome: the worst
case is that NumPy adds the new features that we talked about at the
meeting, and dynd simultaneously becomes so awesome that everyone wants to
switch to it, and the result of this would be... that those NumPy features
are exactly the ones that will make the transition to dynd easier. Or if
some part of that plan goes wrong, then well, NumPy will still be there as
a fallback, and in the mean time we've actually fixed the major pain points
our users are begging us to fix.

You seem to be urging us all to make a double-or-nothing wager that your
extremely ambitious plans will all work out, with the entire numerical
Python ecosystem as the stakes. I think this ambition is awesome, but maybe
it'd be wise to hedge our bets a bit?

2) You really emphasize this idea of an ABI-breaking (but not API-breaking)
release, and I think this must indicate some basic gap in how we're looking
at things. Where I'm getting stuck here is that... I actually can't think
of anything important that we can't do now, but could if we were allowed to
break ABI compatibility. The kinds of things that break ABI but keep API
are like... rearranging what order the fields in a struct fall in, or
changing the numeric value of opaque constants like NPY_ARRAY_WRITEABLE.
The biggest win I can think of is that we could save a few bytes per array
by arranging the fields inside the ndarray struct more optimally, but
that's hardly a feature to hang a 2.0 on. You seem to have a vision of this
ABI-breaking release as being something very different from that, and I'm
not clear on what this vision is.
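
For anyone following along: a tiny illustration, using ctypes purely as a
stand-in (these are not NumPy's actual structs), of why reordering struct
fields breaks ABI but not API:

    import ctypes

    class ArrayV1(ctypes.Structure):
        _fields_ = [("nd", ctypes.c_int), ("flags", ctypes.c_int),
                    ("data", ctypes.c_void_p)]

    class ArrayV2(ctypes.Structure):          # same fields, different order
        _fields_ = [("data", ctypes.c_void_p), ("nd", ctypes.c_int),
                    ("flags", ctypes.c_int)]

    # Source code using the field *names* recompiles fine (API unchanged),
    # but an already-compiled extension has the old *offsets* baked in:
    print(ArrayV1.flags.offset, ArrayV2.flags.offset)   # different offsets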

The main reason I personally am against having a big ABI-breaking release
is not that I hate ABI breakage a priori, it's that all the big features
that I care about and that users are asking for seem to be ones that...
don't actually require doing that. At most they seem to get a mild benefit
from breaking some obscure corner cases. So the cost/benefits don't make
any sense to me.

So: can you give a concrete example of a change you have in mind where
breaking ABI would be the key enabler?

(I guess you might also be thinking of a separate issue that you sort of
allude to: Perhaps we will try to make changes which we think don't involve
breaking the ABI, but discover too late that we have failed to fully
understand the implications and have broken it by mistake. IIUC this is
what happened in the 1.4 timeframe when datetime64 was merged and
accidentally renumbered some of the NPY_* constants.

Partially I am less worried about this because I have a fair amount of
confidence that our review and QA process has improved these days to the
point that we would not let a change like that slip through by accident --
we have a lot more active reviewers, people are sensitized to the issues,
we've successfully landed intrusive changes like Sebastian's indexing
rewrite, ... though this is very much second-hand impressions on my part,
and I'd welcome input from folks like Chuck who have a clearer view on how
things have changed from then to now.

But more importantly, even if this is true, then I can't see how your
proposal helps. If we aren't good enough at our jobs to predict when we'll
break ABI, then by assumption it makes no sense to pick one release and
decide that this is the one time that we'll break ABI.)
Post by Travis Oliphant
Thanks for the write-up Nathaniel. There is a lot of great detail and
interesting ideas here.
I am very eager to understand how to help NumPy and the wider community
move forward however I can (my passions on this have not changed since
1999, though what I myself spend time on has changed).
There are a lot of ways to think about approaching this, though. It's
hard to get all the ideas on the table, and it was unfortunate we couldn't
get everybody who is a core NumPy dev together in person to have this
discussion as there are still a lot of questions unanswered and a lot of
thought that has gone into other approaches that was not brought up or
represented in the meeting (how does Numba fit into this, what about
data-shape, dynd, memory-views and Python type system, etc.). If NumPy
becomes just an interface-specification, then why don't we just do that
*outside* NumPy itself in a way that doesn't jeopardize the stability of
NumPy today?    These are some of the real questions I have.     I will try
to write up my thoughts in more depth soon, but I won't be able to respond
in-depth right now. I just wanted to comment because Nathaniel said I
disagree which is only partly true.
The three most important things for me are 1) let's make sure we have
representation from as wide a cross-section of the community as possible (this is really
hard), 2) let's look around at the broader community and the prior art that
is happening in this space right now and 3) let's not pretend we are going
to be able to make all this happen without breaking ABI compatibility.
Let's just break ABI compatibility with NumPy 2.0 *and* have as much
fidelity with the API and semantics of current NumPy as possible (though
there will be some changes necessary long-term).
I don't think we should intentionally break ABI if we can avoid it, but I
also don't think we should spend inordinate amounts of time trying to
pretend that we won't break ABI (for at least some people), and most
importantly we should not pretend *not* to break the ABI when we actually
do. We did this once before with the roll-out of date-time, and it was
really unnecessary.     When I released NumPy 1.0, there were several
things that I knew should be fixed very soon (NumPy was never designed to
not break ABI).    Those problems are still there.     Now that we have
quite a bit better understanding of what NumPy *should* be (there have been
tremendous strides in understanding and community size over the past 10
years), let's actually make the infrastructure we think will last for the
next 20 years (instead of trying to shoe-horn new ideas into a 20-year old
code-base that wasn't designed for it).
NumPy is a hard code-base. It has been since Numeric days in 1995. I
could be wrong, but my guess is that we will be passed by as a community if
we don't seize the opportunity to build something better than we can build
if we are forced to use a 20 year old code-base.
It is more important to not break people's code and to be clear when a
re-compile is necessary for dependencies. Those to me are the most
important constraints. There are a lot of great ideas that we all have
about what we want NumPy to be able to do.    Some of these are pretty
transformational (and the more exciting they are, the harder I think they
are going to be to implement without breaking at least the ABI). There
is probably some CAP-like theorem around
Stability-Features-Speed-of-Development (pick 2) when it comes to Open
Source Software development and making feature-progress with NumPy *is
going* to create instability, which concerns me.
I would like to see a little-bit-of-pain one time with a NumPy 2.0, rather
than the constant pain of a constant-churn-over-many-years approach that
Nathaniel seems to advocate.       To me NumPy 2.0 is an ABI-breaking
release that is as API-compatible as possible and whose semantics are not
dramatically different.
There are at least 3 areas of compatibility (ABI, API, and semantic).
ABI-compatibility is a non-feature in today's world. There are so many
distributions of the NumPy stack (and conda makes it trivial for anyone to
build their own or for you to build one yourself). Making less-optimal
software-engineering choices because of fear of breaking the ABI is not
something I'm supportive of at all. We should not break ABI every
release, but a release every 3 years that breaks ABI is not a problem.
API compatibility should be much more sacrosanct, but it is also something
that can be managed.    Any NumPy 2.0 should definitely support the
full NumPy API (though there could be deprecated swaths). I think the
community has done well in using deprecation and limiting the public API to
make this more manageable and I would love to see a NumPy 2.0 that
solidifies a future-oriented API along with a backward-compatible API that
is also available.
Semantic compatibility is the hardest. We have already broken this on
multiple occasions throughout the 1.x NumPy releases. Every time you
change the code, this can change.    This is what I fear will cause deep
instability over the course of many years. These are things like the
casting rule details, the effect of indexing changes, any change to the
calculations approaches. It is and has been the most at risk during any
code-changes. My view is that a NumPy 2.0 (with a new low-level
architecture) minimizes these changes to a single release rather than
unavoidably spreading them out over many, many releases.
I think that summarizes my main concerns. I will write-up more forward
thinking ideas for what else is possible in the coming weeks. In the mean
time, thanks for keeping the discussion going. It is extremely exciting to
see the help people have continued to provide to maintain and improve
NumPy. It will be exciting to see what the next few years bring as well.
Best,
-Travis
Post by Nathaniel Smith
Hi all,
These are the notes from the NumPy dev meeting held July 7, 2015, at
the SciPy conference in Austin, presented here so the list can keep up
with what happens, and so you can give feedback. Please do give
feedback, none of this is final!
(Also, if anyone who was there notices anything I left out or
mischaracterized, please speak up -- these are a lot of notes I'm
trying to gather together, so I could easily have missed something!)
Thanks to Jill Cowan and the rest of the SciPy organizers for donating
space and organizing logistics for us, and to the Berkeley Institute
for Data Science for funding travel for Jaime, Nathaniel, and
Sebastian.
Attendees
=========
Present in the room for all or part: Daniel Allan, Chris Barker,
Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
pretty sure this list is incomplete)
Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
Formalizing our governance/decision making
==========================================
This was a major focus of discussion. At a high level, the consensus
was to steal IPython's governance document ("IPEP 29") and modify it
to remove its use of a BDFL as a "backstop" to normal community
consensus-based decision, and replace it with a new "backstop" based
on Apache-project-style consensus voting amongst the core team.
I'll send out a proper draft of this shortly for further discussion.
Development roadmap
===================
Let's assume NumPy is going to remain important indefinitely, and
try to make it better, instead of waiting for something better to
come along. (This is unlikely to be wasted effort even if something
better does come along, and it's hardly a sure thing that that will
happen anyway.)
Let's focus on evolving numpy as far as we can without major
break-the-world changes (no "numpy 2.0", at least in the foreseeable
future).
And, as a target for that evolution, let's change our focus from
numpy as "NumPy is the library that gives you the np.ndarray object
(plus some attached infrastructure)", to "NumPy provides the
standard framework for working with arrays and array-like objects in
Python"
This means, creating defined interfaces between array-like objects /
ufunc objects / dtype objects, so that it becomes possible for third
parties to add their own and mix-and-match. Right now ufuncs are
pretty good at this, but if you want a new array class or dtype then
in most cases you pretty much have to modify numpy itself.
Vision: instead of everyone who wants a new container type having to
reimplement all of numpy, Alice can implement an array class using
(sparse / distributed / compressed / tiled / gpu / out-of-core /
delayed / ...) storage, pass it to code that was written using
direct calls to np.* functions, and it just works. (Instead of
np.sin being "the way you calculate the sine of an ndarray", it's
"the way you calculate the sine of any array-like container
object".)
Vision: Darryl can implement a new dtype for (categorical data /
astronomical dates / integers-with-missing-values / ...) without
having to touch the numpy core.
Vision: Chandni can then come along and combine them by doing
a = alice_array([...], dtype=darryl_dtype)
and it just works.
Vision: no-one is tempted to subclass ndarray, because anything you
can do with an ndarray subclass you can also easily do by defining
your own new class that implements the "array protocol".
Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
API right there.
- Go through the rest of the stuff in numpy, and figure out some
- ufunc ALL the things: Some things can be converted directly into
(g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
things could be converted into (g)ufuncs if we extended the
(g)ufunc interface a bit (e.g. np.sort, np.matmul).
- Some things probably need their own __numpy_ufunc__-like
extensions (__numpy_concatenate__?)
- Provide tools to make it easier to implement the more complicated
parts of an array object (e.g. the bazillion different methods,
many of which are ufuncs in disguise, or indexing)
- Longer-run interesting research project: __numpy_ufunc__ requires
that one or the other object have explicit knowledge of how to
handle the other, so to handle binary ufuncs with N array types
you need something like N**2 __numpy_ufunc__ code paths. As an
alternative, if there were some interface that an object could
export that provided the operations nditer needs to efficiently
iterate over (chunks of) it, then you would only need N
implementations of this interface to handle all N**2 operations.
- blosc
- dask
- distarray
- numpy.ma
- pandas
- scipy.sparse
- xray
Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We already have something like a C level "dtype
protocol". Conceptually, the way you define a new dtype is by
defining a new class whose instances have data attributes defining
the parameters of the dtype (what fields are in *this* record dtype,
how many characters are in *this* string dtype, what units are used
for *this* datetime64, etc.), and you define a bunch of methods to
do things like convert an object from a Python object to your dtype
or vice-versa, to copy an array of your dtype from one place to
another, to cast to and from your new dtype, etc. This part is
great.
The problem is, in the current implementation, we don't actually use
the Python object system to define these classes / attributes /
methods. Instead, all possible dtypes are jammed into a single
Python-level class, whose struct has fields for the union of all
possible dtype's attributes, and instead of Python-style method
slots there's just a big table of function pointers attached to each
object.
So the main proposal is that we keep the basic design, but switch it
so that the float64 dtype, the int64 dtype, etc. actually literally
are subclasses of np.dtype, each implementing their own fields and
Python-style methods.
- The current dtype methods should be cleaned up -- e.g. 'dot' and
'less_than' are both dtype methods, when conceptually they're much
more like ufuncs.
- The ufunc inner-loop interface currently does not get a reference
to the dtype object, so they can't see its attributes and this is
a big obstacle to many interesting dtypes (e.g., it's hard to
implement np.equal for categoricals if you don't know what
categories each has). So we need to add new arguments to the core
ufunc loop signature. (Fortunately this can be done in a
backwards-compatible way.)
- We need to figure out what exactly the dtype methods should be,
and add them to the dtype class (possibly with backwards
compatibility shims for anyone who is accessing PyArray_ArrFuncs
directly).
- Casting will be possibly the trickiest thing to work out, though
the basic idea of using dunder-dispatch-like __cast__ and
__rcast__ methods seems workable. (Encouragingly, this is also
exactly what dynd also does, though unfortunately dynd does not
yet support user-defined dtypes even to the extent that numpy
does, so there isn't much else we can steal from them.)
- We may also want to rethink the casting rules while we're at it,
since they have some very weird corners right now (e.g. see
[https://github.com/numpy/numpy/issues/6240])
- We need to migrate the current dtypes over to the new system,
- First stick them all in a single "legacy dtype" class whose
methods just dispatch to the PyArray_ArrFuncs per-object "method
table"
- Then move each of them into their own classes
- We should provide a Python-level wrapper for the protocol, so that
you can call dtype methods from Python
- And vice-versa, it should be possible to subclass dtype at the
Python level
- etc.
Fortunately, AFAICT pretty much all of this can be done while
maintaining backwards compatibility (though we may want to break
some obscure cases to avoid expending *too* much effort with weird
backcompat contortions that will only help a vanishingly small
proportion of the userbase), and a lot of the above changes can be
done as semi-independent mini-projects, so there's no need for some
branch to go off and spend a year rewriting the world.
Obviously there are still a lot of details to work out, though. But
overall, there was widespread agreement that this is one of the #1
pain points for our users (e.g. it's the single main request from
pandas), and fixing it is very high priority.
Some features that would become straightforward to implement
- missing value support
- physical unit tracking (meters / seconds -> array of velocity;
meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes
with attached timezones, or using funky geophysical or
astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that
computes f(x) where x is an array of float64; pass that function
an array with a special dtype and get out both f(x) and f'(x))
- probably others I'm forgetting right now
I should also note that there was one substantial objection to this
plan, from Travis Oliphant (in discussions later in the
conference). I'm not confident I understand his objections well
enough to reproduce them here, though -- perhaps he'll elaborate.
Money
=====
There was an extensive discussion on the topic of: "if we had money,
what would we do with it?"
This is partially motivated by the realization that there are a
number of sources that we could probably get money from, if we had a
good story for what we wanted to do, so it's not just an idle
question.
- Doing the in-person meeting was a good thing. We should plan do
that again, at least once a year. So one thing to spend money on
is travel subsidies to make sure that happens and is productive.
- While it's tempting to imagine hiring junior people for the more
frustrating/boring work like maintaining buildbots, release
infrastructure, updating docs, etc., this seems difficult to do
realistically with our current resources -- how do we hire for
this, who would manage them, etc.?
- On the other hand, the general feeling was that if we found the
money to hire a few more senior people who could take care of
themselves more, then that would be good and we could
realistically absorb that extra work without totally unbalancing
the project.
- A major open question is how we would recruit someone for a
position like this, since apparently all the obvious candidates
who are already active on the NumPy team already have other
things going on. [For calibration on how hard this can be: NYU
has apparently had an open position for a year with the job
description of "come work at NYU full-time with a
private-industry-competitive-salary on whatever your personal
open-source scientific project is" (!) and still is having an
[http://cds.nyu.edu/research-engineer/]]
- General consensus though was that there isn't much to be done
about this though, except try it and see.
- (By the way, if you're someone who's reading this and
potentially interested in like a postdoc or better working on
numpy, then let's talk...)
More specific changes to numpy that had general consensus, but don't
really fit into a high-level roadmap
=========================================================================================================
- Resolved: we should merge multiarray.so and umath.so into a single
extension module, so that they can share utility code without the
current awkward contortions.
- Resolved: we should start hiding new fields in the ufunc and dtype
structs as soon as possible going forward. (I.e. they would not be
present in the version of the structs that are exposed through the
C API, but internally we would use a more detailed struct.)
- Mayyyyyybe we should even go ahead and hide the subset of the
existing fields that are really internal details that no-one
should be using. If we did this without changing anything else
then it would preserve ABI (the fields would still be where
existing compiled extensions expect them to be, if any such
extensions exist) while breaking API (trying to compile such
extensions would give a clear error), so would be a smoother
ramp if we think we need to eventually break those fields for
real. (As discussed above, there are a bunch of fields in the
dtype base class that only make sense for specific dtype
subclasses, e.g. only record dtypes need a list of field names,
but right now all dtypes have one anyway. So it would be nice to
remove these from the base class entirely, but that is
potentially ABI-breaking.)
- Resolved: np.array should never return an object array unless
explicitly requested (e.g. with dtype=object); it just causes too
many surprising problems.
- First step: add a deprecation warning
- Eventually: make it an error.
- The matrix class
- Resolved: We won't add warnings yet, but we will prominently
document that it is deprecated and should be avoided where-ever
possible.
- Stéfan van der Walt volunteers to do this.
- We'd all like to deprecate it properly, but the feeling was that
the precondition for this is for scipy.sparse to provide sparse
"arrays" that don't return np.matrix objects on ordinary
operations. Until that happens we can't reasonably tell people
that using np.matrix is a bug.
- Resolved: we should add a similar prominent note to the
"subclassing ndarray" documentation, warning people that this is
painful and barely works and please don't do it if you have any
alternatives.
- Resolved: we want more, smaller releases -- every 6 months at
least, aiming to go even faster (every 4 months?)
- On using Cython inside numpy:
- Everyone agrees that there are places where this would be an
improvement (e.g., Python<->C interfaces, and places "when you
want to do computer science", e.g. complicated algorithmic stuff
like graph traversals)
- Chuck wanted it to be clear though that he doesn't think it
would be a good goal to try and rewrite all of numpy in Cython
-- there also exist places where Cython ends up being "an uglier
version of C". No-one disagreed.
- Our text reader is apparently not very functional on Python 3, and
generally slow and hard to work with.
- Resolved: We should extract Pandas's awesome text reader/parser
and convert it into its own package, that could then become a
new backend for both pandas and numpy.loadtxt.
- Jeff thinks this is a great idea
- Thomas Caswell volunteers to do the extraction.
- We should work on improving our tools for evolving the ABI, so
that we will eventually be less constrained by decisions made
decades ago.
- One idea that had a lot of support was to switch from our
current append-only C-API to a "sliding window" API based on
explicit versions. So a downstream package might say
#define NUMPY_API_VERSION 4
and they'd get the functions and behaviour provided in "version
4" of the numpy C api. If they wanted to get access to new stuff
that was added in version 5, then they'd need to switch that
#define, and at the same time clean up any usage of stuff that
was removed or changed in version 5. And to provide a smooth
migration path, one version of numpy would support multiple
versions at once, gradually deprecating and dropping old
versions.
- If anyone wants to help bring pip up to scratch WRT tracking ABI
dependencies (e.g., 'pip install numpy==<version with new ABI>'
-> triggers rebuild of scipy against the new ABI), then that
would be an extremely useful thing.
Policies that should be documented
==================================
...together with some notes about what the contents of the document should be.
How we manage bugs in the bug tracker.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Github "milestones" should *only* be assigned to release-blocker
bugs (which mostly means "regression from the last release").
In particular, if you're tempted to push a bug forward to the next
release... then it's clearly not a blocker, so don't set it to the
next release's milestone, just remove the milestone entirely.
(Obvious exception to this: deprecation followup bugs where we
decide that we want to keep the deprecation around a bit longer
are a case where a bug actually does switch from being a blocker
for release 1.x to being a blocker for release 1.(x+1).)
- Don't hesitate to close an issue if there's no way forward --
e.g. a PR where the author has disappeared. Just post a link to
this policy and close, with a polite note that we need to keep our
tracker useful as a todo list, but they're welcome to re-open if
things change.
Deprecations and breakage policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- How long do we need to keep DeprecationWarnings around before we
break things? This is tricky because on the one hand an aggressive
(short) deprecation period lets us deliver new features and
important cleanups more quickly, but on the other hand a
too-aggressive deprecation period is difficult for our more
conservative downstream users.
- Idea that had the most support: pick a somewhat-aggressive
warning period as our default, and make a rule that if someone
asks for an extension during the beta cycle for the release that
removes it, then we put it back for another release or two worth
of grace period. (While also possibly upgrading the warning to
be more visible during the grace period.) This gives us
deprecation periods that are more adaptive on a case-by-case
basis.
- Lament: it would be really nice if we could get more people to
test our beta releases, because in practice right now 1.x.0 ends
up being where we actually discover all the bugs, and 1.x.1 is
where it actually becomes usable. Which sucks, and makes it
difficult to have a solid policy about what counts as a
regression, etc. Is there anything we can do about this?
- ABI breakage: we distinguish between an ABI break that breaks
everything (e.g., "import scipy" segfaults), versus an ABI break
that breaks an occasional rare case (e.g., only apps that poke
around in some obscure corner of some struct are affected).
- The "break-the-world" type remains off-limit for now: the pain
is still too large (conda helps, but there are lots of people
who don't use conda!), and there aren't really any compelling
improvements that this would enable anyway.
- For the "break-0.1%-of-users" type, it is *not* ruled out by
fiat, though we remain conservative: we should treat it like
other API breaks in principle, and do a careful case-by-case
analysis of the details of the situation, taking into account
what kind of code would be broken, how common these cases are,
how important the benefits are, whether there are any specific
mitigation strategies we can use, etc. -- with this process of
course taking into account that a segfault is nastier than a
Python exception.
Other points that were discussed
================================
- There was inconclusive discussion of what we should do with dot()
in the places where it disagrees with the PEP 465 matmul semantics
(specifically this is when both arguments have ndim >= 3, or one
argument has ndim == 0).
- The concern is that the current behavior is not very useful, and
as far as we can tell no-one is using it; but, as people get
used to the more-useful PEP 465 behavior, they will increasingly
try to use it on the assumption that np.dot will work the same
way, and this will create pain for lots of people. So Nathaniel
argued that we should start at least issuing a visible warning
when people invoke the corner-case behavior.
- But OTOH, np.dot is such a core piece of infrastructure, and
there's such a large landscape of code out there using numpy
that we can't see, that others were reasonably wary of making
any change.
- For now: document prominently, but no change in behavior.
Links to raw notes
==================
[https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
[
https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing
]
[
https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp
]
[
https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit
]
-n
--
Nathaniel J. Smith -- http://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
*Travis Oliphant*
*Co-founder and CEO*
@teoliphant
512-222-5440
http://www.continuum.io
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Nathaniel J. Smith -- http://vorpus.org
Travis Oliphant
2015-08-26 14:06:14 UTC
Permalink
Post by Nathaniel Smith
Hi Travis,
Thanks for taking the time to write up your thoughts!
I have many thoughts in return, but I will try to restrict myself to two
main ones :-).
1) On the question of whether work should be directed towards improving
There's plenty of room for debate about whether it's better engineering
practice to try and evolve an existing system in place versus starting
over, and I guess we have some fundamental disagreements there, but I
actually think this debate is a distraction -- we can agree to disagree,
because in fact we have to try both.
Yes, on this we agree. I think NumPy can improve *and* we can have new
innovative array objects. I don't disagree about that.
Post by Nathaniel Smith
At a practical level: NumPy *is* going to continue to evolve, because it
has users and people interested in evolving it; similarly, dynd and other
alternative libraries will also continue to evolve, because they also have
people interested in doing it. And at a normative level, this is a good
thing! If NumPy and dynd both get better, then that's awesome: the worst
case is that NumPy adds the new features that we talked about at the
meeting, and dynd simultaneously becomes so awesome that everyone wants to
switch to it, and the result of this would be... that those NumPy features
are exactly the ones that will make the transition to dynd easier. Or if
some part of that plan goes wrong, then well, NumPy will still be there as
a fallback, and in the mean time we've actually fixed the major pain points
our users are begging us to fix.
You seem to be urging us all to make a double-or-nothing wager that your
extremely ambitious plans will all work out, with the entire numerical
Python ecosystem as the stakes. I think this ambition is awesome, but maybe
it'd be wise to hedge our bets a bit?
You are mis-characterizing my view. I think NumPy can evolve (though I
would personally rather see a bigger change to the underlying system like I
outlined before). But, I don't believe it can even evolve easily in the
direction needed without breaking ABI and that insisting on not breaking it
or even putting too much effort into not breaking it will continue to
create less-optimal solutions that are harder to maintain and do not take
advantage of knowledge this community now has.

I'm also very concerned that 'evolving' NumPy will create a situation where
there are regular semantic and subtle API changes that will cause NumPy to
be less stable for its user-base.   I've watched this happen.  This is at a
time when people are already looking around for new and different
approaches anyway.
Post by Nathaniel Smith
2) You really emphasize this idea of an ABI-breaking (but not
API-breaking) release, and I think this must indicate some basic gap in how
we're looking at things. Where I'm getting stuck here is that... I actually
can't think of anything important that we can't do now, but could if we
were allowed to break ABI compatibility. The kinds of things that break ABI
but keep API are like... rearranging what order the fields in a struct fall
in, or changing the numeric value of opaque constants like
NPY_ARRAY_WRITEABLE. The biggest win I can think of is that we could save a
few bytes per array by arranging the fields inside the ndarray struct more
optimally, but that's hardly a feature to hang a 2.0 on. You seem to have a
vision of this ABI-breaking release as being something very different from
that, and I'm not clear on what this vision is.
We already broke the ABI with date-time changes --- it's still broken for a
certain percentage of users last I checked. So, part of my disagreement
is that we've tried this and it didn't work --- even though smart people
thought it would. I've had to deal with this personally and I'm not
enthusiastic about having to deal with this for the next 5 years because of
even more attempts to make changes while not breaking the ABI. I think
the group is more careful now --- but I still think the API is broad enough
and uses of NumPy deep enough that the effort involved in trying not to
break the ABI is just not worth the effort (because it's a non-feature
today). Adding new dtypes without breaking the ABI is tricky (and to do
it without breaking the ABI is ugly). I also continue to believe that
putting out a new ABI-breaking NumPy will allow re-compiling *once* (with
some porting changes needed) rather than subtle breakages requiring
code-changes every time a release is made.      If subtle changes aren't
made, then the new features won't come. Right now, I'd rather have
stability from NumPy than new features. New features can come from other
libraries.

One specific change that could easily be made in NumPy 2.0 (the current
code but with an ABI change) is that Dtypes should become true type objects
and array-scalars (which are the current type-objects) should become
instances of those dtypes. That is the biggest clean-up needed, I think on
the array-front. There should not be *both* array-scalars and dtype
objects. They are the same thing fundamentally. It was a mistake to
have both of them. I don't see how to make that change without breaking
the ABI. Perhaps it could be done in a creative way --- but why put the
effort into that and end up with an even more hacky code-base.
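
To illustrate the shape of that change, a pure-Python mock-up (not proposed
NumPy code, just the idea): each dtype becomes a real class, and the
array-scalar is simply an instance of it.

    class DType(type):
        """Metaclass: every concrete dtype is itself a class."""
        itemsize = None

    class float64(metaclass=DType):
        itemsize = 8
        def __init__(self, value):
            self.value = float(value)

    x = float64(3.5)            # an "array scalar" is just an instance
    print(type(x) is float64)   # its dtype literally *is* its type
    print(type(x).itemsize)     # dtype parameters live on the class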

NumPy's ABI was influenced by and evolved from Numeric and Numarray. It
was not "designed" to last 30 years.

I think the dtype "types" should potentially have different
member-structures.     The ufunc sub-system needs an overhaul --- its
member structures need upgrades. With generalized ufuncs and the
iteration protocols of Mark Wiebe we know a whole lot more about ufuncs
now. Ufuncs are the same 1995 structure that Jim Hugunin wrote. I
suppose you *could* just tack new functions on the end of the structure and
keep growing the list (while leaving old, unused structures as unused or
deprecated) --- or you can take the opportunity to tidy up a bit. The
longer you leave everything the same, the harder you make the code-base and
the more costly maintenance becomes. I just don't see the value there
--- and I see a lot of pain.

Regarding the ufunc subsystem. We've argued before about the lack of
multi-methods in NumPy.    Continuing to add dunder-methods to try and get
around it will continue to make the system harder to maintain and more
brittle.

You mention making NumPy an interface to multiple things along with many
other ideas. I don't believe you can get there without real changes that
break things (at the very least semantic changes). I'm not excited about
those changes causing instability (which they will cause --- to me the
burden of proof that they won't is on you who wants to make the change and
not on me to say how they will). I also think it will take much
longer to get there incrementally (if at all) than just creating something
on top of newer ideas.
Post by Nathaniel Smith
The main reason I personally am against having a big ABI-breaking release
is not that I hate ABI breakage a priori, it's that all the big features
that I care about and that users are asking for seem to be ones that...
don't actually require doing that. At most they seem to get a mild benefit
from breaking some obscure corner cases. So the cost/benefits don't make
any sense to me.
So: can you give a concrete example of a change you have in mind where
breaking ABI would be the key enabler?
(I guess you might also be thinking of a separate issue that you sort of
allude to: Perhaps we will try to make changes which we think don't involve
breaking the ABI, but discover too late that we have failed to fully
understand the implications and have broken it by mistake. IIUC this is
what happened in the 1.4 timeframe when datetime64 was merged and
accidentally renumbered some of the NPY_* constants.
Yes, this is what I'm mainly worried about. But, more than that, I'm
concerned about general *semantic* and API changes at a rapid pace for a
community that is just looking for stability and bug-fixes from NumPy
itself --- with innovation happening elsewhere.
Post by Nathaniel Smith
Partially I am less worried about this because I have a fair amount of
confidence that our review and QA process has improved these days to the
point that we would not let a change like that slip through by accident --
we have a lot more active reviewers, people are sensitized to the issues,
we've successfully landed intrusive changes like Sebastian's indexing
rewrite, ... though this is very much second-hand impressions on my part,
and I'd welcome input from folks like Chuck who have a clearer view on how
things have changed from then to now.
But more importantly, even if this is true, then I can't see how your
proposal helps. If we aren't good enough at our jobs to predict when we'll
break ABI, then by assumption it makes no sense to pick one release and
decide that this is the one time that we'll break ABI.)
I don't understand your point. Picking a release to break the ABI allows
you to actually do things like change macros to functions and move
structures around to be more consistent with a new design that is easier to
maintain and allows more growth. It has nothing to do with "whether you
are good at your job". Everyone has strengths and weaknesses.

This kind of clean-up may be needed regularly --- every 3 years would not
be a crazy pattern, but it could also be every 5 years if you wanted more
discipline. I already knew we needed to break the ABI "soonish" when I
released NumPy 1.0. The fact that we haven't officially done it yet (but
have done it unofficially) is a great injustice to "what could be" and has
slowed development of NumPy tremendously.

We've gone back and forth on this. I'm fine if we disagree, but I just
hope the disagreement doesn't lead to lack of cooperation as we both have
the same ultimate interests in seeing array-computing in Python improve.
I just don't support *major* changes without breaking the ABI without a
whole lot of proof that it is possible (without hackiness). You have
mentioned on your roadmap a lot of what I would consider *major* changes.
Some of it you describe how to get there. The most important change
(improving the dtype system) you don't.

Part of my point is that we now *know* how to improve the dtype system.
Let's do it. Let's not try "yet again" to do it differently inside an old
system designed by a scientist who didn't understand type-theory or type
systems (that was me by the way).    Look at data-shape in the Blaze
project. Take that and build a Python type-system that also outputs
struct-string syntax for memory-views. That's the data-description system
that NumPy should be using --- not trying to hack on a mixed array-scalar,
dtype-object system that may never support everything we now know is
needed.
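
For anyone who hasn't looked at it, data-shape descriptions read roughly
like this (assuming the standalone datashape package is installed; the
exact printed output may differ slightly):

    from datashape import dshape

    ds = dshape('5 * {x: float64, y: int32}')   # five records with two fields
    print(ds)           # 5 * {x: float64, y: int32}
    print(ds.measure)   # the element ("measure") type: {x: float64, y: int32}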

Trying to increment from where we are now will only lead to a
sub-optimal outcome and unfortunate instability when we already know what
to do differently. I doubt I will convince you --- certainly not via
email. I apologize in advance that I likely won't be able to respond in
depth to any more questions that are really just "prove to me that I can't"
kind of questions. Of course I can't prove that. All I'm saying is that
to me the evidence and my experience leads me to not be able to support
major changes like you have proposed without also intentionally breaking
the ABI (and thus calling it NumPy 2.0).

If I find time to write, I will try to use it to outline more specifically
what I think is a better approach to array- and table-computing in Python
that keeps the stability of NumPy and adds new features using different
approaches.

-Travis
Post by Nathaniel Smith
Post by Travis Oliphant
Thanks for the write-up Nathaniel. There is a lot of great detail and
interesting ideas here.
I am very eager to understand how to help NumPy and the wider
community move forward however I can (my passions on this have not changed
since 1999, though what I myself spend time on has changed).
There are a lot of ways to think about approaching this, though. It's
hard to get all the ideas on the table, and it was unfortunate we couldn't
get everybody who is a core NumPy dev together in person to have this
discussion, as there are still a lot of questions unanswered and a lot of
thought that has gone into other approaches that was not brought up or
represented in the meeting (how does Numba fit into this, what about
data-shape, dynd, memory-views and the Python type system, etc.). If NumPy
becomes just an interface-specification, then why don't we just do that
*outside* NumPy itself in a way that doesn't jeopardize the stability of
NumPy today? These are some of the real questions I have. I will try
to write up my thoughts in more depth soon, but I won't be able to respond
in-depth right now. I just wanted to comment because Nathaniel said I
disagree, which is only partly true.
The three most important things for me are 1) let's make sure we have
representation from as much of the community as possible (this is really
hard), 2) let's look around at the broader community and the prior art that
is happening in this space right now and 3) let's not pretend we are going
to be able to make all this happen without breaking ABI compatibility.
Let's just break ABI compatibility with NumPy 2.0 *and* have as much
fidelity with the API and semantics of current NumPy as possible (though
there will be some changes necessary long-term).
I don't think we should intentionally break ABI if we can avoid it, but I
also don't think we should spend inordinate amounts of time trying to
pretend that we won't break ABI (for at least some people), and most
importantly we should not pretend *not* to break the ABI when we actually
do. We did this once before with the roll-out of date-time, and it was
really unnecessary. When I released NumPy 1.0, there were several
things that I knew should be fixed very soon (NumPy was never designed to
not break ABI). Those problems are still there. Now that we have
quite a bit better understanding of what NumPy *should* be (there have been
tremendous strides in understanding and community size over the past 10
years), let's actually make the infrastructure we think will last for the
next 20 years (instead of trying to shoe-horn new ideas into a 20-year old
code-base that wasn't designed for it).
NumPy is a hard code-base. It has been since Numeric days in 1995. I
could be wrong, but my guess is that we will be passed by as a community if
we don't seize the opportunity to build something better than we can build
if we are forced to use a 20 year old code-base.
It is more important not to break people's code and to be clear when a
re-compile is necessary for dependencies. Those to me are the most
important constraints. There are a lot of great ideas that we all have
about what we want NumPy to be able to do. Some of these are pretty
transformational (and the more exciting they are, the harder I think they
are going to be to implement without breaking at least the ABI). There
is probably some CAP-like theorem around
Stability-Features-Speed-of-Development (pick 2) when it comes to Open
Source Software development, and making feature-progress with NumPy *is
going* to create instability, which concerns me.
I would like to see a little bit of pain one time with a NumPy 2.0,
rather than the constant-pain-from-constant-churn-over-many-years
approach that Nathaniel seems to advocate. To me, NumPy 2.0 is an
ABI-breaking release that is as API-compatible as possible and whose
semantics are not dramatically different.
There are at least 3 areas of compatibility (ABI, API, and semantic).
ABI-compatibility is a non-feature in today's world. There are so many
distributions of the NumPy stack (and conda makes it trivial for anyone to
build their own or for you to build one yourself). Making less-optimal
software-engineering choices because of fear of breaking the ABI is not
something I'm supportive of at all. We should not break ABI every
release, but a release every 3 years that breaks ABI is not a problem.
API compatibility should be much more sacrosanct, but it is also
something that can be managed. Any NumPy 2.0 should definitely
support the full NumPy API (though there could be deprecated swaths). I
think the community has done well in using deprecation and limiting the
public API to make this more manageable, and I would love to see a NumPy 2.0
that solidifies a future-oriented API along with a backward-compatible API
that is also available.
Semantic compatibility is the hardest. We have already broken this on
multiple occasions throughout the 1.x NumPy releases. Every time you
change the code, this can change. This is what I fear will cause deep
instability over the course of many years. These are things like the
casting-rule details, the effect of indexing changes, and any change to the
calculation approaches. It is, and has been, the most at risk during any
code changes. My view is that a NumPy 2.0 (with a new low-level
architecture) confines these changes to a single release rather than
unavoidably spreading them out over many, many releases.
I think that summarizes my main concerns. I will write up more
forward-thinking ideas for what else is possible in the coming weeks. In
the meantime, thanks for keeping the discussion going. It is extremely
exciting to see the help people have continued to provide to maintain and
improve NumPy. It will be exciting to see what the next few years bring as
well.
Best,
-Travis
--
*Travis Oliphant*
*Co-founder and CEO*


@teoliphant
512-222-5440
http://www.continuum.io
j***@gmail.com
2015-08-27 15:03:46 UTC
Permalink
Post by Travis Oliphant
Post by Nathaniel Smith
Hi Travis,
Thanks for taking the time to write up your thoughts!
I have many thoughts in return, but I will try to restrict myself to two
main ones :-).
1) On the question of whether work should be directed towards improving
There's plenty of room for debate about whether it's better engineering
practice to try and evolve an existing system in place versus starting
over, and I guess we have some fundamental disagreements there, but I
actually think this debate is a distraction -- we can agree to disagree,
because in fact we have to try both.
Yes, on this we agree. I think NumPy can improve *and* we can have new
innovative array objects. I don't disagree about that.
Post by Nathaniel Smith
At a practical level: NumPy *is* going to continue to evolve, because it
has users and people interested in evolving it; similarly, dynd and other
alternatives libraries will also continue to evolve, because they also have
people interested in doing it. And at a normative level, this is a good
thing! If NumPy and dynd both get better, than that's awesome: the worst
case is that NumPy adds the new features that we talked about at the
meeting, and dynd simultaneously becomes so awesome that everyone wants to
switch to it, and the result of this would be... that those NumPy features
are exactly the ones that will make the transition to dynd easier. Or if
some part of that plan goes wrong, then well, NumPy will still be there as
a fallback, and in the mean time we've actually fixed the major pain points
our users are begging us to fix.
You seem to be urging us all to make a double-or-nothing wager that your
extremely ambitious plans will all work out, with the entire numerical
Python ecosystem as the stakes. I think this ambition is awesome, but maybe
it'd be wise to hedge our bets a bit?
You are mischaracterizing my view. I think NumPy can evolve (though I
would personally rather see a bigger change to the underlying system like I
outlined before). But I don't believe it can evolve easily in the
direction needed without breaking the ABI, and insisting on not breaking it
--- or even putting too much effort into not breaking it --- will continue
to create less-optimal solutions that are harder to maintain and that do
not take advantage of the knowledge this community now has.
I'm also very concerned that 'evolving' NumPy will create a situation
where there are regular semantic and subtle API changes that will cause
NumPy to be less stable for its user-base. I've watched this happen, and
it comes at a time when people are already looking around for new and
different approaches anyway.
Post by Nathaniel Smith
2) You really emphasize this idea of an ABI-breaking (but not
API-breaking) release, and I think this must indicate some basic gap in how
we're looking at things. Where I'm getting stuck here is that... I actually
can't think of anything important that we can't do now, but could if we
were allowed to break ABI compatibility. The kinds of things that break ABI
but keep API are like... rearranging what order the fields in a struct fall
in, or changing the numeric value of opaque constants like
NPY_ARRAY_WRITEABLE. The biggest win I can think of is that we could save a
few bytes per array by arranging the fields inside the ndarray struct more
optimally, but that's hardly a feature to hang a 2.0 on. You seem to have a
vision of this ABI-breaking release as being something very different from
that, and I'm not clear on what this vision is.
We already broke the ABI with date-time changes --- it's still broken for
a certain percentage of users last I checked. So, part of my
disagreement is that we've tried this and it didn't work --- even though
smart people thought it would. I've had to deal with this personally and
I'm not enthusiastic about having to deal with this for the next 5 years
because of even more attempts to make changes while not breaking the ABI.
I think the group is more careful now --- but I still think the API is
broad enough and uses of NumPy deep enough that the effort involved in
trying not to break the ABI is just not worth it (because ABI compatibility
is a non-feature today). Adding new dtypes without breaking the ABI is
tricky (and doing it that way is ugly). I also continue to
believe that putting out a new ABI-breaking NumPy will allow re-compiling
*once* (with some porting changes needed), rather than subtle breakages
requiring code-changes every time a release is made. If subtle changes
aren't made, then the new features won't come. Right now, I'd rather have
stability from NumPy than new features. New features can come from other
libraries.
One specific change that could easily be made in NumPy 2.0 (the current
code but with an ABI change) is that Dtypes should become true type objects
and array-scalars (which are the current type-objects) should become
instances of those dtypes. That is the biggest clean-up needed, I think,
on the array front. There should not be *both* array-scalars and dtype
objects. They are the same thing fundamentally. It was a mistake to
have both of them. I don't see how to make that change without breaking
the ABI. Perhaps it could be done in a creative way --- but why put the
effort into that and end up with an even more hacky code-base.
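As a minimal sketch of the unification being described (hypothetical
classes, nothing like NumPy's real implementation), the dtype would itself
be a type object and the array-scalar just an instance of it, so there is
only one kind of thing:

    class DTypeMeta(type):
        # every dtype is a class (a real type object) built from this metaclass
        itemsize = None

    class float64(metaclass=DTypeMeta):
        itemsize = 8
        def __init__(self, value):
            self.value = float(value)     # the scalar payload
        def __repr__(self):
            return "float64(%r)" % self.value

    # float64 plays the role of the dtype object ...
    assert isinstance(float64, DTypeMeta) and float64.itemsize == 8
    # ... and its instances play the role of array-scalars.
    x = float64(3.0)
    assert isinstance(x, float64)
    print(x)                              # float64(3.0)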
NumPy's ABI was influenced by and evolved from Numeric and Numarray. It
was not "designed" to last 30 years.
I think the dtype "types" should potentially have different
member-structures. The ufunc sub-system needs an overhaul --- its
member structures need upgrades. With generalized ufuncs and the
iteration protocols of Mark Wiebe we know a whole lot more about ufuncs
now. Ufuncs are the same 1995 structure that Jim Hugunin wrote. I
suppose you *could* just tack new functions onto the end of the structure and
keep growing the list (while leaving old, unused structures as unused or
deprecated) --- or you can take the opportunity to tidy up a bit. The
longer you leave everything the same, the harder you make the code-base and
the more costly maintenance becomes. I just don't see the value there
--- and I see a lot of pain.
Regarding the ufunc subsystem: we've argued before about the lack of
multi-methods in NumPy. Continuing to add dunder-methods to try and get
around it will continue to make the system harder to maintain and more
brittle.
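A toy contrast of the two dispatch styles (hypothetical names, nothing from
NumPy's actual API): a dunder hook asks one operand at a time how to handle
the operation, while a multi-method looks the *pair* of types up in a
single registry.

    # Dunder style: every container type grows another special method.
    class MyArray:
        def __array_op__(self, op, other):
            return "MyArray handles %s with %s" % (op, type(other).__name__)

    # Multi-method style: one table keyed by (operation, left type, right type).
    _registry = {}

    def register(op, left, right):
        def decorate(func):
            _registry[(op, left, right)] = func
            return func
        return decorate

    def dispatch(op, a, b):
        return _registry[(op, type(a), type(b))](a, b)

    @register("add", MyArray, int)
    def add_myarray_int(a, b):
        return "add(MyArray, int)"

    print(MyArray().__array_op__("add", 1))    # dunder dispatch
    print(dispatch("add", MyArray(), 1))       # multi-method dispatch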
You mention making NumPy an interface to multiple things along with many
other ideas. I don't believe you can get there without real changes that
break things (at the very least semantic changes). I'm not excited about
those changes causing instability (which they will cause --- to me the
burden of proof that they won't is on you who wants to make the change and
not on me to say how they will). I also think it will take much
longer to get there incrementally (if at all) than just creating something
on top of newer ideas.
Post by Nathaniel Smith
The main reason I personally am against having a big ABI-breaking release
is not that I hate ABI breakage a priori, it's that all the big features
that I care about and that users are asking for seem to be ones that...
don't actually require doing that. At most they seem to get a mild benefit
from breaking some obscure corner cases. So the cost/benefits don't make
any sense to me.
So: can you give a concrete example of a change you have in mind where
breaking ABI would be the key enabler?
(I guess you might also be thinking of a separate issue that you sort of
allude to: Perhaps we will try to make changes which we think don't involve
breaking the ABI, but discover too late that we have failed to fully
understand the implications and have broken it by mistake. IIUC this is
what happened in the 1.4 timeframe when datetime64 was merged and
accidentally renumbered some of the NPY_* constants.
Yes, this is what I'm mainly worried about. But, more than that, I'm
concerned about general *semantic* and API changes at a rapid pace for a
community that is just looking for stability and bug-fixes from NumPy
itself --- with innovation happening elsewhere.
From my perspective, the incremental evolutionary approach in numpy (and
scipy) in the last few years has worked quite well, and I'm optimistic that
it will work in the future if the developers can pull it off.

The main changes I remember that needed adjustment in scipy (as an
observer) or statsmodels (as a maintainer) came from numpy becoming
stricter in several cases. This mainly affects corner cases, or cases
where the downstream code wasn't "clean". Some API breaking (with
deprecation) and some semantic changes are still needed, independent of any
big changes that may or may not arrive anytime soon.
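One concrete illustration of that kind of tightening (the specific cases
aren't named above, so this is only an example): in-place ufuncs moved to
'same_kind' casting around the NumPy 1.10 timeframe, so downstream code
that silently truncated floats into an integer array now has to opt in
explicitly.

    import numpy as np

    counts = np.zeros(3, dtype=np.int64)
    weights = np.array([0.5, 1.5, 2.5])

    try:
        counts += weights                  # int64 += float64: now rejected
    except TypeError as exc:
        print("now an error:", exc)

    # The downstream fix: state explicitly that truncation is intended.
    np.add(counts, weights, out=counts, casting="unsafe")
    print(counts)                          # [0 1 2]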

This way we get improvements in a core library with the requirement that
every once in a while we need to adjust our code. (And with the occasional
unintended side effect where test coverage is not enough.)
The advantage is that we are getting the improvements with the regular
release cycles, and they keep numpy alive and competitive for another 10
years or more. In the meantime, other packages like pandas can cater to
and expand into other use cases, or other packages can develop generic
arrays and out-of-core and distributed arrays.

I'm partially following some of the Julia mailing lists. Starting something
from scratch is a lot of work, and my guess is that similar approaches in
python will take some time to become mainstream. In the meantime we can
build something on an improving numpy.

---
The only thing I'm not so happy about in recent years is the
proliferation of object arrays, both in numpy code and in pandas. And I
hope that the (dtype) proposals help to get rid of some of those object
arrays.
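A small illustration of how object arrays sneak in and why richer dtypes
would help: anything NumPy has no native dtype for falls back to
dtype=object --- an array of Python pointers looped over in the interpreter
--- where a first-class decimal, categorical, or variable-length-string
dtype could use a typed, contiguous buffer instead.

    from decimal import Decimal
    import numpy as np

    a = np.array([Decimal("1.10"), Decimal("2.25"), Decimal("3.50")])
    print(a.dtype)                         # object
    print(a + a)                           # works, but element by element in Python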


Josef
Post by Travis Oliphant
Post by Nathaniel Smith
Post by Travis Oliphant
Thanks for the write-up Nathaniel. There is a lot of great detail and
interesting ideas here.
I've am very eager to understand how to help NumPy and the wider
community move forward however I can (my passions on this have not changed
since 1999, though what I myself spend time on has changed).
There are a lot of ways to think about approaching this, though. It's
hard to get all the ideas on the table, and it was unfortunate we couldn't
get everybody wyho are core NumPy devs together in person to have this
discussion as there are still a lot of questions unanswered and a lot of
thought that has gone into other approaches that was not brought up or
represented in the meeting (how does Numba fit into this, what about
data-shape, dynd, memory-views and Python type system, etc.). If NumPy
becomes just an interface-specification, then why don't we just do that
*outside* NumPy itself in a way that doesn't jeopardize the stability of
NumPy today. These are some of the real questions I have. I will try
to write up my thoughts in more depth soon, but I won't be able to respond
in-depth right now. I just wanted to comment because Nathaniel said I
disagree which is only partly true.
The three most important things for me are 1) let's make sure we have
representation from as wide of the community as possible (this is really
hard), 2) let's look around at the broader community and the prior art that
is happening in this space right now and 3) let's not pretend we are going
to be able to make all this happen without breaking ABI compatibility.
Let's just break ABI compatibility with NumPy 2.0 *and* have as much
fidelity with the API and semantics of current NumPy as possible (though
there will be some changes necessary long-term).
I don't think we should intentionally break ABI if we can avoid it, but
I also don't think we should spend in-ordinate amounts of time trying to
pretend that we won't break ABI (for at least some people), and most
importantly we should not pretend *not* to break the ABI when we actually
do. We did this once before with the roll-out of date-time, and it was
really un-necessary. When I released NumPy 1.0, there were several
things that I knew should be fixed very soon (NumPy was never designed to
not break ABI). Those problems are still there. Now, that we have
quite a bit better understanding of what NumPy *should* be (there have been
tremendous strides in understanding and community size over the past 10
years), let's actually make the infrastructure we think will last for the
next 20 years (instead of trying to shoe-horn new ideas into a 20-year old
code-base that wasn't designed for it).
NumPy is a hard code-base. It has been since Numeric days in 1995.
I could be wrong, but my guess is that we will be passed by as a community
if we don't seize the opportunity to build something better than we can
build if we are forced to use a 20 year old code-base.
It is more important to not break people's code and to be clear when a
re-compile is necessary for dependencies. Those to me are the most
important constraints. There are a lot of great ideas that we all have
about what we want NumPy to be able to do. Some of this are pretty
transformational (and the more exciting they are, the harder I think they
are going to be to implement without breaking at least the ABI). There
is probably some CAP-like theorem around
Stability-Features-Speed-of-Development (pick 2) when it comes to Open
Source Software development and making feature-progress with NumPy *is
going* to create in-stability which concerns me.
I would like to see a little-bit-of-pain one time with a NumPy 2.0,
rather than a constant pain because of constant churn over many years
approach that Nathaniel seems to advocate. To me NumPy 2.0 is an
ABI-breaking release that is as API-compatible as possible and whose
semantics are not dramatically different.
There are at least 3 areas of compatibility (ABI, API, and semantic).
ABI-compatibility is a non-feature in today's world. There are so many
distributions of the NumPy stack (and conda makes it trivial for anyone to
build their own or for you to build one yourself). Making less-optimal
software-engineering choices because of fear of breaking the ABI is not
something I'm supportive of at all. We should not break ABI every
release, but a release every 3 years that breaks ABI is not a problem.
API compatibility should be much more sacrosanct, but it is also
something that can also be managed. Any NumPy 2.0 should definitely
support the full NumPy API (though there could be deprecated swaths). I
think the community has done well in using deprecation and limiting the
public API to make this more manageable and I would love to see a NumPy 2.0
that solidifies a future-oriented API along with a back-ward compatible API
that is also available.
Semantic compatibility is the hardest. We have already broken this on
multiple occasions throughout the 1.x NumPy releases. Every time you
change the code, this can change. This is what I fear causing deep
instability over the course of many years. These are things like the
casting rule details, the effect of indexing changes, any change to the
calculations approaches. It is and has been the most at risk during any
code-changes. My view is that a NumPy 2.0 (with a new low-level
architecture) minimizes these changes to a single release rather than
unavoidably spreading them out over many, many releases.
I think that summarizes my main concerns. I will write-up more forward
thinking ideas for what else is possible in the coming weeks. In the mean
time, thanks for keeping the discussion going. It is extremely exciting to
see the help people have continued to provide to maintain and improve
NumPy. It will be exciting to see what the next few years bring as well.
Best,
-Travis
Post by Nathaniel Smith
Hi all,
These are the notes from the NumPy dev meeting held July 7, 2015, at
the SciPy conference in Austin, presented here so the list can keep up
with what happens, and so you can give feedback. Please do give
feedback, none of this is final!
(Also, if anyone who was there notices anything I left out or
mischaracterized, please speak up -- these are a lot of notes I'm
trying to gather together, so I could easily have missed something!)
Thanks to Jill Cowan and the rest of the SciPy organizers for donating
space and organizing logistics for us, and to the Berkeley Institute
for Data Science for funding travel for Jaime, Nathaniel, and
Sebastian.
Attendees
=========
Present in the room for all or part: Daniel Allan, Chris Barker,
Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
pretty sure this list is incomplete)
Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
Formalizing our governance/decision making
==========================================
This was a major focus of discussion. At a high level, the consensus
was to steal IPython's governance document ("IPEP 29") and modify it
to remove its use of a BDFL as a "backstop" to normal community
consensus-based decision, and replace it with a new "backstop" based
on Apache-project-style consensus voting amongst the core team.
I'll send out a proper draft of this shortly for further discussion.
Development roadmap
===================
Let's assume NumPy is going to remain important indefinitely, and
try to make it better, instead of waiting for something better to
come along. (This is unlikely to be wasted effort even if something
better does come along, and it's hardly a sure thing that that will
happen anyway.)
Let's focus on evolving numpy as far as we can without major
break-the-world changes (no "numpy 2.0", at least in the foreseeable
future).
And, as a target for that evolution, let's change our focus from
numpy as "NumPy is the library that gives you the np.ndarray object
(plus some attached infrastructure)", to "NumPy provides the
standard framework for working with arrays and array-like objects in
Python"
This means, creating defined interfaces between array-like objects /
ufunc objects / dtype objects, so that it becomes possible for third
parties to add their own and mix-and-match. Right now ufuncs are
pretty good at this, but if you want a new array class or dtype then
in most cases you pretty much have to modify numpy itself.
Vision: instead of everyone who wants a new container type having to
reimplement all of numpy, Alice can implement an array class using
(sparse / distributed / compressed / tiled / gpu / out-of-core /
delayed / ...) storage, pass it to code that was written using
direct calls to np.* functions, and it just works. (Instead of
np.sin being "the way you calculate the sine of an ndarray", it's
"the way you calculate the sine of any array-like container
object".)
Vision: Darryl can implement a new dtype for (categorical data /
astronomical dates / integers-with-missing-values / ...) without
having to touch the numpy core.
Vision: Chandni can then come along and combine them by doing
a = alice_array([...], dtype=darryl_dtype)
and it just works.
Vision: no-one is tempted to subclass ndarray, because anything you
can do with an ndarray subclass you can also easily do by defining
your own new class that implements the "array protocol".
Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
API right there.
- Go through the rest of the stuff in numpy, and figure out some
- ufunc ALL the things: Some things can be converted directly into
(g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
things could be converted into (g)ufuncs if we extended the
(g)ufunc interface a bit (e.g. np.sort, np.matmul).
- Some things probably need their own __numpy_ufunc__-like
extensions (__numpy_concatenate__?)
- Provide tools to make it easier to implement the more complicated
parts of an array object (e.g. the bazillion different methods,
many of which are ufuncs in disguise, or indexing)
- Longer-run interesting research project: __numpy_ufunc__ requires
that one or the other object have explicit knowledge of how to
handle the other, so to handle binary ufuncs with N array types
you need something like N**2 __numpy_ufunc__ code paths. As an
alternative, if there were some interface that an object could
export that provided the operations nditer needs to efficiently
iterate over (chunks of) it, then you would only need N
implementations of this interface to handle all N**2 operations.
- blosc
- dask
- distarray
- numpy.ma
- pandas
- scipy.sparse
- xray
Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We already have something like a C level "dtype
protocol". Conceptually, the way you define a new dtype is by
defining a new class whose instances have data attributes defining
the parameters of the dtype (what fields are in *this* record dtype,
how many characters are in *this* string dtype, what units are used
for *this* datetime64, etc.), and you define a bunch of methods to
do things like convert an object from a Python object to your dtype
or vice-versa, to copy an array of your dtype from one place to
another, to cast to and from your new dtype, etc. This part is
great.
The problem is, in the current implementation, we don't actually use
the Python object system to define these classes / attributes /
methods. Instead, all possible dtypes are jammed into a single
Python-level class, whose struct has fields for the union of all
possible dtype's attributes, and instead of Python-style method
slots there's just a big table of function pointers attached to each
object.
So the main proposal is that we keep the basic design, but switch it
so that the float64 dtype, the int64 dtype, etc. actually literally
are subclasses of np.dtype, each implementing their own fields and
Python-style methods.
- The current dtype methods should be cleaned up -- e.g. 'dot' and
'less_than' are both dtype methods, when conceptually they're much
more like ufuncs.
- The ufunc inner-loop interface currently does not get a reference
to the dtype object, so they can't see its attributes and this is
a big obstacle to many interesting dtypes (e.g., it's hard to
implement np.equal for categoricals if you don't know what
categories each has). So we need to add new arguments to the core
ufunc loop signature. (Fortunately this can be done in a
backwards-compatible way.)
- We need to figure out what exactly the dtype methods should be,
and add them to the dtype class (possibly with backwards
compatibility shims for anyone who is accessing PyArray_ArrFuncs
directly).
- Casting will be possibly the trickiest thing to work out, though
the basic idea of using dunder-dispatch-like __cast__ and
__rcast__ methods seems workable. (Encouragingly, this is also
exactly what dynd also does, though unfortunately dynd does not
yet support user-defined dtypes even to the extent that numpy
does, so there isn't much else we can steal from them.)
- We may also want to rethink the casting rules while we're at it,
since they have some very weird corners right now (e.g. see
[https://github.com/numpy/numpy/issues/6240])
- We need to migrate the current dtypes over to the new system,
- First stick them all in a single "legacy dtype" class whose
methods just dispatch to the PyArray_ArrFuncs per-object "method
table"
- Then move each of them into their own classes
- We should provide a Python-level wrapper for the protocol, so that
you can call dtype methods from Python
- And vice-versa, it should be possible to subclass dtype at the
Python level
- etc.
Fortunately, AFAICT pretty much all of this can be done while
maintaining backwards compatibility (though we may want to break
some obscure cases to avoid expending *too* much effort with weird
backcompat contortions that will only help a vanishingly small
proportion of the userbase), and a lot of the above changes can be
done as semi-independent mini-projects, so there's no need for some
branch to go off and spend a year rewriting the world.
Obviously there are still a lot of details to work out, though. But
overall, there was widespread agreement that this is one of the #1
pain points for our users (e.g. it's the single main request from
pandas), and fixing it is very high priority.
Some features that would become straightforward to implement
- missing value support
- physical unit tracking (meters / seconds -> array of velocity;
meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes
with attached timezones, or using funky geophysical or
astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that
computes f(x) where x is an array of float64; pass that function
an array with a special dtype and get out both f(x) and f'(x))
- probably others I'm forgetting right now
I should also note that there was one substantial objection to this
plan, from Travis Oliphant (in discussions later in the
conference). I'm not confident I understand his objections well
enough to reproduce them here, though -- perhaps he'll elaborate.
Money
=====
There was an extensive discussion on the topic of: "if we had money,
what would we do with it?"
This is partially motivated by the realization that there are a
number of sources that we could probably get money from, if we had a
good story for what we wanted to do, so it's not just an idle
question.
- Doing the in-person meeting was a good thing. We should plan do
that again, at least once a year. So one thing to spend money on
is travel subsidies to make sure that happens and is productive.
- While it's tempting to imagine hiring junior people for the more
frustrating/boring work like maintaining buildbots, release
infrastructure, updating docs, etc., this seems difficult to do
realistically with our current resources -- how do we hire for
this, who would manage them, etc.?
- On the other hand, the general feeling was that if we found the
money to hire a few more senior people who could take care of
themselves more, then that would be good and we could
realistically absorb that extra work without totally unbalancing
the project.
- A major open question is how we would recruit someone for a
position like this, since apparently all the obvious candidates
who are already active on the NumPy team already have other
things going on. [For calibration on how hard this can be: NYU
has apparently had an open position for a year with the job
description of "come work at NYU full-time with a
private-industry-competitive-salary on whatever your personal
open-source scientific project is" (!) and still is having an
[http://cds.nyu.edu/research-engineer/]]
- General consensus though was that there isn't much to be done
about this though, except try it and see.
- (By the way, if you're someone who's reading this and
potentially interested in like a postdoc or better working on
numpy, then let's talk...)
More specific changes to numpy that had general consensus, but don't
really fit into a high-level roadmap
=========================================================================================================
- Resolved: we should merge multiarray.so and umath.so into a single
extension module, so that they can share utility code without the
current awkward contortions.
- Resolved: we should start hiding new fields in the ufunc and dtype
structs as soon as possible going forward. (I.e. they would not be
present in the version of the structs that are exposed through the
C API, but internally we would use a more detailed struct.)
- Mayyyyyybe we should even go ahead and hide the subset of the
existing fields that are really internal details that no-one
should be using. If we did this without changing anything else
then it would preserve ABI (the fields would still be where
existing compiled extensions expect them to be, if any such
extensions exist) while breaking API (trying to compile such
extensions would give a clear error), so would be a smoother
ramp if we think we need to eventually break those fields for
real. (As discussed above, there are a bunch of fields in the
dtype base class that only make sense for specific dtype
subclasses, e.g. only record dtypes need a list of field names,
but right now all dtypes have one anyway. So it would be nice to
remove these from the base class entirely, but that is
potentially ABI-breaking.)
- Resolved: np.array should never return an object array unless
explicitly requested (e.g. with dtype=object); it just causes too
many surprising problems.
- First step: add a deprecation warning
- Eventually: make it an error.
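(A small illustration of the surprise in question, with the behaviour
of the 1.x releases current at the time of writing: a ragged nested
list silently becomes an array of Python list objects.)

    import numpy as np

    a = np.array([[1, 2, 3], [4, 5]])   # ragged input, no dtype requested
    print(a.dtype, a.shape)   # object (2,) -- two list objects, not the
                              # 2-D numeric array one probably expected
    b = np.array([[1, 2, 3], [4, 5]], dtype=object)   # explicit: fine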
- The matrix class
- Resolved: We won't add warnings yet, but we will prominently
document that it is deprecated and should be avoided wherever
possible.
- Stéfan van der Walt volunteers to do this.
- We'd all like to deprecate it properly, but the feeling was that
the precondition for this is for scipy.sparse to provide sparse
"arrays" that don't return np.matrix objects on ordinary
operations. Until that happens we can't reasonably tell people
that using np.matrix is a bug.
- Resolved: we should add a similar prominent note to the
"subclassing ndarray" documentation, warning people that this is
painful and barely works and please don't do it if you have any
alternatives.
- Resolved: we want more, smaller releases -- every 6 months at
least, aiming to go even faster (every 4 months?)
- Using Cython inside numpy itself:
- Everyone agrees that there are places where this would be an
improvement (e.g., Python<->C interfaces, and places "when you
want to do computer science", e.g. complicated algorithmic stuff
like graph traversals)
- Chuck wanted it to be clear though that he doesn't think it
would be a good goal to try and rewrite all of numpy in Cython
-- there also exist places where Cython ends up being "an uglier
version of C". No-one disagreed.
- Our text reader is apparently not very functional on Python 3, and
generally slow and hard to work with.
- Resolved: We should extract Pandas's awesome text reader/parser
and convert it into its own package, that could then become a
new backend for both pandas and numpy.loadtxt.
- Jeff thinks this is a great idea
- Thomas Caswell volunteers to do the extraction.
- We should work on improving our tools for evolving the ABI, so
that we will eventually be less constrained by decisions made
decades ago.
- One idea that had a lot of support was to switch from our
current append-only C-API to a "sliding window" API based on
explicit versions. So a downstream package might say
#define NUMPY_API_VERSION 4
and they'd get the functions and behaviour provided in "version
4" of the numpy C api. If they wanted to get access to new stuff
that was added in version 5, then they'd need to switch that
#define, and at the same time clean up any usage of stuff that
was removed or changed in version 5. And to provide a smooth
migration path, one version of numpy would support multiple
versions at once, gradually deprecating and dropping old
versions.
- If anyone wants to help bring pip up to scratch WRT tracking ABI
dependencies (e.g., 'pip install numpy==<version with new ABI>'
-> triggers rebuild of scipy against the new ABI), then that
would be an extremely useful thing.
Policies that should be documented
==================================
...together with some notes about what the contents of the document
should be.
How we manage bugs in the bug tracker.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Github "milestones" should *only* be assigned to release-blocker
bugs (which mostly means "regression from the last release").
In particular, if you're tempted to push a bug forward to the next
release... then it's clearly not a blocker, so don't set it to the
next release's milestone, just remove the milestone entirely.
(Obvious exception to this: deprecation followup bugs where we
decide that we want to keep the deprecation around a bit longer
are a case where a bug actually does switch from being a blocker
for release 1.x to being a blocker for release 1.(x+1).)
- Don't hesitate to close an issue if there's no way forward --
e.g. a PR where the author has disappeared. Just post a link to
this policy and close, with a polite note that we need to keep our
tracker useful as a todo list, but they're welcome to re-open if
things change.
Deprecation and breakage policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- How long do we need to keep DeprecationWarnings around before we
break things? This is tricky because on the one hand an aggressive
(short) deprecation period lets us deliver new features and
important cleanups more quickly, but on the other hand a
too-aggressive deprecation period is difficult for our more
conservative downstream users.
- Idea that had the most support: pick a somewhat-aggressive
warning period as our default, and make a rule that if someone
asks for an extension during the beta cycle for the release that
removes it, then we put it back for another release or two worth
of grace period. (While also possibly upgrading the warning to
be more visible during the grace period.) This gives us
deprecation periods that are more adaptive on a case-by-case
basis.
- Lament: it would be really nice if we could get more people to
test our beta releases, because in practice right now 1.x.0 ends
up being where we actually discover all the bugs, and 1.x.1 is
where it actually becomes usable. Which sucks, and makes it
difficult to have a solid policy about what counts as a
regression, etc. Is there anything we can do about this?
- ABI breakage: we distinguish between an ABI break that breaks
everything (e.g., "import scipy" segfaults), versus an ABI break
that breaks an occasional rare case (e.g., only apps that poke
around in some obscure corner of some struct are affected).
- The "break-the-world" type remains off-limits for now: the pain
is still too large (conda helps, but there are lots of people
who don't use conda!), and there aren't really any compelling
improvements that this would enable anyway.
- For the "break-0.1%-of-users" type, it is *not* ruled out by
fiat, though we remain conservative: we should treat it like
other API breaks in principle, and do a careful case-by-case
analysis of the details of the situation, taking into account
what kind of code would be broken, how common these cases are,
how important the benefits are, whether there are any specific
mitigation strategies we can use, etc. -- with this process of
course taking into account that a segfault is nastier than a
Python exception.
Other points that were discussed
================================
- There was inconclusive discussion of what we should do with dot()
in the places where it disagrees with the PEP 465 matmul semantics
(specifically this is when both arguments have ndim >= 3, or one
argument has ndim == 0).
- The concern is that the current behavior is not very useful, and
as far as we can tell no-one is using it; but, as people get
used to the more-useful PEP 465 behavior, they will increasingly
try to use it on the assumption that np.dot will work the same
way, and this will create pain for lots of people. So Nathaniel
argued that we should start at least issuing a visible warning
when people invoke the corner-case behavior.
- But OTOH, np.dot is such a core piece of infrastructure, and
there's such a large landscape of code out there using numpy
that we can't see, that others were reasonably wary of making
any change.
- For now: document prominently, but no change in behavior.
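To make the corner case above concrete, here is a minimal sketch
(assuming a numpy new enough to have matmul, i.e. 1.10+) contrasting
the two behaviours for stacked matrices:

    import numpy as np

    a = np.ones((2, 3, 4))
    b = np.ones((2, 4, 5))
    print(np.dot(a, b).shape)     # (2, 3, 2, 5): dot pairs a's last axis
                                  # with *every* stack in b, an
                                  # outer-product-like rule
    print(np.matmul(a, b).shape)  # (2, 3, 5): PEP 465 treats the leading
                                  # axis as a stack and multiplies
                                  # matrix by matrix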
Links to raw notes
==================
[https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
[https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing]
[https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp]
[https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit]
-n
--
Nathaniel J. Smith -- http://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Marten van Kerkwijk
2015-08-31 04:12:46 UTC
Permalink
Hi Nathaniel, others,

I read the discussion of plans with interest. One item that struck me is
that while there are great plans to have a proper extensible and presumably
subclassable dtype, it is discouraged to subclass ndarray itself (rather,
it is encouraged to use a broader array interface). From my experience with
astropy in both Quantity (an ndarray subclass), Time (a separate class
containing high precision times using two ndarray float64), and Table
(initially holding structured arrays, but now sets of Columns, which
themselves are ndarray subclasses), I'm not convinced the broader, new
containers approach is that much preferable. Rather, it leads to a lot of
boiler-plate code to reimplement things ndarray does already (since one is
effectively just calling the methods on the underlying arrays).

I also think the idea that a dtype becomes something that also contains a
unit is a bit odd. Shouldn't dtype just be about how data is stored? Why
include meta-data such as units?

Instead, I think a quantity is most logically seen as numbers with a unit,
just like masked arrays are numbers with masks, and variables numbers with
uncertainties. Each of these cases adds extra information in different
forms, and all are quite easily thought of as subclasses of ndarray where
all operations do the normal operation, plus some extra work to keep the
extra information up to date.

Anyway, my suggestion would be to *encourage* rather than discourage
ndarray subclassing, and help this by making ndarray (even) better.

All the best,

Marten
Post by Travis Oliphant
Post by Travis Oliphant
Post by Nathaniel Smith
Hi Travis,
Thanks for taking the time to write up your thoughts!
I have many thoughts in return, but I will try to restrict myself to two
main ones :-).
1) On the question of whether work should be directed towards improving
NumPy itself:
There's plenty of room for debate about whether it's better engineering
practice to try and evolve an existing system in place versus starting
over, and I guess we have some fundamental disagreements there, but I
actually think this debate is a distraction -- we can agree to disagree,
because in fact we have to try both.
Yes, on this we agree. I think NumPy can improve *and* we can have new
innovative array objects. I don't disagree about that.
Post by Nathaniel Smith
At a practical level: NumPy *is* going to continue to evolve, because it
has users and people interested in evolving it; similarly, dynd and other
alternative libraries will also continue to evolve, because they also have
people interested in doing it. And at a normative level, this is a good
thing! If NumPy and dynd both get better, then that's awesome: the worst
case is that NumPy adds the new features that we talked about at the
meeting, and dynd simultaneously becomes so awesome that everyone wants to
switch to it, and the result of this would be... that those NumPy features
are exactly the ones that will make the transition to dynd easier. Or if
some part of that plan goes wrong, then well, NumPy will still be there as
a fallback, and in the mean time we've actually fixed the major pain points
our users are begging us to fix.
You seem to be urging us all to make a double-or-nothing wager that your
extremely ambitious plans will all work out, with the entire numerical
Python ecosystem as the stakes. I think this ambition is awesome, but maybe
it'd be wise to hedge our bets a bit?
You are mis-characterizing my view. I think NumPy can evolve (though I
would personally rather see a bigger change to the underlying system like I
outlined before). But, I don't believe it can even evolve easily in the
direction needed without breaking ABI and that insisting on not breaking it
or even putting too much effort into not breaking it will continue to
create less-optimal solutions that are harder to maintain and do not take
advantage of knowledge this community now has.
I'm also very concerned that 'evolving' NumPy will create a situation
where there are regular semantic and subtle API changes that will cause
NumPy to be less stable for its user-base. I've watched this happen.
This at a time that people are already looking around for new and different
approaches anyway.
Post by Nathaniel Smith
2) You really emphasize this idea of an ABI-breaking (but not
API-breaking) release, and I think this must indicate some basic gap in how
we're looking at things. Where I'm getting stuck here is that... I actually
can't think of anything important that we can't do now, but could if we
were allowed to break ABI compatibility. The kinds of things that break ABI
but keep API are like... rearranging what order the fields in a struct fall
in, or changing the numeric value of opaque constants like
NPY_ARRAY_WRITEABLE. The biggest win I can think of is that we could save a
few bytes per array by arranging the fields inside the ndarray struct more
optimally, but that's hardly a feature to hang a 2.0 on. You seem to have a
vision of this ABI-breaking release as being something very different from
that, and I'm not clear on what this vision is.
We already broke the ABI with date-time changes --- it's still broken for
a certain percentage of users last I checked. So, part of my
disagreement is that we've tried this and it didn't work --- even though
smart people thought it would. I've had to deal with this personally and
I'm not enthusiastic about having to deal with this for the next 5 years
because of even more attempts to make changes while not breaking the ABI.
I think the group is more careful now --- but I still think the API is
broad enough and uses of NumPy deep enough that the effort involved in
trying not to break the ABI is just not worth it (because it's a
non-feature today). Adding new dtypes without breaking the ABI is tricky
(and to do it without breaking the ABI is ugly). I also continue to
believe that putting out a new ABI-breaking NumPy will allow re-compiling
*once* (with some porting changes needed) and not subtle breakages
requiring code-changes every time a release is made. If subtle changes
aren't made, then the new features won't come. Right now, I'd rather have
stability from NumPy than new features. New features can come from other
libraries.
One specific change that could easily be made in NumPy 2.0 (the current
code but with an ABI change) is that Dtypes should become true type objects
and array-scalars (which are the current type-objects) should become
instances of those dtypes. That is the biggest clean-up needed, I think on
the array-front. There should not be *both* array-scalars and dtype
objects. They are the same thing fundamentally. It was a mistake to
have both of them. I don't see how to make that change without breaking
the ABI. Perhaps it could be done in a creative way --- but why put the
effort into that and end up with an even more hacky code-base.
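(A quick, purely illustrative aside on what the paragraph above refers
to, using numpy as it is today: the array-scalar np.float64 is a Python
*type*, while np.dtype('float64') is an *instance* of the separate
np.dtype class -- two parallel objects describing the same thing.)

    import numpy as np

    print(isinstance(np.float64, type))            # True: an array-scalar type
    print(type(np.dtype('float64')))               # <class 'numpy.dtype'>: an instance
    print(np.dtype('float64').type is np.float64)  # True: linked, but two objects
    print(issubclass(np.float64, np.dtype))        # False: no unified hierarchy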
NumPy's ABI was influenced by and evolved from Numeric and Numarray. It
was not "designed" to last 30 years.
I think the dtype "types" should potentially have different
member-structures. The ufunc sub-system needs an overhaul --- its
member structures need upgrades. With generalized ufuncs and the
iteration protocols of Mark Wiebe we know a whole lot more about ufuncs
now. Ufuncs are the same 1995 structure that Jim Hugunin wrote. I
suppose you *could* just tack new functions on the end of structure and
keep growing the list (while leaving old, unused structures as unused or
deprecated) --- or you can take the opportunity to tidy up a bit. The
longer you leave everything the same, the harder you make the code-base and
the more costly maintenance becomes. I just don't see the value there
--- and I see a lot of pain.
Regarding the ufunc subsystem. We've argued before about the lack of
multi-methods in NumPy. Continuing to add dunder-methods to try and get
around it will continue to make the system harder to maintain and more
brittle.
You mention making NumPy an interface to multiple things along with many
other ideas. I don't believe you can get there without real changes that
break things (at the very least semantic changes). I'm not excited about
those changes causing instability (which they will cause ---- to me the
burden of proof that they won't is on you who wants to make the change and
not on me to say how they will). I also think it will take much
longer to get there incrementally (if at all) than just creating something
on top of newer ideas.
Post by Nathaniel Smith
The main reason I personally am against having a big ABI-breaking
release is not that I hate ABI breakage a priori, it's that all the big
features that I care about and that users are asking for seem to be ones
that... don't actually require doing that. At most they seem to get a mild
benefit from breaking some obscure corner cases. So the cost/benefits don't
make any sense to me.
So: can you give a concrete example of a change you have in mind where
breaking ABI would be the key enabler?
(I guess you might also be thinking of a separate issue that you sort of
allude to: Perhaps we will try to make changes which we think don't involve
breaking the ABI, but discover too late that we have failed to fully
understand the implications and have broken it by mistake. IIUC this is
what happened in the 1.4 timeframe when datetime64 was merged and
accidentally renumbered some of the NPY_* constants.
Yes, this is what I'm mainly worried about. But, more than that, I'm
concerned about general *semantic* and API changes at a rapid pace for a
community that is just looking for stability and bug-fixes from NumPy
itself --- with innovation happening elsewhere.
Post by Nathaniel Smith
Partially I am less worried about this because I have a fair amount of
confidence that our review and QA process has improved these days to the
point that we would not let a change like that slip through by accident --
we have a lot more active reviewers, people are sensitized to the issues,
we've successfully landed intrusive changes like Sebastian's indexing
rewrite, ... though this is very much second-hand impressions on my part,
and I'd welcome input from folks like Chuck who have a clearer view on how
things have changed from then to now.
But more importantly, even if this is true, then I can't see how your
proposal helps. If we aren't good enough at our jobs to predict when we'll
break ABI, then by assumption it makes no sense to pick one release and
decide that this is the one time that we'll break ABI.)
I don't understand your point. Picking a release to break the ABI
allows you to actually do things like change macros to functions and move
structures around to be more consistent with a new design that is easier to
maintain and allows more growth. It has nothing to do with "whether you
are good at your job". Everyone has strengths and weaknesses.
This kind of clean-up may be needed regularly --- every 3 years would not
be a crazy pattern, but it could also be every 5 years if you wanted more
discipline. I already knew we needed to break the ABI "soonish" when I
released NumPy 1.0. The fact that we haven't officially done it yet (but
have done it unofficially) is a great injustice to "what could be" and has
slowed development of NumPy tremendously.
We've gone back and forth on this. I'm fine if we disagree, but I just
hope the disagreement doesn't lead to lack of cooperation as we both have
the same ultimate interests in seeing array-computing in Python improve.
I just don't support *major* changes without breaking the ABI without a
whole lot of proof that it is possible (without hackiness). You have
mentioned on your roadmap a lot of what I would consider *major* changes.
Some of it you describe how to get there. The most important change
(improving the dtype system) you don't.
Part of my point is that we now *know* how to improve the dtype system.
Let's do it. Let's not try "yet again" to do it differently inside an old
system designed by a scientist who didn't understand type-theory or type
systems (that was me by the way). Look at data-shape in the blaze
project. Take that and build a Python type-system that also outputs
struct-string syntax for memory-views. That's the data-description system
that NumPy should be using --- not trying to hack on a mixed array-scalar,
dtype-object system that may never support everything we now know is
needed.
Trying to increment from where we are now will only lead to a
sub-optimal outcome and unfortunate instability when we already know what
to do differently. I doubt I will convince you --- certainly not via
email. I apologize in advance that I likely won't be able to respond in
depth to any more questions that are really just "prove to me that I can't"
kind of questions. Of course I can't prove that. All I'm saying is that
to me the evidence and my experience leads me to not be able to support
major changes like you have proposed without also intentionally breaking
the ABI (and thus calling it NumPy 2.0).
If I find time to write, I will try to use it to outline more
specifically what I think is a better approach to array- and
table-computing in Python that keeps the stability of NumPy and adds new
features using different approaches.
-Travis
From my perspective the incremental evolutionary approach in numpy (and
scipy) in the last few years has worked quite well, and I'm optimistic that
it will work in future if the developers can pull it off.
The main changes that I remember that needed adjustment in scipy (as
observer) or statsmodels (as maintainer) came from becoming more strict in
several cases. This mainly affects corner cases or cases where the
downstream code wasn't "clean". Some API breaking (with deprecation) and
some semantic changes are still needed independent of any big changes that
may or may not be arriving anytime soon.
This way we get improvements in a core library with the requirement that
every once in a while we need to adjust our code. (And with the occasional
unintended side effect where test coverage is not enough.)
The advantage is that we are getting the improvements with the regular
release cycles, and they keep numpy alive and competitive for another 10
years or more. In the meantime, other packages like pandas can cater and
expand to other use cases, or other packages can develop generic arrays and
out of core and distributed arrays.
I'm partially following some of the Julia mailing lists. Starting
something from scratch is a lot of work, and my guess is that similar
approaches in python will take some time to become mainstream. In the
meantime we can build something on an improving numpy.
---
The only thing I'm not so happy about in the last years is the
proliferation of object arrays, both in numpy code and in pandas. And I
hope that the (dtype) proposals help to get rid of some of those object
arrays.
Josef
Post by Travis Oliphant
Post by Nathaniel Smith
Post by Travis Oliphant
Thanks for the write-up Nathaniel. There is a lot of great detail and
interesting ideas here.
I am very eager to understand how to help NumPy and the wider
community move forward however I can (my passions on this have not changed
since 1999, though what I myself spend time on has changed).
There are a lot of ways to think about approaching this, though. It's
hard to get all the ideas on the table, and it was unfortunate we couldn't
get everybody who is a core NumPy dev together in person to have this
discussion as there are still a lot of questions unanswered and a lot of
thought that has gone into other approaches that was not brought up or
represented in the meeting (how does Numba fit into this, what about
data-shape, dynd, memory-views and Python type system, etc.). If NumPy
becomes just an interface-specification, then why don't we just do that
*outside* NumPy itself in a way that doesn't jeopardize the stability of
NumPy today. These are some of the real questions I have. I will try
to write up my thoughts in more depth soon, but I won't be able to respond
in-depth right now. I just wanted to comment because Nathaniel said I
disagree which is only partly true.
The three most important things for me are 1) let's make sure we have
representation from as wide of the community as possible (this is really
hard), 2) let's look around at the broader community and the prior art that
is happening in this space right now and 3) let's not pretend we are going
to be able to make all this happen without breaking ABI compatibility.
Let's just break ABI compatibility with NumPy 2.0 *and* have as much
fidelity with the API and semantics of current NumPy as possible (though
there will be some changes necessary long-term).
I don't think we should intentionally break ABI if we can avoid it, but
I also don't think we should spend inordinate amounts of time trying to
pretend that we won't break ABI (for at least some people), and most
importantly we should not pretend *not* to break the ABI when we actually
do. We did this once before with the roll-out of date-time, and it was
really un-necessary. When I released NumPy 1.0, there were several
things that I knew should be fixed very soon (NumPy was never designed to
not break ABI). Those problems are still there. Now, that we have
quite a bit better understanding of what NumPy *should* be (there have been
tremendous strides in understanding and community size over the past 10
years), let's actually make the infrastructure we think will last for the
next 20 years (instead of trying to shoe-horn new ideas into a 20-year old
code-base that wasn't designed for it).
NumPy is a hard code-base. It has been since Numeric days in 1995.
I could be wrong, but my guess is that we will be passed by as a community
if we don't seize the opportunity to build something better than we can
build if we are forced to use a 20 year old code-base.
It is more important to not break people's code and to be clear when a
re-compile is necessary for dependencies. Those to me are the most
important constraints. There are a lot of great ideas that we all have
about what we want NumPy to be able to do. Some of this are pretty
transformational (and the more exciting they are, the harder I think they
are going to be to implement without breaking at least the ABI). There
is probably some CAP-like theorem around
Stability-Features-Speed-of-Development (pick 2) when it comes to Open
Source Software development and making feature-progress with NumPy *is
going* to create instability, which concerns me.
I would like to see a little-bit-of-pain one time with a NumPy 2.0,
rather than the constant pain of the constant-churn-over-many-years
approach that Nathaniel seems to advocate. To me NumPy 2.0 is an
ABI-breaking release that is as API-compatible as possible and whose
semantics are not dramatically different.
There are at least 3 areas of compatibility (ABI, API, and semantic).
ABI-compatibility is a non-feature in today's world. There are so many
distributions of the NumPy stack (and conda makes it trivial for anyone to
build their own or for you to build one yourself). Making less-optimal
software-engineering choices because of fear of breaking the ABI is not
something I'm supportive of at all. We should not break ABI every
release, but a release every 3 years that breaks ABI is not a problem.
API compatibility should be much more sacrosanct, but it is also
something that can also be managed. Any NumPy 2.0 should definitely
support the full NumPy API (though there could be deprecated swaths). I
think the community has done well in using deprecation and limiting the
public API to make this more manageable and I would love to see a NumPy 2.0
that solidifies a future-oriented API along with a back-ward compatible API
that is also available.
Semantic compatibility is the hardest. We have already broken this on
multiple occasions throughout the 1.x NumPy releases. Every time you
change the code, this can change. This is what I fear causing deep
instability over the course of many years. These are things like the
casting rule details, the effect of indexing changes, any change to the
calculations approaches. It is and has been the most at risk during any
code-changes. My view is that a NumPy 2.0 (with a new low-level
architecture) minimizes these changes to a single release rather than
unavoidably spreading them out over many, many releases.
I think that summarizes my main concerns. I will write-up more forward
thinking ideas for what else is possible in the coming weeks. In the mean
time, thanks for keeping the discussion going. It is extremely exciting to
see the help people have continued to provide to maintain and improve
NumPy. It will be exciting to see what the next few years bring as well.
Best,
-Travis
Nathaniel Smith
2015-09-01 08:20:32 UTC
Permalink
On Sun, Aug 30, 2015 at 9:12 PM, Marten van Kerkwijk
Post by Marten van Kerkwijk
Hi Nathaniel, others,
I read the discussion of plans with interest. One item that struck me is
that while there are great plans to have a proper extensible and presumably
subclassable dtype, it is discouraged to subclass ndarray itself (rather, it
is encouraged to use a broader array interface). From my experience with
astropy in both Quantity (an ndarray subclass), Time (a separate class
containing high precision times using two ndarray float64), and Table
(initially holding structured arrays, but now sets of Columns, which
themselves are ndarray subclasses), I'm not convinced the broader, new
containers approach is that much preferable. Rather, it leads to a lot of
boiler-plate code to reimplement things ndarray does already (since one is
effectively just calling the methods on the underlying arrays).
I also think the idea that a dtype becomes something that also contains a
unit is a bit odd. Shouldn't dtype just be about how data is stored? Why
include meta-data such as units?
Instead, I think a quantity is most logically seen as numbers with a unit,
just like masked arrays are numbers with masks, and variables numbers with
uncertainties. Each of these cases adds extra information in different
forms, and all are quite easily thought of as subclasses of ndarray where
all operations do the normal operation, plus some extra work to keep the
extra information up to date.
The intuition behind the array/dtype split is that an array is just a
container: it knows how to shuffle bytes around, be reshaped, indexed,
etc., but it knows nothing about the meaning of the items it holds --
as far as it's concerned, each entry is just an opaque binary blob.
If it wants to actually do anything with these blobs, it has to ask
the dtype for help.

The dtype, OTOH, knows how to interpret these blobs, and (in
cooperation with ufuncs) to perform operations on them, but it doesn't
need to know how they're stored, or about slicing or anything like
that -- all that's the container's job.

Think about it this way: does it make sense to have a sparse array of
numbers-with-units? how about a blosc-style compressed array of
numbers-with-units? If yes, then numbers-with-units are a special kind
of dtype, not a special kind of array.

Another way of getting this intuition: if I have 8 bytes, that could
be an int64, or it could be a float64. Which one it is doesn't affect
how it's stored at all -- either way it's stored as a chunk of 8
arbitrary bytes. What it affects is how we *interpret* these bytes --
e.g. there is one function called "int64 addition" which takes two 8
byte chunks and returns a new 8 byte chunk as the result, and a second
function called "float64 addition" which takes those same two 8 byte
chunks and returns a different one. The dtype tells you which of these
operations should be used for a particular array. What's special about
a float64-with-units? Well, it's 8 bytes, but the addition operation
is different from regular float64 addition: it has to do some extra
checks and possibly unit conversions. This is exactly what the ufunc
dtype dispatch and casting system is there for.
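A minimal sketch of that point with plain numpy (on a typical
little-endian machine): the very same 8 bytes, viewed through two
different dtypes, give two different answers -- the storage never
changes, only the interpretation applied to it.

    import numpy as np

    raw = np.array([1.5], dtype=np.float64).tobytes()   # 8 opaque bytes
    print(np.frombuffer(raw, dtype=np.float64))  # [ 1.5]
    print(np.frombuffer(raw, dtype=np.int64))    # [4609434218613702656]
                                                 # same bits, other meaning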

This also solves your problem with having to write lots of boilerplate
code, b/c if this is a dtype then it means you can just use the actual
ndarray class directly without subclassing or anything :-).
Post by Marten van Kerkwijk
Anyway, my suggestion would be to *encourage* rather than discourage ndarray
subclassing, and help this by making ndarray (even) better.
So, we very much need robust support for
objects-that-quack-like-an-array that are *not* ndarrays, because
ndarray subclasses are forced to use ndarray-style strided in-memory
storage, and there's huge demand for objects that expose an array-like
interface but that use a different storage strategy underneath: sparse
arrays, compressed arrays (like blosc), out-of-core arrays,
computed-on-demand arrays (like dask), distributed arrays, etc. etc.

And once we have solid support for duck-arrays and for user-defined
dtypes (as discussed above), then those two things remove a huge
amount of the motivation for subclassing ndarray.

At the same time, ndarray subclassing is... nearly unmaintainable,
AFAICT. The problem with subclassing is that you're basically taking
some interface, making a copy of it, and then monkeypatching the copy.
As you would expect, this is intrinsically very fragile, because it
breaks abstraction barriers. Suddenly things that used to be
implementation details -- like which methods are implemented in terms
of which other methods -- become part of the public API. And there's
never been any coherent, documentable theory of how ndarray
subclassing is *supposed* to work, so in practice it's just a bunch of
ad hoc hooks designed around the needs of np.matrix and np.ma. We get
a regular stream of bug reports asking us to tweak things one way or
another, and it feels like trying to cover the floor with a too-small
carpet -- we end up with an API that covers the needs of whoever
complained most recently.

And there's the thing where as far as we can tell, 99% of the people
who have ever sat down and tried to subclass ndarray ended up
regretting it :-). Seriously, you are literally the only person who
I've ever heard say positive things about the experience, and I can't
really see why given how often I see you in the bug tracker
complaining about some weird breakage :-). So there aren't many people
motivated to work on it...

If someone has a good plan for how to fix all this then by all means,
speak up :-). But IMO it's better to write some boilerplate that you
can control than to import + monkeypatch, even if the latter seems
easier in the short run. And there's a lot we can do to reduce that
boilerplate -- e.g. when you want to implement a new sequence type in
Python you can write your __getitem__ and __len__ and then use
collections.abc.Sequence to fill in the rest of the interface; we've
been talking about adding something similar for arrays as part of the
__numpy_ufunc__ work.
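
(For concreteness, a minimal sketch of that mixin pattern -- the Squares
class here is made up purely for illustration:)

    from collections.abc import Sequence

    class Squares(Sequence):
        # Implement only __getitem__ and __len__; the Sequence mixin
        # fills in __iter__, __contains__, __reversed__, index(), count().
        def __init__(self, n):
            self._n = n
        def __len__(self):
            return self._n
        def __getitem__(self, i):
            if not 0 <= i < self._n:
                raise IndexError(i)
            return i * i

    s = Squares(5)
    print(list(s), 9 in s, s.index(16))   # [0, 1, 4, 9, 16] True 4

The hypothetical array-side equivalent would work the same way: you
implement a small core (indexing, shape, dtype, ...) and get the rest of
the ndarray-ish interface filled in for you.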

-n
--
Nathaniel J. Smith -- http://vorpus.org
Marten van Kerkwijk
2015-09-03 00:03:52 UTC
Permalink
Hi Nathaniel,

Thanks for the detailed reply; it helped a lot to understand how one could,
indeed, have dtypes contain units. And if one had not just on-the-fly
conversion from int to float as part of an internal loop, but also
on-the-fly multiplication, then it would even be remarkably fast. Will be
interesting to think this through in more detail.

Still think subclassing ndarray is not all *that* bad (MaskedArray is a
different story...), and it may still be needed for my other examples, but
perhaps masked/uncertainties do work with the collections idea. Anyway, it
now makes sense to focus on dtype first.

Thanks again,

Marten
Stephan Hoyer
2015-09-03 20:28:52 UTC
Permalink
From my perspective, a major advantage to dtypes is composability. For
example, it's hard to write a library like dask.array (out-of-core arrays)
that can support holding any conceivable ndarray subclass (like MaskedArray
or quantity), but handling arbitrary dtypes is quite straightforward -- and
that dtype information can be directly passed on, without the container
library knowing anything about the library that implements the dtype.
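
(A toy illustration of that "pass the dtype through" composability -- the
tile_zeros_like helper below is made up purely for illustration, but the
point is that the container-level code never has to know what the dtype
means:)

    import numpy as np

    def tile_zeros_like(arr, tiles=4):
        # Container-level logic only looks at shape; whatever dtype the
        # caller uses is forwarded untouched.
        return [np.zeros(arr.shape, dtype=arr.dtype) for _ in range(tiles)]

    x = np.zeros((2, 3), dtype=[("t", "datetime64[s]"), ("v", "f8")])
    print(tile_zeros_like(x)[0].dtype)   # [('t', '<M8[s]'), ('v', '<f8')]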

Stephan

Antoine Pitrou
2015-08-25 19:21:59 UTC
Permalink
On Tue, 25 Aug 2015 03:03:41 -0700
Post by Nathaniel Smith
Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[...]
Post by Nathaniel Smith
Some features that would become straightforward to implement
- missing value support
- physical unit tracking (meters / seconds -> array of velocity;
meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes
with attached timezones, or using funky geophysical or
astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that
computes f(x) where x is an array of float64; pass that function
an array with a special dtype and get out both f(x) and f'(x))
- probably others I'm forgetting right now
This would also be an opportunity to streamline the datetime64 and
timedelta64 dtypes. Currently the unit information is IIRC hidden in
some weird metadata thing called PyArray_DatetimeMetaData.

Also, thanks for the notes. It has been an interesting read.

Regards

Antoine.
Feng Yu
2015-08-25 19:46:11 UTC
Permalink
Hi Nathaniel,

Thanks for the notes.

In some sense, the new dtype class(es) will provide a way of
formalizing this `weird` metadata, and probably exposing it to
Python.

May I also ask that you please consider adding a way to declare the
sorting order (priority and direction) of the fields of a structured
array in the new dtype as well?
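
(For reference, a rough sketch of the status quo: today the sort keys must
be spelled out at every call site via NumPy's existing order= argument, and
there is no per-field direction -- which is what the request above would
move into the dtype itself:)

    import numpy as np

    people = np.array([("bob", 30), ("amy", 25), ("amy", 40)],
                      dtype=[("name", "U8"), ("age", "i4")])
    # Priority is given explicitly here, per call, and only ascending.
    print(np.sort(people, order=["name", "age"]))
    # [('amy', 25) ('amy', 40) ('bob', 30)]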

Regards,

Yu
Nathaniel Smith
2015-08-26 06:42:10 UTC
Permalink
Post by Antoine Pitrou
On Tue, 25 Aug 2015 03:03:41 -0700
Post by Nathaniel Smith
Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[...]
This would also be an opportunity to streamline the datetime64 and
timedelta64 dtypes. Currently the unit information is IIRC hidden in
some weird metadata thing called PyArray_DatetimeMetaData.
Yeah, and PyArray_DatetimeMetaData is an "NpyAuxData", which is its
own personal little object system implemented in C with its own
reference counting system... the design of dtypes has great bones, but
the current implementation has a lot of, um, historical baggage.

-n
--
Nathaniel J. Smith -- http://vorpus.org
Francesc Alted
2015-08-26 11:14:19 UTC
Permalink
Hi,

Thanks Nathaniel and others for sparking this discussion as I think it is
very timely.
Post by Nathaniel Smith
Let's focus on evolving numpy as far as we can without major
break-the-world changes (no "numpy 2.0", at least in the foreseeable
future).
And, as a target for that evolution, let's change our focus from
numpy as "NumPy is the library that gives you the np.ndarray object
(plus some attached infrastructure)", to "NumPy provides the
standard framework for working with arrays and array-like objects in
Python"
Sorry to disagree here, but in my opinion NumPy *already* provides the
standard framework for working with arrays and array-like objects in Python
as its huge popularity shows. If what you mean is that there are too many
efforts trying to provide other, specialized data containers (things like
DataFrame in pandas, DataArray/Dataset in xarray or carray/ctable in bcolz
just to mention a few), then let me say that I am of the opinion that there
can't be a silver bullet for tackling all the problems that the PyData
community is facing.

The libraries using specialized data containers (pandas, xray, bcolz...)
may have more or less machinery on top of them, so that conversion to NumPy
does not necessarily happen internally (many times we don't want conversions,
for efficiency), but it is the capability of producing NumPy arrays out of
them (or parts of them) that makes these specialized containers incredibly
more useful to users, because they can use NumPy to fill the missing gaps,
or just use NumPy as an intermediate container that acts as input for other
libraries.

On the subject of why I don't think a universal data container is feasible
for PyData, you just have to look at how many data structures Python
provides in the language itself (tuples, lists, dicts, sets...), and
how many are added in the standard library (like those in the collections
sub-package). Every data container is designed to do a couple of things
(maybe three) well, but for other use cases it is the responsibility of the
user to choose the most appropriate one depending on her needs. In the same
vein, I also think that it makes little sense to try to come up with a
standard solution that is going to satisfy everyone's needs. IMHO, and
despite all efforts, neither NumPy, NumPy 2.0, DyND, bcolz nor any other is
going to offer the universal data container.

Instead of that, let me summarize what users/developers like me need from
NumPy to continue creating more specialized data containers:

1) Keep NumPy simple. NumPy is truly the cornerstone of PyData right now,
and it will be for the foreseeable future, so please keep it usable and
*minimal*. Before adding any more features, the increase in complexity
should be carefully weighed.

2) Make NumPy more flexible. Any rewrite that allows arrays or dtypes to be
subclassed and extended more easily will be a huge win. *But* if in order
to allow flexibility you have to make NumPy much more complex, then point
1) should prevail.

3) Make NumPy a sustainable project. Historically NumPy has depended on
the heroic efforts of individuals to make it what it is now: *an industry
standard*. But individual efforts, while laudable, are not enough, so
please, please, please continue the effort of constituting a governance
team that ensures the future of NumPy (and with it, the whole PyData
community).

Finally, the question of whether NumPy 2.0 or projects like DyND should be
chosen instead for implementing new features is still legitimate, and while
I have my own opinions (favourable to DyND), I still see (such is the price
of technological debt) a distant future where we will find NumPy as we know
it, with more innovation happening in the Python data space.

Again, thanks to all those brave people who are allowing others to build on
top of NumPy's shoulders.

--
Francesc Alted
Pauli Virtanen
2015-08-26 17:58:26 UTC
Permalink
26.08.2015, 14:14, Francesc Alted wrote:
[clip]
Post by Francesc Alted
Post by Nathaniel Smith
Let's focus on evolving numpy as far as we can without major
break-the-world changes (no "numpy 2.0", at least in the foreseeable
future).
And, as a target for that evolution, let's change our focus from
numpy as "NumPy is the library that gives you the np.ndarray object
(plus some attached infrastructure)", to "NumPy provides the
standard framework for working with arrays and array-like objects in
Python"
Sorry to disagree here, but in my opinion NumPy *already* provides the
standard framework for working with arrays and array-like objects in Python
as its huge popularity shows. If what you mean is that there are too many
efforts trying to provide other, specialized data containers (things like
DataFrame in pandas, DataArray/Dataset in xarray or carray/ctable in bcolz
just to mention a few), then let me say that I am of the opinion that there
can't be a silver bullet for tackling all the problems that the PyData
community is facing.
My reading of the above was that this was about multimethods, and
allowing different types of containers to interoperate beyond the array
interface and Python's builtin operator hooks.

The exact performance details of course vary, and an algorithm written
for in-memory arrays just fails for too-large on-disk or distributed
arrays. However, a case for a minimal common API could probably be made,
esp. for algorithms relying mainly on linear algebra.

This is to a degree different from subclassing, as many of the
array-like objects you might want do not have a simple strided memory model.

Pauli
Irwin Zaid
2015-08-26 16:45:51 UTC
Permalink
Hello everyone,

Mark and I thought it would be good to weigh in here and also be explicitly
around to discuss DyND. To be clear, neither of us has strong feelings on
what NumPy *should* do -- we are both long-time NumPy users and we both see
NumPy being around for a while. But, as Francesc mentioned, there is also
the open question of where the community should be implementing new
features. It would certainly be nice to not have duplication of effort, but
a decision like that can only arise naturally from a broad consensus.

Travis covered DyND's history and its relationship with Continuum pretty
well, so what's really missing here is what DyND is, where it is going, and
how long we think it'll take to get there. We'll try to stick to those topics.

We designed DyND to fill what we saw as fundamental gaps in NumPy. These are
not only missing features, but also limitations of its architecture. Many of
these gaps have been mentioned several times before in this thread and
elsewhere, but a brief list would include: better support for missing
values, variable-length strings, GPUs, more extensible types, categoricals,
more datetime features, ... Some of these were indeed on Nathaniel's list
and many of them are already working (albeit sometimes partially) in DyND.

And, yes, we strongly feel that NumPy's fundamental dependence on Python
itself is a limitation. Why should we not take the fantastic success of
NumPy and generalize it across other languages?

So, we see DyND as having a twofold purpose. The first is to expand upon the
kinds of data that NumPy can represent and do computations upon. The second
is to provide a standard array package that can cross the language barrier
and easily interoperate between C++, Python, or whatever you want.

DyND, at the moment, is quite functional in some areas and lacking a bit in
others. There is no doubt that it is still "experimental" and a bit
unstable. But, it has advanced by a lot recently, and we are steadily
working towards something like a version 1.0. In fact, DyND's internal C++
architecture stabilized some time ago -- what's missing now is really solid
coverage of some common use cases, alongside up-to-date Python bindings and
an easy installation process. All of these are in progress and advancing as
quickly as we can make them.

On the other hand, we are also building out some other features. To give
just one example that might excite people, DyND now has Numba
interoperability -- one can write DyND's equivalent of a ufunc in Python
and, with a single decorator, have a broadcasting or reduction callable that
gets JITed or (soon) ahead-of-time compiled.
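
(DyND's own decorator isn't shown here; as a rough flavor of the pattern,
the familiar NumPy-side analogue today is Numba's @vectorize, which turns a
scalar Python function into a JIT-compiled broadcasting ufunc:)

    import numpy as np
    from numba import vectorize

    @vectorize(["float64(float64, float64)"])
    def hypot2(a, b):
        # Scalar kernel; the decorator turns it into a broadcasting ufunc.
        return a * a + b * b

    print(hypot2(np.arange(3.0), 2.0))   # [ 4.  5.  8.]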

Over the next few months, we are hopeful that we can get DyND into a state
where it is largely usable by those familiar with NumPy semantics. The
reason why we can be a bit more aggressive in our timeline now is because of
the great support we are getting from Continuum.

With all that said, we are happy to be a part of any broader conversation
involving NumPy and the community.

All the best,

Irwin and Mark
Antoine Pitrou
2015-08-26 17:11:01 UTC
Permalink
On Wed, 26 Aug 2015 16:45:51 +0000 (UTC)
Post by Irwin Zaid
So, we see DyND as having a twofold purpose. The first is to expand upon the
kinds of data that NumPy can represent and do computations upon. The second
is to provide a standard array package that can cross the language barrier
and easily interoperate between C++, Python, or whatever you want.
One possible limitation is that the lingua franca for language
interoperability is C, not C++. DyND doesn't have to be written in C,
but exposing a nice C API may help make it attractive to the various
language runtimes out there.

(even those languages whose runtime doesn't have a compile-time
interface to C generally have some kind of cffi or ctypes equivalent to
load external C routines at runtime)
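
(A minimal sketch of that runtime-loading point, assuming a Unix-like
system where ctypes.util.find_library can locate libm -- any runtime with a
ctypes/cffi-style FFI can do the equivalent against a plain C API:)

    import ctypes
    import ctypes.util

    # Load a shared C library at runtime and call a routine from it,
    # with no compile-time binding step.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]
    print(libm.cos(0.0))   # 1.0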

Regards

Antoine.
Irwin Zaid
2015-08-26 17:20:13 UTC
Permalink
Post by Antoine Pitrou
One possible limitation is that the lingua franca for language
interoperability is C, not C++. DyND doesn't have to be written in C,
but exposing a nice C API may help make it attractive to the various
language runtimes out there.
That is absolutely true and a C API is on the long-term roadmap. At the
moment, a C API is not needed for DyND to be stable and usable from Python,
which is one reason we aren't doing it now.

Irwin
Mark Wiebe
2015-08-26 17:44:19 UTC
Permalink
Post by Antoine Pitrou
On Wed, 26 Aug 2015 16:45:51 +0000 (UTC)
Post by Irwin Zaid
So, we see DyND as having a twofold purpose. The first is to expand upon the
kinds of data that NumPy can represent and do computations upon. The second
is to provide a standard array package that can cross the language barrier
and easily interoperate between C++, Python, or whatever you want.
One possible limitation is that the lingua franca for language
interoperability is C, not C++. DyND doesn't have to be written in C,
but exposing a nice C API may help make it attractive to the various
language runtimes out there.
(even those languages whose runtime doesn't have a compile-time
interface to C generally have some kind of cffi or ctypes equivalent to
load external C routines at runtime)
I kind of like the path LLVM has chosen here, of a stable C API and an
unstable C++ API. This has both pros and cons though, so I'm not sure what
will be right for DyND in the long term.

-Mark
Mark Wiebe
2015-08-26 17:41:59 UTC
Permalink
I thought I'd add a little more specifically about the kind of
graphics/point cloud work I'm doing right now at Thinkbox, and how it
relates. To echo Francesc's point about NumPy already being an industry
standard, within the VFX/graphics industry there is a reference platform
definition on Linux, and the most recent iteration of that specifies a
version of NumPy. It also includes a bunch of other open source libraries
worth taking a look at if you haven't seen them before:
http://www.vfxplatform.com/

Point cloud/particle system data, mesh geometry, numerical grids (both
dense and sparse), and many other primitive components in graphics are
built out of arrays. What NumPy represents for that kind of data is
amazing. The extra baggage of an API tied to the CPython GIL can be a hard
pill to swallow, though, and this is one of the reasons I'm hopeful that as
DyND continues maturing, it can make inroads into places NumPy hasn't been
able to.

Thanks,
Mark