Nathaniel Smith
2015-08-25 10:03:41 UTC
Hi all,
These are the notes from the NumPy dev meeting held July 7, 2015, at
the SciPy conference in Austin, presented here so the list can keep up
with what happens, and so you can give feedback. Please do give
feedback; none of this is final!
(Also, if anyone who was there notices anything I left out or
mischaracterized, please speak up -- there are a lot of notes here
that I'm trying to gather together, so I could easily have missed
something!)
Thanks to Jill Cowan and the rest of the SciPy organizers for donating
space and organizing logistics for us, and to the Berkeley Institute
for Data Science for funding travel for Jaime, Nathaniel, and
Sebastian.
Attendees
=========
Present in the room for all or part: Daniel Allan, Chris Barker,
Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
pretty sure this list is incomplete)
Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
Formalizing our governance/decision making
==========================================
This was a major focus of discussion. At a high level, the consensus
was to steal IPython's governance document ("IPEP 29") and modify it
to remove its use of a BDFL as a "backstop" to normal community
consensus-based decision-making, and replace it with a new "backstop" based
on Apache-project-style consensus voting amongst the core team.
I'll send out a proper draft of this shortly for further discussion.
Development roadmap
===================
General consensus:
Let's assume NumPy is going to remain important indefinitely, and
try to make it better, instead of waiting for something better to
come along. (This is unlikely to be wasted effort even if something
better does come along, and it's hardly a sure thing that that will
happen anyway.)
Let's focus on evolving numpy as far as we can without major
break-the-world changes (no "numpy 2.0", at least in the foreseeable
future).
And, as a target for that evolution, let's change our conception of
numpy from "NumPy is the library that gives you the np.ndarray object
(plus some attached infrastructure)" to "NumPy provides the
standard framework for working with arrays and array-like objects in
Python".
This means creating defined interfaces between array-like objects /
ufunc objects / dtype objects, so that it becomes possible for third
parties to add their own and mix-and-match. Right now ufuncs are
pretty good at this, but if you want a new array class or dtype then
in most cases you pretty much have to modify numpy itself.
Vision: instead of everyone who wants a new container type having to
reimplement all of numpy, Alice can implement an array class using
(sparse / distributed / compressed / tiled / gpu / out-of-core /
delayed / ...) storage, pass it to code that was written using
direct calls to np.* functions, and it just works. (Instead of
np.sin being "the way you calculate the sine of an ndarray", it's
"the way you calculate the sine of any array-like container
object".)
Vision: Darryl can implement a new dtype for (categorical data /
astronomical dates / integers-with-missing-values / ...) without
having to touch the numpy core.
Vision: Chandni can then come along and combine them by doing
a = alice_array([...], dtype=darryl_dtype)
and it just works.
Vision: no-one is tempted to subclass ndarray, because anything you
can do with an ndarray subclass you can also easily do by defining
your own new class that implements the "array protocol".
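As a concrete illustration of what this buys us (hypothetical --
"alice_array" is the made-up container from the vision above):

    import numpy as np

    def normalize(x):
        # Written against plain np.* calls, with no knowledge of what
        # kind of array-like object 'x' actually is.
        return (x - np.mean(x)) / np.std(x)

    normalize(np.arange(5.0))      # works today, on ndarrays
    # normalize(alice_array(...))  # the goal: works unchanged on any
    #                              # container speaking the protocol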
Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sub-goals:
- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
API right there. (A sketch of what implementing this hook might
look like follows this list.)
- Go through the rest of the stuff in numpy, and figure out some
story for how to let it handle third-party array classes:
- ufunc ALL the things: Some things can be converted directly into
(g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
things could be converted into (g)ufuncs if we extended the
(g)ufunc interface a bit (e.g. np.sort, np.matmul).
- Some things probably need their own __numpy_ufunc__-like
extensions (__numpy_concatenate__?)
- Provide tools to make it easier to implement the more complicated
parts of an array object (e.g. the bazillion different methods,
many of which are ufuncs in disguise, or indexing)
- Longer-run interesting research project: __numpy_ufunc__ requires
that one or the other object have explicit knowledge of how to
handle the other, so to handle binary ufuncs with N array types
you need something like N**2 __numpy_ufunc__ code paths. As an
alternative, if there were some interface that an object could
export that provided the operations nditer needs to efficiently
iterate over (chunks of) it, then you would only need N
implementations of this interface to handle all N**2 operations.
This would solve a lot of problems for projects like:
- blosc
- dask
- distarray
- numpy.ma
- pandas
- scipy.sparse
- xray
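To make the first sub-goal concrete, here's a rough sketch of a toy
container implementing the hook, using the __numpy_ufunc__ signature
as proposed at the time of the meeting (the details were still in
flux, so treat this as illustrative only):

    import numpy as np

    class LoggedArray(object):
        # Toy wrapper that intercepts every ufunc applied to it.
        def __init__(self, data):
            self.data = np.asarray(data)

        # Proposed hook: 'ufunc' is the ufunc object (e.g. np.add),
        # 'method' is "__call__", "reduce", etc., and 'i' is this
        # object's position within 'inputs'.
        def __numpy_ufunc__(self, ufunc, method, i, inputs, **kwargs):
            print("intercepted %s.%s" % (ufunc.__name__, method))
            args = [x.data if isinstance(x, LoggedArray) else x
                    for x in inputs]
            return LoggedArray(getattr(ufunc, method)(*args, **kwargs))

    # np.sin(LoggedArray([1.0, 2.0])) would then return a LoggedArray
    # instead of silently coercing to a plain ndarray.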
Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We already have something like a C-level "dtype
protocol". Conceptually, the way you define a new dtype is by
defining a new class whose instances have data attributes defining
the parameters of the dtype (what fields are in *this* record dtype,
how many characters are in *this* string dtype, what units are used
for *this* datetime64, etc.), and you define a bunch of methods to
do things like convert a Python object to your dtype or vice-versa,
copy an array of your dtype from one place to another, cast to and
from your new dtype, etc. This part is
great.
The problem is, in the current implementation, we don't actually use
the Python object system to define these classes / attributes /
methods. Instead, all possible dtypes are jammed into a single
Python-level class, whose struct has fields for the union of all
possible dtypes' attributes, and instead of Python-style method
slots there's just a big table of function pointers attached to each
object.
So the main proposal is that we keep the basic design, but switch it
so that the float64 dtype, the int64 dtype, etc. actually literally
are subclasses of np.dtype, each implementing their own fields and
Python-style methods.
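None of the following exists yet, but as pure illustration, the
Python-level view of the proposed design might look something like
this (DtypeBase stands in for np.dtype, which can't actually be
subclassed this way today):

    # Hypothetical sketch of the proposal, for illustration only.
    class DtypeBase(object):
        pass  # stand-in for np.dtype

    class CategoricalDtype(DtypeBase):
        # Per-instance parameters, like 'names' on a record dtype:
        def __init__(self, categories):
            self.categories = list(categories)

        # Python-style method slots, instead of a PyArray_ArrFuncs
        # table of C function pointers:
        def to_pyobject(self, code):
            return self.categories[code]

        def from_pyobject(self, obj):
            return self.categories.index(obj)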
Some of the pieces involved in doing this:
- The current dtype methods should be cleaned up -- e.g. 'dot' and
'less_than' are both dtype methods, when conceptually they're much
more like ufuncs.
- The ufunc inner-loop interface currently does not get a reference
to the dtype object, so inner loops can't see its attributes, and this is
a big obstacle to many interesting dtypes (e.g., it's hard to
implement np.equal for categoricals if you don't know what
categories each has). So we need to add new arguments to the core
ufunc loop signature. (Fortunately this can be done in a
backwards-compatible way.)
- We need to figure out what exactly the dtype methods should be,
and add them to the dtype class (possibly with backwards
compatibility shims for anyone who is accessing PyArray_ArrFuncs
directly).
- Casting will be possibly the trickiest thing to work out, though
the basic idea of using dunder-dispatch-like __cast__ and
__rcast__ methods seems workable (a schematic sketch follows this
list). (Encouragingly, this is also exactly what dynd does, though
unfortunately dynd does not yet support user-defined dtypes even to
the extent that numpy does, so there isn't much else we can steal
from them.)
- We may also want to rethink the casting rules while we're at it,
since they have some very weird corners right now (e.g. see
[https://github.com/numpy/numpy/issues/6240])
- We need to migrate the current dtypes over to the new system,
which can be done in stages:
- First stick them all in a single "legacy dtype" class whose
methods just dispatch to the PyArray_ArrFuncs per-object "method
table"
- Then move each of them into their own classes
- We should provide a Python-level wrapper for the protocol, so that
you can call dtype methods from Python
- And vice-versa, it should be possible to subclass dtype at the
Python level
- etc.
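For the casting bullet above, the dunder-dispatch idea would
presumably work the way __add__/__radd__ dispatch does for
operators. A schematic sketch (the method names come from the
discussion; the exact semantics here are invented for illustration):

    # Schematic: ask the source dtype first, then give the target
    # dtype a chance; NotImplemented means "I don't know how".
    def cast_array(arr, target_dtype):
        result = arr.dtype.__cast__(arr, target_dtype)
        if result is NotImplemented:
            result = target_dtype.__rcast__(arr, arr.dtype)
        if result is NotImplemented:
            raise TypeError("no cast path from %r to %r"
                            % (arr.dtype, target_dtype))
        return result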
Fortunately, AFAICT pretty much all of this can be done while
maintaining backwards compatibility (though we may want to break
some obscure cases to avoid expending *too* much effort with weird
backcompat contortions that will only help a vanishingly small
proportion of the userbase), and a lot of the above changes can be
done as semi-independent mini-projects, so there's no need for some
branch to go off and spend a year rewriting the world.
Obviously there are still a lot of details to work out, though. But
overall, there was widespread agreement that this is one of the top
pain points for our users (e.g. it's the single main request from
pandas), and fixing it is a very high priority.
Some features that would become straightforward to implement
(e.g. even in third-party libraries) if this were fixed:
- missing value support
- physical unit tracking (meters / seconds -> array of velocity;
meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes
with attached timezones, or using funky geophysical or
astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that
computes f(x) where x is an array of float64; pass that function
an array with a special dtype and get out both f(x) and f'(x) --
sketched below)
- probably others I'm forgetting right now
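To unpack the automatic differentiation item: forward-mode AD works
by carrying a value and its derivative through every operation in
tandem. A minimal pure-Python sketch of the idea (a real version
would live in the new dtype's arithmetic inner loops):

    import math

    class Dual(object):
        # Pairs f(x) with f'(x) and propagates both through arithmetic.
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv

        def __mul__(self, other):
            if not isinstance(other, Dual):
                other = Dual(other)
            return Dual(self.value * other.value,
                        self.deriv * other.value
                        + self.value * other.deriv)
        __rmul__ = __mul__

        def sin(self):
            return Dual(math.sin(self.value),
                        math.cos(self.value) * self.deriv)

    x = Dual(2.0, 1.0)   # seed: dx/dx = 1
    y = 3.0 * x.sin()    # y.value == 3*sin(2), y.deriv == 3*cos(2)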
I should also note that there was one substantial objection to this
plan, from Travis Oliphant (in discussions later in the
conference). I'm not confident I understand his objections well
enough to reproduce them here, though -- perhaps he'll elaborate.
Money
=====
There was an extensive discussion on the topic of: "if we had money,
what would we do with it?"
This is partially motivated by the realization that there are a
number of sources that we could probably get money from, if we had a
good story for what we wanted to do, so it's not just an idle
question.
Points of general agreement:
- Doing the in-person meeting was a good thing. We should plan to
do that again, at least once a year. So one thing to spend money on
is travel subsidies to make sure that happens and is productive.
- While it's tempting to imagine hiring junior people for the more
frustrating/boring work like maintaining buildbots, release
infrastructure, updating docs, etc., this seems difficult to do
realistically with our current resources -- how do we hire for
this, who would manage them, etc.?
- On the other hand, the general feeling was that if we found the
money to hire a few more senior people who could take care of
themselves more, then that would be good and we could
realistically absorb that extra work without totally unbalancing
the project.
- A major open question is how we would recruit someone for a
position like this, since apparently all the obvious candidates
who are already active on the NumPy team already have other
things going on. [For calibration on how hard this can be: NYU
has apparently had an open position for a year with the job
description of "come work at NYU full-time with a
private-industry-competitive-salary on whatever your personal
open-source scientific project is" (!) and still is having an
extremely difficult time filling it:
[http://cds.nyu.edu/research-engineer/]]
- General consensus, though, was that there isn't much to be done
about this except try it and see.
- (By the way, if you're someone who's reading this and
potentially interested in something like a postdoc (or better) working on
numpy, then let's talk...)
More specific changes to numpy that had general consensus, but don't
really fit into a high-level roadmap
====================================================================
- Resolved: we should merge multiarray.so and umath.so into a single
extension module, so that they can share utility code without the
current awkward contortions.
- Resolved: we should start hiding new fields in the ufunc and dtype
structs as soon as possible going forward. (I.e. they would not be
present in the version of the structs that are exposed through the
C API, but internally we would use a more detailed struct.)
- Mayyyyyybe we should even go ahead and hide the subset of the
existing fields that are really internal details that no-one
should be using. If we did this without changing anything else
then it would preserve ABI (the fields would still be where
existing compiled extensions expect them to be, if any such
extensions exist) while breaking API (trying to compile such
extensions would give a clear error), so would be a smoother
ramp if we think we need to eventually break those fields for
real. (As discussed above, there are a bunch of fields in the
dtype base class that only make sense for specific dtype
subclasses, e.g. only record dtypes need a list of field names,
but right now all dtypes have one anyway. So it would be nice to
remove these from the base class entirely, but that is
potentially ABI-breaking.)
- Resolved: np.array should never return an object array unless
explicitly requested (e.g. with dtype=object); it just causes too
many surprising problems.
- First step: add a deprecation warning
- Eventually: make it an error.
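For a concrete example of the kind of surprise this causes (current
behavior as of this writing):

    import numpy as np

    a = np.array([[1, 2, 3], [4, 5]])  # ragged input, no dtype given
    print(a.dtype)  # object -- silently, with no warning
    a + 1           # TypeError here, far away from the np.array call
                    # that actually created the problematic array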
- The matrix class
- Resolved: We won't add warnings yet, but we will prominently
document that it is deprecated and should be avoided wherever
possible.
- Stéfan van der Walt volunteers to do this.
- We'd all like to deprecate it properly, but the feeling was that
the precondition for this is for scipy.sparse to provide sparse
"arrays" that don't return np.matrix objects on ordinary
operations. Until that happens we can't reasonably tell people
that using np.matrix is a bug.
- Resolved: we should add a similar prominent note to the
"subclassing ndarray" documentation, warning people that this is
painful and barely works and please don't do it if you have any
alternatives.
- Resolved: we want more, smaller releases -- every 6 months at
least, aiming to go even faster (every 4 months?)
- On the question of using Cython inside numpy core:
- Everyone agrees that there are places where this would be an
improvement (e.g., Python<->C interfaces, and places "when you
want to do computer science", e.g. complicated algorithmic stuff
like graph traversals)
- Chuck wanted it to be clear though that he doesn't think it
would be a good goal to try and rewrite all of numpy in Cython
-- there also exist places where Cython ends up being "an uglier
version of C". No-one disagreed.
- Our text reader is apparently not very functional on Python 3, and
generally slow and hard to work with.
- Resolved: We should extract Pandas's awesome text reader/parser
and convert it into its own package, which could then become a
new backend for both pandas and numpy.loadtxt.
- Jeff thinks this is a great idea
- Thomas Caswell volunteers to do the extraction.
- We should work on improving our tools for evolving the ABI, so
that we will eventually be less constrained by decisions made
decades ago.
- One idea that had a lot of support was to switch from our
current append-only C-API to a "sliding window" API based on
explicit versions. So a downstream package might say
#define NUMPY_API_VERSION 4
and they'd get the functions and behaviour provided in "version
4" of the numpy C api. If they wanted to get access to new stuff
that was added in version 5, then they'd need to switch that
#define, and at the same time clean up any usage of stuff that
was removed or changed in version 5. And to provide a smooth
migration path, one version of numpy would support multiple
versions at once, gradually deprecating and dropping old
versions.
- If anyone wants to help bring pip up to scratch WRT tracking ABI
dependencies (e.g., 'pip install numpy==<version with new ABI>'
-> triggers rebuild of scipy against the new ABI), then that
would be an extremely useful thing.
Policies that should be documented
==================================
...together with some notes about what the contents of the document
should be:
How we manage bugs in the bug tracker.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Github "milestones" should *only* be assigned to release-blocker
bugs (which mostly means "regression from the last release").
In particular, if you're tempted to push a bug forward to the next
release... then it's clearly not a blocker, so don't set it to the
next release's milestone, just remove the milestone entirely.
(Obvious exception to this: deprecation followup bugs where we
decide that we want to keep the deprecation around a bit longer
are a case where a bug actually does switch from being a blocker
for release 1.x to being a blocker for release 1.(x+1).)
- Don't hesitate to close an issue if there's no way forward --
e.g. a PR where the author has disappeared. Just post a link to
this policy and close, with a polite note that we need to keep our
tracker useful as a todo list, but they're welcome to re-open if
things change.
Deprecations and breakage policy:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- How long do we need to keep DeprecationWarnings around before we
break things? This is tricky because on the one hand an aggressive
(short) deprecation period lets us deliver new features and
important cleanups more quickly, but on the other hand a
too-aggressive deprecation period is difficult for our more
conservative downstream users.
- Idea that had the most support: pick a somewhat-aggressive
warning period as our default, and make a rule that if someone
asks for an extension during the beta cycle for the release that
removes it, then we put it back for another release or two worth
of grace period. (While also possibly upgrading the warning to
be more visible during the grace period.) This gives us
deprecation periods that are more adaptive on a case-by-case
basis.
- Lament: it would be really nice if we could get more people to
test our beta releases, because in practice right now 1.x.0 ends
up being where we actually discover all the bugs, and 1.x.1 is
where it actually becomes usable. Which sucks, and makes it
difficult to have a solid policy about what counts as a
regression, etc. Is there anything we can do about this?
- ABI breakage: we distinguish between an ABI break that breaks
everything (e.g., "import scipy" segfaults), versus an ABI break
that breaks an occasional rare case (e.g., only apps that poke
around in some obscure corner of some struct are affected).
- The "break-the-world" type remains off-limit for now: the pain
is still too large (conda helps, but there are lots of people
who don't use conda!), and there aren't really any compelling
improvements that this would enable anyway.
- For the "break-0.1%-of-users" type, it is *not* ruled out by
fiat, though we remain conservative: we should treat it like
other API breaks in principle, and do a careful case-by-case
analysis of the details of the situation, taking into account
what kind of code would be broken, how common these cases are,
how important the benefits are, whether there are any specific
mitigation strategies we can use, etc. -- with this process of
course taking into account that a segfault is nastier than a
Python exception.
Other points that were discussed
================================
- There was inconclusive discussion of what we should do with dot()
in the places where it disagrees with the PEP 465 matmul semantics
(specifically this is when both arguments have ndim >= 3, or one
argument has ndim == 0).
- The concern is that the current behavior is not very useful, and
as far as we can tell no-one is using it; but, as people get
used to the more-useful PEP 465 behavior, they will increasingly
try to use it on the assumption that np.dot will work the same
way, and this will create pain for lots of people. So Nathaniel
argued that we should start at least issuing a visible warning
when people invoke the corner-case behavior.
- But OTOH, np.dot is such a core piece of infrastructure, and
there's such a large landscape of code out there using numpy
that we can't see, that others were reasonably wary of making
any change.
- For now: document prominently, but no change in behavior.
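For reference, the divergence in the ndim >= 3 case looks like this
(np.matmul implements the PEP 465 semantics):

    import numpy as np

    a = np.ones((2, 3, 4))
    b = np.ones((2, 4, 5))

    np.dot(a, b).shape     # (2, 3, 2, 5): dot pairs a's last axis with
                           # b's second-to-last, over *all* combinations
                           # of the remaining axes
    np.matmul(a, b).shape  # (2, 3, 5): broadcast the leading axes and
                           # do a stack of 2-d matrix products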
Links to raw notes
==================
Main page:
[https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
Notes from the meeting proper:
[https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing]
Slides from the followup BoF:
[https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp]
Notes from the followup BoF:
[https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit]
-n
--
Nathaniel J. Smith -- http://vorpus.org