Discussion:
[Numpy-discussion] Accelerate or OpenBLAS for numpy / scipy wheels?
Matthew Brett
2016-06-28 03:46:46 UTC
Hi,

I just succeeded in getting an automated dual arch build of numpy and
scipy, using OpenBLAS. See the last three build jobs in these two
build matrices:

https://travis-ci.org/matthew-brett/numpy-wheels/builds/140388119
https://travis-ci.org/matthew-brett/scipy-wheels/builds/140684673

Tests are passing on 32 and 64-bit.

I didn't upload these to the usual Rackspace container at
wheels.scipy.org to avoid confusion.

So, I guess the question now is - should we switch to shipping
OpenBLAS wheels for the next release of numpy and scipy? Or should we
stick with the Accelerate framework that comes with OSX?

In favor of the Accelerate build: it's faster to build, and it's what
we've been doing thus far.

In favor of the OpenBLAS build: it allows us to commit to one BLAS /
LAPACK library cross-platform, once we have the Windows builds working.
Bugs get fixed faster, with good support from the main developer. No
multiprocessing crashes for Python 2.7.

Any thoughts?

Cheers,

Matthew
Charles R Harris
2016-06-28 12:25:44 UTC
Post by Matthew Brett
So, I guess the question now is - should we switch to shipping
OpenBLAS wheels for the next release of numpy and scipy? Or should we
stick with the Accelerate framework that comes with OSX?
[...]
I'm still a bit nervous about OpenBLAS, see
https://github.com/scipy/scipy/issues/6286. That was with version 0.2.18,
which is pretty recent.

Chuck
Matthew Brett
2016-06-28 12:55:07 UTC
Hi,

On Tue, Jun 28, 2016 at 5:25 AM, Charles R Harris wrote:
Post by Charles R Harris
I'm still a bit nervous about OpenBLAS, see
https://github.com/scipy/scipy/issues/6286. That was with version 0.2.18,
which is pretty recent.
Well - we are committed to OpenBLAS already for the Linux wheels, so
if that failure was due to an error in OpenBLAS, we'll have to report
it and get it fixed / fix it ourselves upstream.

Cheers,

Matthew
Ralf Gommers
2016-06-28 14:33:33 UTC
Post by Matthew Brett
[...]
In favor of the Accelerate build: it's faster to build, and it's what
we've been doing thus far.
Faster to build isn't really an argument, right? It should be the same
build time, except for building OpenBLAS itself once per OpenBLAS version.
And it only applies to building wheels for releases - nothing changes for
source builds done by users on OS X. If build time ever becomes a real
issue, then dropping the dual arch stuff is probably the way to go - the
32-bit builds make very little sense these days.

What we've been doing thus far - that is the more important argument.
There's a risk in switching: we may encounter new bugs or lose some
performance in particular functions.
Post by Matthew Brett
Post by Charles R Harris
Post by Matthew Brett
In favor of the OpenBLAS build: it allows us to commit to one BLAS /
LAPACK library cross-platform,
This doesn't really matter too much imho, we have to support Accelerate
either way.
Post by Matthew Brett
once we have the Windows builds working.
Post by Charles R Harris
Post by Matthew Brett
Bugs get fixed faster, with good support from the main developer. No
multiprocessing crashes for Python 2.7.
This is probably the main reason to make the switch, if we decide to do
that.
Post by Matthew Brett
Post by Charles R Harris
I'm still a bit nervous about OpenBLAS, see
https://github.com/scipy/scipy/issues/6286. That was with version 0.2.18,
which is pretty recent.
Well - we are committed to OpenBLAS already for the Linux wheels, so
if that failure was due to an error in OpenBLAS, we'll have to report
it and get it fixed / fix it ourselves upstream.
Indeed. And those wheels have been downloaded a lot already, without any
issues being reported.

I'm +0 on the proposal - the risk seems acceptable, but the reasons to make
the switch are also not super compelling.

Ralf
Matthew Brett
2016-06-28 15:15:12 UTC
Hi,
Post by Ralf Gommers
[...]
Faster to build isn't really an argument, right? It should be the same
build time, except for building OpenBLAS itself once per OpenBLAS version.
And it only applies to building wheels for releases - nothing changes for
source builds done by users on OS X. If build time ever becomes a real
issue, then dropping the dual arch stuff is probably the way to go - the
32-bit builds make very little sense these days.
Yes, that's true, but as you know, the OSX system and Python.org
Pythons are still dual arch, so technically a matching wheel should
also be dual arch. I agree that we're close to the point where there's
near-zero likelihood that the 32-bit arch will ever get exercised.
Post by Ralf Gommers
[...]
I'm +0 on the proposal - the risk seems acceptable, but the reasons to make
the switch are also not super compelling.
I guess I'm about +0.5 (multiprocessing, simplifying mainstream blas /
lapack support) - I'm floating it now because I hadn't got the build
machinery working before.

Cheers,

Matthew
Chris Barker
2016-06-28 15:50:39 UTC
Post by Matthew Brett
Post by Ralf Gommers
dropping the dual arch stuff is probably the way to go - the 32-bit
builds make very little sense these days.
Yes, that's true, but as you know, the OSX system and Python.org
Pythons are still dual arch, so technically a matching wheel should
also be dual arch.
but as they say, practicality beats purity...

It's not clear yet whether 3.6 will be built dual arch, but in any
case, no one is going to go back and change the builds for 2.7 or 3.4
or 3.5 ....

But that doesn't mean we necessarily need to support dual arch downstream.
Personally, I'd drop it and see if anyone screams.

Though it's actually a bit tricky, at least with my knowledge, to build a
64-bit-only extension against the dual-arch build. The only way I figured
out was to hack the install. (I did this a while back when I needed a
32-bit-only build -- ironic?)
Post by Matthew Brett
Post by Ralf Gommers
This doesn't really matter too much imho, we have to support Accelerate
either way.
do we? -- so if we go OpenBLAS, and someone wants to do a simple build from
source, what happens? Do they get Accelerate? Or would we ship OpenBLAS
source itself? Or would they need to install OpenBLAS some other way?
Post by Matthew Brett
Post by Ralf Gommers
Post by Matthew Brett
Bugs get fixed faster, with good support from the main developer. No
multiprocessing crashes for Python 2.7.
this seems to be the compelling one.

How does the performance compare?

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Ralf Gommers
2016-06-28 17:50:39 UTC
Post by Chris Barker
Post by Ralf Gommers
This doesn't really matter too much imho, we have to support Accelerate
either way.
do we? -- so if we go OpenBLAS, and someone wants to do a simple build from
source, what happens? Do they get Accelerate?
Indeed, unless they go through the effort of downloading a separate BLAS
and LAPACK, and figuring out how to make that visible to numpy.distutils.
Very few users will do that.
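
As an aside, checking which BLAS / LAPACK a given numpy ended up linked
against is easy enough. A minimal sketch, assuming a reasonably recent
numpy (the site.cfg paths in the comment are purely illustrative):

    # Minimal sketch: inspect which BLAS / LAPACK numpy was built against.
    import numpy as np
    np.__config__.show()  # prints the BLAS/LAPACK info recorded at build time

    # numpy.distutils can also report what it detects on this machine; an
    # [openblas] section in site.cfg is how a source build gets pointed at
    # a custom OpenBLAS, e.g. (illustrative paths):
    #
    #   [openblas]
    #   libraries = openblas
    #   library_dirs = /opt/OpenBLAS/lib
    #   include_dirs = /opt/OpenBLAS/include
    from numpy.distutils.system_info import get_info
    print(get_info('openblas'))
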
Post by Chris Barker
Or would we ship OpenBLAS source itself?
Definitely don't want to do that.
Post by Chris Barker
Or would they need to install OpenBLAS some other way?
Yes, or MKL, or ATLAS, or BLIS. We have support for all these, and that's a
good thing. Making a uniform choice for our official binaries on various
OSes doesn't reduce the need or effort for supporting those other options.
Post by Chris Barker
Post by Matthew Brett
Post by Ralf Gommers
Post by Matthew Brett
Bugs get fixed faster, with good support from the main developer. No
multiprocessing crashes for Python 2.7.
this seems to be the compelling one.
How does the performance compare?
For most routines performance seems to be comparable, and both are much
better than ATLAS. When there's a significant difference, I have the
impression that OpenBLAS is more often the slower one (example:
https://github.com/xianyi/OpenBLAS/issues/533).

Ralf
Andrew Jaffe
2016-06-29 09:49:25 UTC
Post by Ralf Gommers
[...]
For most routines performance seems to be comparable, and both are much
better than ATLAS. When there's a significant difference, I have the
impression that OpenBLAS is more often the slower one (example:
https://github.com/xianyi/OpenBLAS/issues/533).
In that case:

-1

(but this seems so obvious that I'm probably missing the point of the +1s)
Nathaniel Smith
2016-06-29 19:55:11 UTC
Post by Andrew Jaffe
Post by Ralf Gommers
[...]
For most routines performance seems to be comparable, and both are much
better than ATLAS. When there's a significant difference, I have the
impression that OpenBLAS is more often the slower one (example:
https://github.com/xianyi/OpenBLAS/issues/533).
In that case:
-1
(but this seems so obvious that I'm probably missing the point of the +1s)
Speed is important, but it's far from the only consideration, especially
since differences between the top tier libraries are usually rather small.
(And note that even though that bug is still listed as open, it has a link
to a commit that appears to have fixed it by implementing the missing
kernels.)

The advantage of OpenBLAS is that it's open source, fixable, and we already
focus energy on supporting it for Linux (and probably Windows too, soon).

Accelerate is closed, so when we hit bugs there's often nothing we can
do except file a bug with Apple and hope that it gets fixed within a year
or two. This isn't hypothetical -- we've hit cases where Accelerate gave
wrong answers. Numpy actually carries some scary code right now to work
around one of these bugs by monkeypatching (!) Accelerate using dynamic
linker trickiness. And, of course, there's the thing where Accelerate
totally breaks multiprocessing. Apple has said that they don't consider
this a bug. Which is probably not much comfort to the new users who are
getting obscure hangs when they try to use Python's most obvious and
commonly recommended concurrency library. If you sum across our user base,
I'm 99% sure that this means Accelerate is slower than OpenBLAS on net,
because you need a *lot* of code getting 10% speedups before it cancels out
one person spending 3 days trying to figure out why their code is silently
hanging for no reason.
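
For anyone who hasn't hit it, the failure mode looks roughly like the
sketch below -- names and sizes are made up, but the shape of it is:
touch BLAS in the parent, then fork:

    # Rough sketch of the failure mode (illustrative names and sizes).
    # With an Accelerate-linked numpy on OS X and Python 2.7, the workers
    # can hang inside np.dot: Accelerate's threading state does not
    # survive the fork() that multiprocessing uses on POSIX.
    import multiprocessing
    import numpy as np

    def matmul(x):
        return np.dot(x, x.T)

    if __name__ == '__main__':
        a = np.random.rand(500, 500)
        matmul(a)                       # initializes Accelerate in the parent
        pool = multiprocessing.Pool(2)  # fork()s the worker processes
        print(pool.map(matmul, [a] * 4))  # may hang here - no error, no traceback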

This probably makes me sound more negative about Accelerate than I actually
am -- it does work well most of the time, and obviously lots of people are
using it successfully with numpy. But for our official binaries, my vote is
that we should switch to OpenBLAS, because these binaries are likely to be
used by non-experts who are likely to hit the multiprocessing issue, and
because, when we're already struggling to do sufficient QA on our releases,
it makes sense to focus our efforts on a single BLAS library.

-n
Sturla Molden
2016-07-01 09:16:10 UTC
Post by Nathaniel Smith
Accelerate is closed, so when we hit bugs there's often nothing we can
do except file a bug with Apple and hope that it gets fixed within a
year or two. This isn't hypothetical -- we've hit cases where Accelerate
gave wrong answers. Numpy actually carries some scary code right now to
work around one of these bugs by monkeypatching (!) Accelerate using
dynamic linker trickiness. And, of course, there's the thing where
Accelerate totally breaks multiprocessing.
Yes, those are the cons.
Post by Nathaniel Smith
Apple has said that they don't consider this a bug.
Theoretically they are right, but from a practical perspective...


Sturla
Sturla Molden
2016-07-01 09:18:48 UTC
Post by Nathaniel Smith
Speed is important, but it's far from the only consideration, especially
since differences between the top tier libraries are usually rather small.
It is not even the most important consideration. I would say that
correctness matters most. Everything else comes second.


Sturla
Sturla Molden
2016-06-29 21:06:54 UTC
Post by Ralf Gommers
For most routines performance seems to be comparable, and both are much
better than ATLAS. When there's a significant difference, I have the
impression that OpenBLAS is more often the slower one (example:
https://github.com/xianyi/OpenBLAS/issues/533).
Accelerate is in general better optimized for level-1 and level-2 BLAS than
OpenBLAS. There are two reasons for this:

First, OpenBLAS does not use AVX for these kernels, but Accelerate does.
This is the more important difference. It seems the OpenBLAS devs are now
working on this.

Second, the thread pool in OpenBLAS is not as scalable on small tasks as
the "Grand Central Dispatch" (GCD) used by Accelerate. The GCD thread pool
used by Accelerate is actually quite unique in having a very tiny overhead:
it takes only 16 extra opcodes (IIRC) to run a task on the global parallel
queue instead of on the current thread. (Even if my memory is not perfect
and it is not exactly 16 opcodes, it is within that order of magnitude.)
GCD can do this because the global queues and thread pool are actually
built into the kernel of the OS. OpenBLAS and MKL, on the other hand,
depend on thread pools managed in userspace, of which the scheduler in the
OS has no special knowledge. When you need fine-grained parallelism and
synchronization, there is nothing like GCD. Even a user-space spinlock
will have bigger overhead than a sequential queue in GCD. With a userspace
thread pool all threads are scheduled on a round-robin basis, but with GCD
the scheduler has special knowledge about the tasks put on the queues, and
executes them as fast as possible. Accelerate therefore has a unique
advantage for level-1 and level-2 BLAS routines, one with which OpenBLAS
and MKL can probably never properly compete.

Programming with GCD can actually often be counter-intuitive to someone
used to dealing with OpenMP, MPI or pthreads. For example, it is often
better to enqueue a lot of small tasks instead of splitting up the
computation into large chunks of work. When parallelising a tight loop, a
chunk size of 1 can be great on GCD but is likely to be horrible on OpenMP
and anything else that has userspace threads.
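
As a rough Python analogy for that last point -- multiprocessing's Pool is
a userspace pool (processes rather than threads, but the dispatch-overhead
story is similar), so tiny chunks make the per-task cost visible:

    # Rough analogy: per-task dispatch overhead on a userspace pool.
    # With chunksize=1 every tiny task pays the full scheduling cost;
    # a kernel-level queue like GCD keeps that cost close to zero.
    import time
    from multiprocessing import Pool

    def tiny_task(i):
        return i * i

    if __name__ == '__main__':
        pool = Pool(4)
        for chunksize in (1, 1000):
            t0 = time.time()
            pool.map(tiny_task, range(100000), chunksize=chunksize)
            print('chunksize=%d: %.3f s' % (chunksize, time.time() - t0))
        pool.close()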

Sturla
Ralf Gommers
2016-07-01 21:54:47 UTC
Post by Sturla Molden
Accelerate is in general better optimized for level-1 and level-2 BLAS
than OpenBLAS.
[...]
Thanks Sturla, interesting details as always. You didn't state your
preference by the way, do you have one?

We're building binaries for the average user, so I'd say the AVX thing is
of relevance for the decision to be made, the GCD one less so (people who
care about that will not have any trouble building their own numpy).

So far the score is: one +1, one +0.5, one +0, one -1 and one "still a bit
nervous". Any other takers?

Ralf
Sturla Molden
2016-07-01 23:55:24 UTC
Permalink
Post by Ralf Gommers
Thanks Sturla, interesting details as always. You didn't state your
preference by the way, do you have one?
I use Accelerate because it is easier for me to use when building
SciPy. But that is from a developer's perspective.

As you know, Accelerate breaks a common (ab)use of multiprocessing on POSIX
systems. While the bug is, strictly speaking, in multiprocessing (and is
partially fixed in Python 3.4 and later), it is still a nasty surprise to
many users: e.g. a call to np.dot never returns, and there is no error
message indicating why. That speaks against using it in the wheels.

Accelerate, like MKL and FFTW, has nifty FFTs. If we start to use MKL and
Accelerate for numpy.fft (which I sometimes have fantasies about), that
would shift the balance the other way, in favour of Accelerate.

Speed-wise, Accelerate wins for things like the dot product of two vectors
or multiplication of a vector and a matrix. For general matrix
multiplication the performance is about the same, except when matrices are
very small and Accelerate can benefit from the tiny GCD overhead. But then
the Python overhead probably dominates, so they are going to be about equal
anyway.
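
If anyone wants to compare their own builds, a minimal timing sketch
(sizes are arbitrary) covering the three cases would be something like:

    # Minimal sketch: time level-1 (vector-vector), level-2 (matrix-vector)
    # and level-3 (matrix-matrix) BLAS calls on the current numpy build.
    import timeit
    import numpy as np

    n = 1000
    x = np.random.rand(n)
    A = np.random.rand(n, n)

    print('dot(x, x):', timeit.timeit(lambda: np.dot(x, x), number=10000))
    print('dot(A, x):', timeit.timeit(lambda: np.dot(A, x), number=1000))
    print('dot(A, A):', timeit.timeit(lambda: np.dot(A, A), number=10))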

I am going to vote ±0. I am really not sure which will be better for
the binary wheels. They seem about equal to me right now. There are pros
and cons with either.


Sturla
