Discussion:
[Numpy-discussion] deterministic, reproducible matmul / __matmul__
Jason Newton
2016-07-11 17:01:49 UTC
Hello

I'm a long-time user of numpy, but one issue I've had with it is
making sure I can reproduce the results of a floating point matrix
multiplication in other languages/modules (like C or on a GPU), or
across installations. I take great pains in doing this type of work
because it allows me to both prototype with python/numpy and also, in
a fairly strong/useful capacity, use it as a strict reference
implementation. For me, small differences accumulate such that
allclose-type thinking starts failing within a few iterations of an
algorithm and things diverge.

I've had success with einsum before in some cases (by chance), where
no difference was ever observed between it and Eigen (C++), but I'm
not sure whether I should keep relying on that. The new @ operator is
also very tempting to use in prototyping.
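
For concreteness, this is the kind of bitwise check I care about (a
minimal sketch; the shapes and seed are arbitrary, and the last
comparison may or may not hold on a given build, since dot and @
typically go through BLAS while einsum uses its own loops):

    import numpy as np

    rng = np.random.RandomState(0)
    a = rng.standard_normal((64, 64))
    b = rng.standard_normal((64, 64))

    c_dot = np.dot(a, b)
    c_at = a @ b                             # __matmul__, also BLAS on typical builds
    c_einsum = np.einsum('ij,jk->ik', a, b)  # einsum's own C loops, typically not BLAS

    # Bit-for-bit equality, not allclose - reproducibility means exact equality.
    print(np.array_equal(c_dot, c_at))
    print(np.array_equal(c_dot, c_einsum))   # may be False depending on the build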

Does the ML have any ideas on how one could get a matmul that will not
allow any funny business in the evaluation of the products? Funny
business here is something like changing the evaluation order of the
additions of terms. I want strict IEEE 754 compliance - no 80-bit
registers, perhaps control of the rounding mode, and no unsafe math
optimizations.
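
To be concrete, the semantics I'm after are just the textbook triple
loop with a fixed left-to-right accumulation order - a sketch in pure
Python, assuming the interpreter does its float arithmetic in IEEE 754
binary64 (the usual case on SSE2 builds):

    import numpy as np

    def matmul_reference(a, b):
        """c[i, k] = sum_j a[i, j] * b[j, k], accumulated strictly left to
        right in double precision, with no reassociation and no extended
        precision intermediates."""
        a = np.asarray(a, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        m, n = a.shape
        n2, p = b.shape
        assert n == n2
        c = np.zeros((m, p))
        for i in range(m):
            for k in range(p):
                acc = 0.0
                for j in range(n):       # fixed summation order
                    acc += float(a[i, j]) * float(b[j, k])
                c[i, k] = acc
        return c

Obviously I don't want to run that loop in Python for real work, but
that is the behaviour I'd want a fast path to match bit for bit.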

I'm definitely willing to sacrifice performance (especially
multi-threaded enhancements, which already cause problems with
reduction ordering) in order to get these guarantees. I was looking
around and found a few BLAS implementations that might be worth a
mention; comments on these would also be welcome:

http://bebop.cs.berkeley.edu/reproblas/
https://exblas.lip6.fr/


-Jason
Pauli Virtanen
2016-07-11 19:27:37 UTC
Post by Jason Newton
Does the ML have any ideas on how one could get a matmul that will not
allow any funny business in the evaluation of the products? Funny
business here is something like changing the evaluation order of the
additions of terms. I want strict IEEE 754 compliance - no 80-bit
registers, perhaps control of the rounding mode, and no unsafe math
optimizations.
If you link Numpy with BLAS and LAPACK libraries that have been
compiled for this purpose, and turn on the compiler flags that enforce
strict IEEE (and disable SSE) when compiling Numpy, you will probably
get reproducible results. Numpy itself just offloads the dot
computations to BLAS, so if your BLAS is reproducible, things should
mostly be OK.

You may also need to turn off the SSE optimizations in Numpy, because
these can make results depend on memory alignment --- not in dot
products, but in other computations.
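
You can check which BLAS/LAPACK your Numpy was actually built against
with:

    import numpy as np

    # Prints the blas/lapack build information recorded at compile time.
    np.show_config()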

Out of curiosity, what is the application where this is necessary?
Maybe there is a numerically stable formulation?
--
Pauli Virtanen
Jason Newton
2016-07-18 21:15:54 UTC
Post by Pauli Virtanen
Post by Jason Newton
Does the ML have any ideas on how one could get a matmul that will not
allow any funny business in the evaluation of the products? Funny
business here is something like changing the evaluation order of the
additions of terms. I want strict IEEE 754 compliance - no 80-bit
registers, perhaps control of the rounding mode, and no unsafe math
optimizations.
If you link Numpy with BLAS and LAPACK libraries that have been
compiled for this purpose, and turn on the compiler flags that enforce
strict IEEE (and disable SSE) when compiling Numpy, you will probably
get reproducible results. Numpy itself just offloads the dot
computations to BLAS, so if your BLAS is reproducible, things should
mostly be OK.
The matrix multiplication in the reference BLAS is not hard to follow,
so that's good. The generalized inverse is a little more difficult.
I've actually had problems building numpy with the reference BLAS
under Windows... Does anyone know where and how to take a ready-made
cblas DLL and get an existing numpy installation (e.g. Anaconda) to
use it?
Post by Pauli Virtanen
You may also need to turn off the SSE optimizations in Numpy, because
these can make results depend on memory alignment --- not in dot
products, but in other computations.
Out of curiosity, what is the application where this is necessary?
Maybe there is a numerically stable formulation?
This can come up in recursive processes, or anywhere that parallelism
(vectorization or other kinds) meets the need for reproducible code
(there are many reasons you'd want this; see the slides below). As I
mentioned, I want reproducible calculations when alternative
implementations are involved, with things like C modules, MPI, FPGAs,
and GPUs coming into the picture. You can usually figure out a
strategy for this when you write everything yourself, but I'd love to
write things in numpy and have it just choose the simple /
straightforward implementations that everything usually boils down to
on these other devices, at least in meaningful pieces. There may be
other cases that get more complicated than what I mention here, but
this is still a very useful building block for those cases (which
often involve multiple accumulators/partitioning, or tree-like
reduction ordering).
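
As a toy illustration of how reduction ordering alone already breaks
bitwise agreement (a sketch; the exact discrepancy depends on the data
and the build):

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.standard_normal(10**5)

    s_serial = 0.0
    for v in x:              # strict left-to-right accumulation
        s_serial += v

    s_numpy = np.sum(x)      # numpy sums in a different (pairwise/blocked) order

    print(s_serial == s_numpy)        # typically False
    print(abs(s_serial - s_numpy))    # small, nonzero, and it compounds over iterations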

I looked into ReproBLAS further, and I also asked the authors of
ReproBLAS to add a cblas wrapper in the hope that I might sometime
have numpy working on top of it. I read a lot of their research
recently and it's pretty cool - they get better accuracy than most
implementations, and the performance is pretty good. Check out the
slides here:
http://people.eecs.berkeley.edu/~hdnguyen/public/talks/ARITH21_Fast_Sum.pdf
(skip to slide 24 for accuracy) - the takeaway is that you actually do
quite a bit better in precision/accuracy with their summation method
than with the naive and alternative implementations. The CPU
performance is worth the trade and really not bad at all for most
operations and purposes - especially since they handle scalability
very well.
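
The flavour of the accuracy argument is easy to see with an
ill-conditioned sum - a sketch using math.fsum as a correctly rounded
reference, not ReproBLAS itself:

    import math
    import numpy as np

    rng = np.random.RandomState(1)
    # Widely varying magnitudes, arranged so the exact sum is 0 by construction.
    x = rng.standard_normal(10**5) * np.logspace(0, 12, 10**5)
    x = np.concatenate([x, -x])
    rng.shuffle(x)

    print(np.sum(x))       # ordinary (pairwise) summation: typically a small nonzero error
    print(math.fsum(x))    # correctly rounded sum of the exact values: zero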

They also just posted a monster of a technical report here
https://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.html -
this details recent performance using AVX and SSE, if anyone is
interested.

I'd love to have a library like this that I could just tell numpy to
switch over to at runtime, at the start of a script.
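
Something along these lines is the shape I have in mind - note that
the ReproBLAS backend here is purely hypothetical (that cblas wrapper
is exactly what I asked the authors for); the point is only that the
backend gets picked once, at the top of a script:

    import numpy as np

    def _numpy_matmul(a, b):
        return np.matmul(a, b)

    # Hypothetical: a ctypes/cffi wrapper around a ReproBLAS gemm-style
    # entry point would be registered here once such a wrapper exists.
    _BACKENDS = {'numpy': _numpy_matmul}
    _matmul = _BACKENDS['numpy']

    def use_matmul_backend(name):
        """Choose the matmul implementation once, at the start of the script."""
        global _matmul
        _matmul = _BACKENDS[name]

    def matmul(a, b):
        """Front end used by the rest of the code, regardless of backend."""
        return _matmul(np.asarray(a, dtype=np.float64),
                       np.asarray(b, dtype=np.float64))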

-Jason
