Discussion:
[Numpy-discussion] Numexpr-3.0 proposal
Robert McLeod
2016-02-14 22:19:49 UTC
Hello everyone,

I've done some work on making a new version of Numexpr that would fix some
of the limitations of the original virtual machine with regards to data
types and operation/function count. Basically I re-wrote the Python and C
sides to use 4-byte words, instead of null-terminated strings, for
operations and passing types. This means the number of operations and
types isn't significantly limited anymore.

Francesc Alted suggested I should come here and get some advice from the
community. I wrote a short proposal on the Wiki here:

https://github.com/pydata/numexpr/wiki/Numexpr-3.0-Branch-Overview

One can see my branch here:

https://github.com/robbmcleod/numexpr/tree/numexpr-3.0

If anyone has any comments they'd be welcome. Questions from my side for
the group:

1.) Numpy casting: I downloaded the Numpy source and, after browsing it, it
seems the best approach is probably to just use
numpy.core.numerictypes.find_common_type? (See the sketch after this list for
what I have in mind.)

2.) Can anyone foresee any issues with casting built-in Python types (i.e.
float and int) to their OS-dependent numpy equivalents? Numpy already
seems to do this.

3.) Is anyone enabling the Intel VML library? There are a number of
comments in the code that suggest it's not accelerating the code. It also
seems to cause problems with bundling numexpr with cx_freeze.

4.) I took a stab at converting from distutils to setuptools but this seems
challenging with numpy as a dependency. I wonder if anyone has tried
monkey-patching so that setup.py build_ext uses distutils and then passing the
interpreter .pyd/.so as a data file, or some other such chicanery?
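
Regarding question 1, a minimal sketch of what I have in mind (assuming
find_common_type behaves the way I expect for this use case; np.result_type
would be an alternative):

import numpy as np

# Let numpy decide the common dtype for the operands of an expression,
# including bare Python scalars.
array_types = [np.float32, np.int64]   # dtypes of the array operands
scalar_types = [float]                 # Python scalars in the expression
# find_common_type applies numpy's casting rules; scalars only bump the
# "kind" (e.g. int -> float), not the precision of the array operands.
common = np.core.numerictypes.find_common_type(array_types, scalar_types)
print(common)   # float64, since float32 and int64 have no smaller common type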

(I was going to ask about attaching a debugger, but I just noticed:
https://wiki.python.org/moin/DebuggingWithGdb )

Ciao,

Robert
--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
***@unibas.ch
***@bsse.ethz.ch
***@gmail.com
Ralf Gommers
2016-02-15 06:28:30 UTC
Post by Robert McLeod
4.) I took a stab at converting from distutils to setuptools but this
seems challenging with numpy as a dependency. I wonder if anyone has tried
monkey-patching so that setup.py build_ext uses distutils and then passing the
interpreter .pyd/.so as a data file, or some other such chicanery?
Not sure what you mean, since numexpr already uses setuptools:
https://github.com/pydata/numexpr/blob/master/setup.py#L22. What is the
real goal you're trying to achieve?

This monkeypatching is a bad idea:
https://github.com/robbmcleod/numexpr/blob/numexpr-3.0/setup.py#L19. Both
setuptools and numpy.distutils already do that, and that's already one too
many, so you definitely don't want to add a third place. You can use the
-j (--parallel) flag with numpy.distutils instead; see
http://docs.scipy.org/doc/numpy-dev/user/building.html#parallel-builds

Ralf
Robert McLeod
2016-02-16 08:48:45 UTC
Post by Ralf Gommers
Post by Robert McLeod
4.) I took a stab at converting from distutils to setuptools but this
seems challenging with numpy as a dependency. I wonder if anyone has tried
monkey-patching so that setup.py build_ext uses distutils and then passing the
interpreter .pyd/.so as a data file, or some other such chicanery?
https://github.com/pydata/numexpr/blob/master/setup.py#L22. What is the
real goal you're trying to achieve?
https://github.com/robbmcleod/numexpr/blob/numexpr-3.0/setup.py#L19. Both
setuptools and numpy.distutils already do that, and that's already one too
many, so you definitely don't want to add a third place. You can use the
-j (--parallel) flag with numpy.distutils instead; see
http://docs.scipy.org/doc/numpy-dev/user/building.html#parallel-builds
Ralf
Dear Ralf,

Yes, this appears to be a bad idea. I was trying to see whether I could use
the more object-oriented approach I'm familiar with from setuptools to easily
build wheels for PyPI. Thanks for the comments and links; I didn't know I
could parallelize the numpy build.

Robert
--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
***@unibas.ch
***@bsse.ethz.ch
***@gmail.com
Gregor Thalhammer
2016-02-15 09:43:53 UTC
Post by Robert McLeod
Hello everyone,
I've done some work on making a new version of Numexpr that would fix some of the limitations of the original virtual machine with regards to data types and operation/function count. Basically I re-wrote the Python and C sides to use 4-byte words, instead of null-terminated strings, for operations and passing types. This means the number of operations and types isn't significantly limited anymore.
https://github.com/pydata/numexpr/wiki/Numexpr-3.0-Branch-Overview
https://github.com/robbmcleod/numexpr/tree/numexpr-3.0
1.) Numpy casting: I downloaded the Numpy source and after browsing it seems the best approach is probably to just use numpy.core.numerictypes.find_common_type?
2.) Can anyone foresee any issues with casting built-in Python types (i.e. float and int) to their OS-dependent numpy equivalents? Numpy already seems to do this.
3.) Is anyone enabling the Intel VML library? There are a number of comments in the code that suggest it's not accelerating the code. It also seems to cause problems with bundling numexpr with cx_freeze.
Dear Robert,

thanks for your effort on improving numexpr. Indeed, vectorized math libraries (VML) can give a large boost in performance (~5x), except for a couple of basic operations (add, mul, div), which current compilers are able to vectorize automatically. With recent gcc even more functions are vectorized, see https://sourceware.org/glibc/wiki/libmvec. But you need special compile flags depending on the platform (SSE, AVX present?); runtime detection of processor capabilities would be nice for distributing binaries. Some time ago, since I lost access to Intel's MKL, I patched numexpr to use Accelerate/vecLib on OS X, which is preinstalled on every Mac; see the veclib_support branch of https://github.com/geggo/numexpr.git.
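
To illustrate, a minimal (Linux-only, untested) sketch of what such runtime detection could look like from the Python side, just parsing /proc/cpuinfo; a real implementation would rather use cpuid from the C code:

# Rough sketch of runtime SIMD capability detection (Linux only): parse
# /proc/cpuinfo and report which of the interesting feature flags are present.
def detect_simd_flags(path="/proc/cpuinfo"):
    wanted = {"sse2", "avx", "avx2"}
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    present = set(line.split(":", 1)[1].split())
                    return sorted(wanted & present)
    except OSError:
        pass
    return []   # unknown platform -> assume no vector extensions

print(detect_simd_flags())   # e.g. ['avx', 'sse2']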

As you increased the opcode size, I could imagine providing a bit to switch (at runtime) between the internal functions and the vectorized ones; that would be handy for tests and benchmarks.

Gregor
Post by Robert McLeod
4.) I took a stab at converting from distutils to setuptools but this seems challenging with numpy as a dependency. I wonder if anyone has tried monkey-patching so that setup.py build_ext uses distutils and then passing the interpreter .pyd/.so as a data file, or some other such chicanery?
(I was going to ask about attaching a debugger, but I just noticed: https://wiki.python.org/moin/DebuggingWithGdb )
Ciao,
Robert
--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
Robert McLeod
2016-02-16 09:04:17 UTC
Post by Gregor Thalhammer
Dear Robert,
thanks for your effort on improving numexpr. Indeed, vectorized math
libraries (VML) can give a large boost in performance (~5x), except for a
couple of basic operations (add, mul, div), which current compilers are
able to vectorize automatically. With recent gcc even more functions are
vectorized, see https://sourceware.org/glibc/wiki/libmvec. But you need
special compile flags depending on the platform (SSE, AVX present?); runtime
detection of processor capabilities would be nice for distributing
binaries. Some time ago, since I lost access to Intel's MKL, I patched
numexpr to use Accelerate/vecLib on OS X, which is preinstalled on every
Mac; see the veclib_support branch of https://github.com/geggo/numexpr.git.
As you increased the opcode size, I could imagine providing a bit to
switch (at runtime) between the internal functions and the vectorized ones;
that would be handy for tests and benchmarks.
Dear Gregor,

Your suggestion to separate the opcode signature from the library used to
execute it is very clever. Based on your suggestion, I think that the
natural evolution of the opcodes is to specify them by function signature
and library, using a two-level dict, i.e.

numexpr.interpreter.opcodes['exp_f8f8f8'][gnu]   = some_enum
numexpr.interpreter.opcodes['exp_f8f8f8'][msvc]  = some_enum + 1
numexpr.interpreter.opcodes['exp_f8f8f8'][vml]   = some_enum + 2
numexpr.interpreter.opcodes['exp_f8f8f8'][yeppp] = some_enum + 3

I want to procedurally generate opcodes.cpp and interpreter_body.cpp. If I
do it the way you suggested, funccodes.hpp and the many #defines for
function codes in the interpreter can hopefully be removed, which would
simplify the overall codebase. One could potentially take it a step
further and plan (optimize) each expression, similar to what FFTW does with
regard to matrix shape. That is, the basic way to control the library
would be with a singleton library argument, i.e.:

result = ne.evaluate( "A*log(foo**2 / bar**2)", lib=vml )

However, we could also permit a tuple to be passed in, where each element
of the tuple reflects the library to use for each operation in the AST tree:

result = ne.evaluate( "A*log(foo**2 / bar**2", lib=(gnu,gnu,gnu,yeppp,gnu) )

In this case the ops are (mul,mul,div,log,mul). The op-code picking is
done on the Python side, and this tuple could potentially be optimized by
numexpr itself rather than by hand, by trying various permutations of the
linked C math libraries. The wisdom from the planning could be pickled and
saved in a wisdom file. Currently Numexpr has cacheDict in util.py, but
there's no reason this can't be pickled and saved to disk. I've already done
a similar thing when writing wrappers for PyFFTW.
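
A very rough sketch of the kind of planner I have in mind (plan, the evaluate callback, and the wisdom-file layout are all hypothetical, just to illustrate the idea):

import itertools, pickle, timeit

WISDOM_FILE = "numexpr_wisdom.pkl"

# Hypothetical planner: time an expression with every permutation of
# libraries assigned to its ops, keep the fastest tuple, and cache it on
# disk so later sessions can reuse the "wisdom" (brute force, so only
# sensible for small op counts).
def plan(expr, n_ops, available_libs, evaluate, repeats=3):
    try:
        with open(WISDOM_FILE, "rb") as f:
            wisdom = pickle.load(f)
    except (OSError, EOFError):
        wisdom = {}
    if expr in wisdom:
        return wisdom[expr]

    best_libs, best_time = None, float("inf")
    for libs in itertools.product(available_libs, repeat=n_ops):
        t = min(timeit.repeat(lambda: evaluate(expr, lib=libs),
                              repeat=repeats, number=1))
        if t < best_time:
            best_libs, best_time = libs, t

    wisdom[expr] = best_libs
    with open(WISDOM_FILE, "wb") as f:
        pickle.dump(wisdom, f)
    return best_libs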

Robert
--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
***@unibas.ch
***@bsse.ethz.ch
***@gmail.com
Francesc Alted
2016-02-16 09:52:00 UTC
Post by Robert McLeod
On Mon, Feb 15, 2016 at 10:43 AM, Gregor Thalhammer <
Post by Gregor Thalhammer
Dear Robert,
thanks for your effort on improving numexpr. Indeed, vectorized math
libraries (VML) can give a large boost in performance (~5x), except for a
couple of basic operations (add, mul, div), which current compilers are
able to vectorize automatically. With recent gcc even more functions are
vectorized, see https://sourceware.org/glibc/wiki/libmvec. But you need
special compile flags depending on the platform (SSE, AVX present?); runtime
detection of processor capabilities would be nice for distributing
binaries. Some time ago, since I lost access to Intel's MKL, I patched
numexpr to use Accelerate/vecLib on OS X, which is preinstalled on every
Mac; see the veclib_support branch of https://github.com/geggo/numexpr.git.
As you increased the opcode size, I could imagine providing a bit to
switch (at runtime) between the internal functions and the vectorized ones;
that would be handy for tests and benchmarks.
Dear Gregor,
Your suggestion to separate the opcode signature from the library used to
execute it is very clever. Based on your suggestion, I think that the
natural evolution of the opcodes is to specify them by function signature
and library, using a two-level dict, i.e.
numexpr.interpreter.opcodes['exp_f8f8f8'][gnu]   = some_enum
numexpr.interpreter.opcodes['exp_f8f8f8'][msvc]  = some_enum + 1
numexpr.interpreter.opcodes['exp_f8f8f8'][vml]   = some_enum + 2
numexpr.interpreter.opcodes['exp_f8f8f8'][yeppp] = some_enum + 3
Yes, by using a two-level dictionary you can access the functions
implementing the opcodes much faster, and hence you can add many more opcodes
without too much slowdown.
Post by Robert McLeod
I want to procedurally generate opcodes.cpp and interpreter_body.cpp. If
I do it the way you suggested funccodes.hpp and all the many #define's
regarding function codes in the interpreter can hopefully be removed and
hence simplify the overall codebase. One could potentially take it a step
further and plan (optimize) each expression, similar to what FFTW does with
regards to matrix shape. That is, the basic way to control the library
result = ne.evaluate( "A*log(foo**2 / bar**2", lib=vml )
However, we could also permit a tuple to be passed in, where each element
result = ne.evaluate( "A*log(foo**2 / bar**2", lib=(gnu,gnu,gnu,yeppp,gnu) )
In this case the ops are (mul,mul,div,log,mul). The op-code picking is
done by the Python side, and this tuple could be potentially optimized by
numexpr rather than hand-optimized, by trying various permutations of the
linked C math libraries. The wisdom from the planning could be pickled and
saved in a wisdom file. Currently Numexpr has cacheDict in util.py but
there's no reason this can't be pickled and saved to disk. I've done a
similar thing by creating wrappers for PyFFTW already.
I like the idea of having numexpr probe various permutations of the linked
C math libraries during the initial iteration and then cache the result
somehow. That will probably require run-time detection of the available C
math libraries (keep in mind that a single numexpr binary may run on
different machines with different libraries and computing capabilities), but
in exchange it would allow the fastest execution path to be chosen
independently of the machine that runs the code.
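
Just to illustrate, a minimal sketch of that kind of run-time detection (the candidate library names below are guesses and would need to be adjusted per platform and MKL/Yeppp version):

from ctypes.util import find_library

# Probe for vector math libraries that could back numexpr's opcodes.
# The shared-library names are platform-dependent guesses, purely
# illustrative; a real implementation would keep a per-platform table.
CANDIDATES = {
    "vml":    ["mkl_rt"],       # Intel MKL / VML
    "yeppp":  ["yeppp"],
    "veclib": ["Accelerate"],   # Accelerate/vecLib ships with OS X
}

def detect_math_libs():
    found = {}
    for name, sonames in CANDIDATES.items():
        for soname in sonames:
            path = find_library(soname)
            if path is not None:
                found[name] = path
                break
    return found

print(detect_math_libs())   # e.g. {'vml': 'libmkl_rt.so'}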
--
Francesc Alted