Discussion:
[Numpy-discussion] Linking other libm-Implementation
Nils Becker
2016-02-08 00:39:28 UTC
Permalink
Hi all,

I wanted to know if there is any sane way to build numpy while linking to a
different implementation of libm?
A drop-in replacement for libm (e.g. openlibm) should in principle work, I
guess, but I did not manage to actually make it work. As far as I
understand the build code, setting MATHLIB=openlibm should suffice, but it
did not. The build works fine, but in the end when running numpy apparently
the functions of the system libm.so are used. I could not verify this
directly (as I do not know how) but noticed that there is no performance
difference between the builds - while there is one with pure C programs
linked against libm and openlibm.
Using amdlibm would require some work as the functions are prefixed with
"_amd", I guess? Using intels libimf should work when using intels
compiler, but I did not try this. With gcc I did not get it to work.

A quite general question: At the moment the performance and the accuracy of
the base mathematical functions depends on the platform and
libm-Implementation of the system. Although there are functions defined in
npy_math, they are only used as fall-backs, if they are not provided by a
library. (correct me if I am wrong here)
Is there some plan to change this in the future and provide defined
behaviour (specified accuracy and/or speed) across platforms? As I
understood it Julia started openlibm for this reason (which is based on
fdlibm/msun, same as npy_math).

Cheers
Nils
Nathaniel Smith
2016-02-08 01:15:06 UTC
Permalink
Post by Nils Becker
Hi all,
I wanted to know if there is any sane way to build numpy while linking to a
different implementation of libm?
A drop-in replacement for libm (e.g. openlibm) should in principle work, I
guess, but I did not manage to actually make it work. As far as I understand
the build code, setting MATHLIB=openlibm should suffice, but it did not. The
build works fine, but in the end when running numpy apparently the functions
of the system libm.so are used. I could not verify this directly (as I do
not know how) but noticed that there is no performance difference between
the builds - while there is one with pure C programs linked against libm and
openlibm.
Using amdlibm would require some work as the functions are prefixed with
"_amd", I guess? Using intels libimf should work when using intels compiler,
but I did not try this. With gcc I did not get it to work.
A quite general question: At the moment the performance and the accuracy of
the base mathematical functions depends on the platform and
libm-Implementation of the system. Although there are functions defined in
npy_math, they are only used as fall-backs, if they are not provided by a
library. (correct me if I am wrong here)
Is there some plan to change this in the future and provide defined
behaviour (specified accuracy and/or speed) across platforms? As I
understood it Julia started openlibm for this reason (which is based on
fdlibm/msun, same as npy_math).
The npy_math functions are used if otherwise unavailable OR if someone
has at some point noticed that say glibc 2.4-2.10 has a bad quality
tan (or whatever) and added a special case hack that checks for those
particular library versions and uses our built-in version instead.
It's not the most convenient setup to maintain, so there's been some
discussion of trying openlibm instead [1], but AFAIK you're the first
person to find the time to actually sit down and try doing it :-).

You should be able to tell what math library you're linked to by
running ldd (on linux) or otool (on OS X) against the .so / .dylib
files inside your built copy of numpy -- e.g.

ldd numpy/core/umath.cpython-34m.so

(exact filename and command will vary depending on python version and platform).

-n

[1] https://github.com/numpy/numpy/search?q=openlibm&type=Issues&utf8=%E2%9C%93
--
Nathaniel J. Smith -- https://vorpus.org
Nils Becker
2016-02-08 11:04:27 UTC
Permalink
Post by Nathaniel Smith
The npy_math functions are used if otherwise unavailable OR if someone
has at some point noticed that say glibc 2.4-2.10 has a bad quality
tan (or whatever) and added a special case hack that checks for those
particular library versions and uses our built-in version instead.
It's not the most convenient setup to maintain, so there's been some
discussion of trying openlibm instead [1], but AFAIK you're the first
person to find the time to actually sit down and try doing it :-).
You should be able to tell what math library you're linked to by
running ldd (on linux) or otool (on OS X) against the .so / .dylib
files inside your built copy of numpy -- e.g.
ldd numpy/core/umath.cpython-34m.so
(exact filename and command will vary depending on python version and platform).
-n
[1]
https://github.com/numpy/numpy/search?q=openlibm&type=Issues&utf8=%E2%9C%93
Ok, I with a little help from someone, at least I got it to work somehow.
Apparently linking to openlibm is not a problem, MATHLIB=openlibm does the
job. The resulting .so-files are linked to openlibm AND libm. I do not know
why, maybe you would have to call gcc with -nostdlib and explicitly include
everything you need.
When running such a build of numpy, however, only the functions in libm are
called.

What did the job was to export LD_PRELOAD=/usr/lib/libopenlibm.so. In that
case the functions from openlibm are used. This works with any build of
numpy and needs no rebuilding. Of course its hacky and not a solution but
at the moment it seems by far the easiest way to use a different libm
implementation. This does also work with intels libimf. It does not work
with amdlibm as they use the prefix amd_ in function names which would
require real changes to the build system.

Very superficial benchmarks (see below) seem devastating for gnu libm. It
seems that openlibm (compiled with gcc -mtune=native -O3) performs really
well and intels libm implementation is the best (on my intel CPU). I did
not check the accuracy of the functions, though.

My own code uses a lot of trigonometric and complex functions (optics
calculations). I'd guess it could go 25% faster by just using a better libm
implementation. Therefore, I have an interest in getting sane linking to a
defined libm implementation to work.

Apparently openlibm seems quite a good choice for numpy, at least
performance wise. However, I did not find any documentation or tests of the
accuracy of its functions. A benchmarking and testing (for accuracy) code
for libms would probably be a good starting point for a discussion. I could
maybe help with that - but apparently not with any linking/building stuff
(I just don't get it).

Benchmark:

gnu libm.so
3000 x sin(double[100000]): 6.68215647800389 s
3000 x log(double[100000]): 8.86350397899514 s
3000 x exp(double[100000]): 6.560557693999726 s

openlibm.so
3000 x sin(double[100000]): 4.5058218560006935 s
3000 x log(double[100000]): 4.106520485998772 s
3000 x exp(double[100000]): 4.597905882001214 s

Intel libimf.so
3000 x sin(double[100000]): 4.282402812998043 s
3000 x log(double[100000]): 4.008453270995233 s
3000 x exp(double[100000]): 3.301279639999848 s
Nathaniel Smith
2016-02-08 17:36:33 UTC
Permalink
On Feb 8, 2016 3:04 AM, "Nils Becker" <***@gmail.com> wrote:
[...]
Post by Nils Becker
Very superficial benchmarks (see below) seem devastating for gnu libm. It
seems that openlibm (compiled with gcc -mtune=native -O3) performs really
well and intels libm implementation is the best (on my intel CPU). I did
not check the accuracy of the functions, though.
Post by Nils Becker
My own code uses a lot of trigonometric and complex functions (optics
calculations). I'd guess it could go 25% faster by just using a better libm
implementation. Therefore, I have an interest in getting sane linking to a
defined libm implementation to work.

On further thought: I guess that to do this we actually will need to change
the names of the functions in openlibm and then use those names when
calling from numpy. So long as we're using the regular libm symbol names,
it doesn't matter what library the python extensions themselves are linked
to; the way ELF symbol lookup works, the libm that the python interpreter
is linked to will be checked *before* checking the libm that numpy is
linked to, so the symbols will all get shadowed.

I guess statically linking openlibm would also work, but not sure that's a
great idea since we'd need it multiple places.
Post by Nils Becker
Apparently openlibm seems quite a good choice for numpy, at least
performance wise. However, I did not find any documentation or tests of the
accuracy of its functions. A benchmarking and testing (for accuracy) code
for libms would probably be a good starting point for a discussion. I could
maybe help with that - but apparently not with any linking/building stuff
(I just don't get it).
Post by Nils Becker
gnu libm.so
3000 x sin(double[100000]): 6.68215647800389 s
3000 x log(double[100000]): 8.86350397899514 s
3000 x exp(double[100000]): 6.560557693999726 s
openlibm.so
3000 x sin(double[100000]): 4.5058218560006935 s
3000 x log(double[100000]): 4.106520485998772 s
3000 x exp(double[100000]): 4.597905882001214 s
Intel libimf.so
3000 x sin(double[100000]): 4.282402812998043 s
3000 x log(double[100000]): 4.008453270995233 s
3000 x exp(double[100000]): 3.301279639999848 s
I would be highly suspicious that this speed comes at the expense of
accuracy... My impression is that there's a lot of room to make
speed/accuracy tradeoffs in these functions, and modern glibc's libm has
seen a fair amount of scrutiny by people who have access to the same code
that openlibm is based off of. But then again, maybe not :-).

If these are the operations that you care about optimizing, an even better
approach might be to figure out how to integrate a vector math library here
like yeppp (BSD licensed) or MKL. Libm tries to optimize log(scalar); these
are libraries that specifically try to optimize log(vector). Adding this
would require changing numpy's code to use these new APIs though. (Very new
gcc can also try to do this in some cases but I don't know how good at it
it is... Julian might.)

-n
Julian Taylor
2016-02-08 17:54:46 UTC
Permalink
[...]
Post by Nils Becker
Very superficial benchmarks (see below) seem devastating for gnu libm.
It seems that openlibm (compiled with gcc -mtune=native -O3) performs
really well and intels libm implementation is the best (on my intel
CPU). I did not check the accuracy of the functions, though.
Post by Nils Becker
My own code uses a lot of trigonometric and complex functions (optics
calculations). I'd guess it could go 25% faster by just using a better
libm implementation. Therefore, I have an interest in getting sane
linking to a defined libm implementation to work.
On further thought: I guess that to do this we actually will need to
change the names of the functions in openlibm and then use those names
when calling from numpy. So long as we're using the regular libm symbol
names, it doesn't matter what library the python extensions themselves
are linked to; the way ELF symbol lookup works, the libm that the python
interpreter is linked to will be checked *before* checking the libm that
numpy is linked to, so the symbols will all get shadowed.
I guess statically linking openlibm would also work, but not sure that's
a great idea since we'd need it multiple places.
Post by Nils Becker
Apparently openlibm seems quite a good choice for numpy, at least
performance wise. However, I did not find any documentation or tests of
the accuracy of its functions. A benchmarking and testing (for accuracy)
code for libms would probably be a good starting point for a discussion.
I could maybe help with that - but apparently not with any
linking/building stuff (I just don't get it).
Post by Nils Becker
gnu libm.so
3000 x sin(double[100000]): 6.68215647800389 s
3000 x log(double[100000]): 8.86350397899514 s
3000 x exp(double[100000]): 6.560557693999726 s
openlibm.so
3000 x sin(double[100000]): 4.5058218560006935 s
3000 x log(double[100000]): 4.106520485998772 s
3000 x exp(double[100000]): 4.597905882001214 s
Intel libimf.so
3000 x sin(double[100000]): 4.282402812998043 s
3000 x log(double[100000]): 4.008453270995233 s
3000 x exp(double[100000]): 3.301279639999848 s
I would be highly suspicious that this speed comes at the expense of
accuracy... My impression is that there's a lot of room to make
speed/accuracy tradeoffs in these functions, and modern glibc's libm has
seen a fair amount of scrutiny by people who have access to the same
code that openlibm is based off of. But then again, maybe not :-).
If these are the operations that you care about optimizing, an even
better approach might be to figure out how to integrate a vector math
library here like yeppp (BSD licensed) or MKL. Libm tries to optimize
log(scalar); these are libraries that specifically try to optimize
log(vector). Adding this would require changing numpy's code to use
these new APIs though. (Very new gcc can also try to do this in some
cases but I don't know how good at it it is... Julian might.)
-n
which version of glibm was used here? There are significant difference
in performance between versions.
Also the input ranges are very important for these functions, depending
on input the speed of these functions can vary by factors of 1000.

glibm now includes vectorized versions of most math functions, does
openlibm have vectorized math?
Thats where most speed can be gained, a lot more than 25%.
Gregor Thalhammer
2016-02-08 22:32:21 UTC
Permalink
[...]
Very superficial benchmarks (see below) seem devastating for gnu libm. It seems that openlibm (compiled with gcc -mtune=native -O3) performs really well and intels libm implementation is the best (on my intel CPU). I did not check the accuracy of the functions, though.
My own code uses a lot of trigonometric and complex functions (optics calculations). I'd guess it could go 25% faster by just using a better libm implementation. Therefore, I have an interest in getting sane linking to a defined libm implementation to work.
On further thought: I guess that to do this we actually will need to change the names of the functions in openlibm and then use those names when calling from numpy. So long as we're using the regular libm symbol names, it doesn't matter what library the python extensions themselves are linked to; the way ELF symbol lookup works, the libm that the python interpreter is linked to will be checked *before* checking the libm that numpy is linked to, so the symbols will all get shadowed.
I guess statically linking openlibm would also work, but not sure that's a great idea since we'd need it multiple places.
Apparently openlibm seems quite a good choice for numpy, at least performance wise. However, I did not find any documentation or tests of the accuracy of its functions. A benchmarking and testing (for accuracy) code for libms would probably be a good starting point for a discussion. I could maybe help with that - but apparently not with any linking/building stuff (I just don't get it).
gnu libm.so
3000 x sin(double[100000]): 6.68215647800389 s
3000 x log(double[100000]): 8.86350397899514 s
3000 x exp(double[100000]): 6.560557693999726 s
openlibm.so
3000 x sin(double[100000]): 4.5058218560006935 s
3000 x log(double[100000]): 4.106520485998772 s
3000 x exp(double[100000]): 4.597905882001214 s
Intel libimf.so
3000 x sin(double[100000]): 4.282402812998043 s
3000 x log(double[100000]): 4.008453270995233 s
3000 x exp(double[100000]): 3.301279639999848 s
I would be highly suspicious that this speed comes at the expense of accuracy... My impression is that there's a lot of room to make speed/accuracy tradeoffs in these functions, and modern glibc's libm has seen a fair amount of scrutiny by people who have access to the same code that openlibm is based off of. But then again, maybe not :-).
If these are the operations that you care about optimizing, an even better approach might be to figure out how to integrate a vector math library here like yeppp (BSD licensed) or MKL. Libm tries to optimize log(scalar); these are libraries that specifically try to optimize log(vector). Adding this would require changing numpy's code to use these new APIs though. (Very new gcc can also try to do this in some cases but I don't know how good at it it is... Julian might.)
Years ago I made the vectorized math functions from Intels Vector Math Library (VML), part of MKL, available for numpy, see https://github.com/geggo/uvml
Not particularly difficult, you not even have to change numpy. For some cases (e.g., exp) I have seen speedups up to 5x-10x. Unfortunately MKL is not free, and free vector math libraries like yeppp implement much fewer functions or do not support the required strided memory layout. But to improve performance, numexpr, numba or theano are much better.

Gregor
-n
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion <https://mail.scipy.org/mailman/listinfo/numpy-discussion>
Nils Becker
2016-02-09 10:21:51 UTC
Permalink
Post by Julian Taylor
which version of glibm was used here? There are significant difference
in performance between versions.
Also the input ranges are very important for these functions, depending
on input the speed of these functions can vary by factors of 1000.
glibm now includes vectorized versions of most math functions, does
openlibm have vectorized math?
Thats where most speed can be gained, a lot more than 25%.
glibc 2.22 was used running on archlinux. As far as I know openlibm does
not include special vectorized functions. (for reference vectorized
operations in glibc: https://sourceware.org/glibc/wiki/libmvec).
Post by Julian Taylor
Years ago I made the vectorized math functions from Intels Vector Math
Library (VML), part of MKL, available for numpy, see
https://github.com/geggo/uvml
Not particularly difficult, you not even have to change numpy. For some
cases (e.g., exp) I have seen speedups up to 5x-10x. Unfortunately MKL is
not free, and free vector math libraries like yeppp implement much fewer
functions or do not support the required strided memory layout. But to
improve performance, numexpr, numba or theano are much better.
Gregor
Thank you very much for the link! I did not know about
numpy.set_numeric_ops.
You are right, vectorized operations can push down calculation time per
element by factors. The benchmarks done for the yeppp-project also indicate
that (however far you would trust them:
http://www.yeppp.info/benchmarks.html). But I would agree that this domain
should be left to specialized tools like numexpr as fully exploiting the
speedup depends on the expression, that should be calculated. It is not
suitable as a standard for numpy.

Still, I think it would be good to give the possibility to choose the libm
numpy links against. And be it simply to allow to choose or guarantee a
specific accuracy/performance on different platforms and systems.
Maybe maintaining a de-facto libm in npy_math could be replaced with a
dependency on e.g. openlibm. But such a decision would require a thorough
benchmark/testing of the available solutions. Especially with respect to
the accuracy-performance-tradeoff that was mentioned.

Cheers
Nils
Gregor Thalhammer
2016-02-09 17:02:41 UTC
Permalink
Post by Julian Taylor
which version of glibm was used here? There are significant difference
in performance between versions.
Also the input ranges are very important for these functions, depending
on input the speed of these functions can vary by factors of 1000.
glibm now includes vectorized versions of most math functions, does
openlibm have vectorized math?
Thats where most speed can be gained, a lot more than 25%.
glibc 2.22 was used running on archlinux. As far as I know openlibm does not include special vectorized functions. (for reference vectorized operations in glibc: https://sourceware.org/glibc/wiki/libmvec <https://sourceware.org/glibc/wiki/libmvec>).
Years ago I made the vectorized math functions from Intels Vector Math Library (VML), part of MKL, available for numpy, see https://github.com/geggo/uvml <https://github.com/geggo/uvml>
Not particularly difficult, you not even have to change numpy. For some cases (e.g., exp) I have seen speedups up to 5x-10x. Unfortunately MKL is not free, and free vector math libraries like yeppp implement much fewer functions or do not support the required strided memory layout. But to improve performance, numexpr, numba or theano are much better.
Gregor
Thank you very much for the link! I did not know about numpy.set_numeric_ops.
You are right, vectorized operations can push down calculation time per element by factors. The benchmarks done for the yeppp-project also indicate that (however far you would trust them: http://www.yeppp.info/benchmarks.html <http://www.yeppp.info/benchmarks.html>). But I would agree that this domain should be left to specialized tools like numexpr as fully exploiting the speedup depends on the expression, that should be calculated. It is not suitable as a standard for bumpy.
Why should numpy not provide fast transcendental math functions? For linear algebra it supports fast implementations, even non-free (MKL). Wouldn’t it be nice if numpy outperforms C?
Still, I think it would be good to give the possibility to choose the libm numpy links against. And be it simply to allow to choose or guarantee a specific accuracy/performance on different platforms and systems.
Maybe maintaining a de-facto libm in npy_math could be replaced with a dependency on e.g. openlibm. But such a decision would require a thorough benchmark/testing of the available solutions. Especially with respect to the accuracy-performance-tradeoff that was mentioned.
Intel publishes accuracy/performance charts for VML/MKL:
https://software.intel.com/sites/products/documentation/doclib/mkl/vm/functions/_accuracyall.html <https://software.intel.com/sites/products/documentation/doclib/mkl/vm/functions/_accuracyall.html>

For GNU libc it is more difficult to find similarly precise data, I only could find:
http://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html <http://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html>

Gregor
Cheers
Nils
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Daπid
2016-02-09 15:06:04 UTC
Permalink
Post by Nathaniel Smith
I would be highly suspicious that this speed comes at the expense of
accuracy... My impression is that there's a lot of room to make
speed/accuracy tradeoffs in these functions, and modern glibc's libm has
seen a fair amount of scrutiny by people who have access to the same code
that openlibm is based off of. But then again, maybe not :-).
I did some digging, and I found this:

http://julia-programming-language.2336112.n4.nabble.com/Is-the-accuracy-of-Julia-s-elementary-functions-exp-sin-known-td32736.html

In short: according to their devs, most openlibm functions are
accurate to less than 1ulp, while GNU libm is rounded to closest
float.


/David.
Matthew Brett
2016-02-09 19:13:58 UTC
Permalink
Post by Daπid
Post by Nathaniel Smith
I would be highly suspicious that this speed comes at the expense of
accuracy... My impression is that there's a lot of room to make
speed/accuracy tradeoffs in these functions, and modern glibc's libm has
seen a fair amount of scrutiny by people who have access to the same code
that openlibm is based off of. But then again, maybe not :-).
http://julia-programming-language.2336112.n4.nabble.com/Is-the-accuracy-of-Julia-s-elementary-functions-exp-sin-known-td32736.html
In short: according to their devs, most openlibm functions are
accurate to less than 1ulp, while GNU libm is rounded to closest
float.
So GNU libm has max error <= 0.5 ULP, openlibm has <= 1 ULP, and OSX
is (almost always) somewhere in-between.

So, is <= 1 ULP good enough?

Cheers,

Matthew
Nils Becker
2016-02-10 19:01:31 UTC
Permalink
Post by Gregor Thalhammer
It is not suitable as a standard for numpy.
Why should numpy not provide fast transcendental math functions? For
linear algebra it supports fast implementations, even non-free (MKL).
Wouldn’t it be nice if numpy outperforms C?

Floating point operations that make use of vector extensions of modern
processors may behave subtly different. This especially concerns floating
point exceptions, where sometimes silent infinities are generated instead
of raising a divide-by-zero exception (best description I could find on the
spot:
https://randomascii.wordpress.com/2012/04/21/exceptional-floating-point/,
also see the notes on C99-compliance of the new vector expressions in
glibc: https://sourceware.org/glibc/wiki/libmvec).
I think the default in numpy should strive to be mostly standard compliant.
But of course an option to activate vector math operations would be nice -
if that is necessary with packages like numexpr is another question.
One other point is the extended/long double type which is normally not
supported by those libraries (as vector extensions cannot handle them).
https://software.intel.com/sites/products/documentation/doclib/mkl/vm/functions/_accuracyall.html
Post by Gregor Thalhammer
For GNU libc it is more difficult to find similarly precise data, I only
could find:
http://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html
http://julia-programming-language.2336112.n4.nabble.com/Is-the-accuracy-of-Julia-s-elementary-functions-exp-sin-known-td32736.html

Thank you for looking that up! I did not knew about the stuff published by
Intel yet.
Post by Gregor Thalhammer
So GNU libm has max error <= 0.5 ULP, openlibm has <= 1 ULP, and OSX
is (almost always) somewhere in-between.
So, is <= 1 ULP good enough?
Calculating transcendental functions correctly rounded is very, very hard
and to my knowledge there is no complete libm implementation that
guarantees the necessary accuracy for all possible inputs. One effort
was/is the Correctly Rounded LibM (crlibm [1]) which tried to prove the
accuracy of their algorithms. However, the performance impact to achieve
that last ulp in all rounding modes can be severe.
Assessing accuracy of a function implementation is hard. Testing all
possible inputs is not feasible (2^32/64 for single/double) and proving
accuracy bounds may be even harder.
Most of the time one samples accuracy with random numbers from a certain
range. This generates tables like the ones for GNU libm or Intel.
This is a kind of "faithful" accuracy as you believe that the accuracy you
tested on a sample extends to the whole argument range. The error in worst
case may be (much) bigger.

That being said, I believe the values given by GNU libm for example are
very trustworthy. libm is not always correctly rounded (which would be <=
0.5ulp in round-to-nearest), however, the error bounds given in the table
seem to cover all worst cases.
Common single-argument functions (sin, cos) are correctly rounded and even
complex two-argument functions (cpow) are at most 5ulp off. I do not think
that other implementations are more accurate.
So libm is definitely good enough, accuracy-wise.

In any case I would like to build a testing framework to compare some libms
and check accuracy/performance (at least Intel has a history of
underestimating their error bounds in transcendental functions [2]). crlibm
offers worst-case arguments for some functions which could be used to
complement randomized sampling. Maybe I have some time in the next weeks...

[1] http://lipforge.ens-lyon.fr/www/crlibm/
[2]
https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/
Loading...