Discussion:
[Numpy-discussion] Cython-based OpenMP-accelerated quartic polynomial solver
Juha Jeronen
2015-09-29 14:35:01 UTC
Hi all,

I recently developed a Cython-based, OpenMP-accelerated quartic (and
cubic, quadratic) polynomial solver to address a personal research need
for quickly solving a very large number of independent low-degree
polynomial equations for both real and complex coefficients.

For example, on an Intel i7-4710MQ, running with 8 threads this code
solves approximately 1.2e7 quartic polynomial equations per second.
(With OMP_NUM_THREADS=4 the solver gives approximately 8.9e6 solutions
per second, so it seems HyperThreading in this case helps about 30%.)

The algorithms for cubics and quadratics come from Numerical Recipes
(3rd ed.), and the quartic problem is internally reduced to a cubic and
two quadratics, using well-known standard tricks.


Since, to my understanding, NumPy only provides roots() for solving
polynomial equations, and it works one problem at a time, I'm writing to
ask whether there would be interest in including this functionality in
NumPy, if I contribute the code (and clean it up for integration)?
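
(For comparison, with what NumPy provides today, a batch of quartics means
a Python-level loop over roots(); a rough illustration, using made-up
random coefficients:)

---8<---8<---8<---

import numpy as np

coeffs = np.random.rand(10000, 5)   # 10000 quartics, highest-degree coefficient first
all_roots = np.array([np.roots(c) for c in coeffs])   # one eigenvalue problem per call

---8<---8<---8<---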


I have some previous open source contribution experience. I have
authored the IVTC and Phosphor deinterlacers for VLC, and a modular
postprocessing filter framework for the Panda3D game engine (posted at
the forums on panda3d.org, projected for version 1.10).

Currently the polynomial solver is available in a git repository hosted
by our university, the relevant files being:

https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/polysolve2.pyx
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/polysolve.py
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/polysolve2example.py
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/setup.py
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/compile.sh

I'm working on a research grant, which allows me to hold the copyright;
license is BSD.

Polysolve2 is the fast Cython+OpenMP version, while polysolve is an
earlier (slower) experimental version using only NumPy array operations.
The example module contains a usage example, and setup (in the standard
manner) instructs distutils and Cython as to how to build polysolve2
(including setting the relevant math flags and enabling OpenMP).

(The setup script also includes some stuff specific to my solver for the
Ziegler problem; basically, the "tworods" module can be ignored. I
apologize for the inconvenience.)


Thoughts?


Best regards,

-J

-------------------------------------------------
Juha Jeronen, Ph.D.
***@jyu.fi
University of Jyväskylä
Department of Mathematical Information Technology
-------------------------------------------------
Chris Barker - NOAA Federal
2015-09-30 00:48:58 UTC
This sounds pretty cool -- and I've had a use case. So it would be
nice to get it into Numpy.

But: I doubt we'd want OpenMP dependence in Numpy itself.

But maybe a pure Cython non-MP version?

Are we putting Cython in Numpy now? I lost track.

-CHB

Sent from my iPhone
Post by Juha Jeronen
Hi all,
I recently developed a Cython-based, OpenMP-accelerated quartic (and cubic, quadratic) polynomial solver to address a personal research need for quickly solving a very large number of independent low-degree polynomial equations for both real and complex coefficients.
For example, on an Intel i7-4710MQ, running with 8 threads this code solves approximately 1.2e7 quartic polynomial equations per second. (With OMP_NUM_THREADS=4 the solver gives approximately 8.9e6 solutions per second, so it seems HyperThreading in this case helps about 30%.)
The algorithms for cubics and quadratics come from Numerical Recipes (3rd ed.), and the quartic problem is internally reduced to a cubic and two quadratics, using well-known standard tricks.
Since to my understanding, for solving polynomial equations NumPy only provides roots(), which works one problem at a time, I'm writing to inquire whether there would be interest to include this functionality to NumPy, if I contribute the code (and clean it up for integration)?
I have some previous open source contribution experience. I have authored the IVTC and Phosphor deinterlacers for VLC, and a modular postprocessing filter framework for the Panda3D game engine (posted at the forums on panda3d.org, projected for version 1.10).
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/polysolve2.pyx
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/polysolve.py
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/polysolve2example.py
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/setup.py
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/compile.sh
I'm working on a research grant, which allows me to hold the copyright; license is BSD.
Polysolve2 is the fast Cython+OpenMP version, while polysolve is an earlier (slower) experimental version using only NumPy array operations. The example module contains a usage example, and setup (in the standard manner) instructs distutils and Cython as to how to build polysolve2 (including setting the relevant math flags and enabling OpenMP).
(The setup script includes also some stuff specific to my solver for the Ziegler problem; basically, the "tworods" module can be ignored. I apologize for the inconvenience.)
Thoughts?
Best regards,
-J
-------------------------------------------------
Juha Jeronen, Ph.D.
University of Jyväskylä
Department of Mathematical Information Technology
-------------------------------------------------
Charles R Harris
2015-09-30 01:37:37 UTC
On Tue, Sep 29, 2015 at 6:48 PM, Chris Barker - NOAA Federal <
Post by Chris Barker - NOAA Federal
This sounds pretty cool -- and I've had a use case. So it would be
nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
But maybe a pure Cython non-MP version?
Are we putting Cython in Numpy now? I lost track.
Yes, but we need to be careful of Numerical Recipes.

<snip>

Chuck
Juha Jeronen
2015-09-30 09:08:31 UTC
Hi,
Post by Charles R Harris
On Tue, Sep 29, 2015 at 6:48 PM, Chris Barker - NOAA Federal
This sounds pretty cool -- and I've had a use case. So it would be
nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
But maybe a pure Cython non-MP version?
Are we putting Cython in Numpy now? I lost track.
Yes, but we need to be careful of Numerical Recipes.
True.

As I said, only the algorithms come from NR, and the code was written
from scratch. I haven't even looked at any of their code, precisely due
to the restrictive license.


In this particular case, the referenced pages (pp. 227-229, 3rd ed.)
contain only a prose description of the solution process, with
equations - there is no example code printed in the book for section
5.6, Quadratic and Cubic Equations, which is contained in its entirety
on pp. 227-229.

Furthermore, on p. 228, equation (5.6.12), which contains the solutions
of the cubic equation, is attributed to Francois Viete. The reference
given is the treatise "De emendatione", published in 1615.

The same solution, also attributed to Francois Viete, is given in
Wikipedia, but without defining temporary variables to eliminate some
common subexpressions:

https://en.wikipedia.org/wiki/Cubic_function#Trigonometric_.28and_hyperbolic.29_method


The solution to the quadratic equation is the standard one, but with
special care taken to avoid catastrophic cancellation in computing the
other root.
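
(As a sketch of that idea - this is not the solver's actual Cython code,
and it assumes real coefficients, a nonzero leading coefficient and a
non-negative discriminant:)

---8<---8<---8<---

import math

def stable_quadratic_roots(a, b, c):
    # Roots of a*x**2 + b*x + c = 0.
    d = b*b - 4.0*a*c
    q = -0.5 * (b + math.copysign(math.sqrt(d), b))  # never adds two nearly cancelling terms
    return q / a, c / q   # the other root comes from c/q, avoiding the cancelling subtraction

---8<---8<---8<---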

Wikipedia mentions that in the standard formula, this issue exists:

https://en.wikipedia.org/wiki/Quadratic_equation#Avoiding_loss_of_significance

but does not give an explicit remedy. References given are

Kahan, William (November 20, 2004), On the Cost of Floating-Point
Computation Without Extra-Precise Arithmetic. (
http://www.cs.berkeley.edu/~wkahan/Qdrtcs.pdf , see esp. p. 8, which
provides the same remedy as NR )

Higham, Nicholas (2002), Accuracy and Stability of Numerical Algorithms
(2nd ed.), SIAM, p. 10, ISBN 978-0-89871-521-7.


(Browsing through Kahan's text, I see now that I could improve the
discriminant computation to handle marginal cases better. Might do this
in a future version.)


For both of these cases, I've used NR as a reference, because the
authors have taken special care to choose only numerically stable
approaches - which is absolutely critical, yet something that sadly not
all mathematics texts care about.


The quartic equation is not treated in NR. The reduction trick is
reported in Wikipedia:

https://en.wikipedia.org/wiki/Quartic_function#Solving_by_factoring_into_quadratics

where the original reference given is

Brookfield, G. (2007). "Factoring quartic polynomials: A lost art".
Mathematics Magazine 80 (1): 67–70.

(This seemed an elegant approach, given stable solvers for cubics and
quadratics.)


-J

-------------------------------------------------
Juha Jeronen, Ph.D.
***@jyu.fi
University of Jyväskylä
Department of Mathematical Information Technology
-------------------------------------------------
Matthieu Brucher
2015-09-30 09:35:33 UTC
Yes, obviously, the code has NR parts, so it can't be licensed as BSD
as it is...

Matthieu
Post by Charles R Harris
On Tue, Sep 29, 2015 at 6:48 PM, Chris Barker - NOAA Federal
Post by Chris Barker - NOAA Federal
This sounds pretty cool -- and I've had a use case. So it would be
nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
But maybe a pure Cython non-MP version?
Are we putting Cython in Numpy now? I lost track.
Yes, but we need to be careful of Numerical Recipes.
<snip>
Chuck
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Juha Jeronen
2015-09-30 09:49:56 UTC
Hi,

What qualifies as an NR part?

(See my previous message concerning the references; NR is not the only
source which has these algorithms. Again, the code itself is original to
this solver, I haven't even looked at the code provided with NR.)

-J


-------------------------------------------------
Juha Jeronen, Ph.D.
***@jyu.fi
University of Jyväskylä
Department of Mathematical Information Technology
-------------------------------------------------
Post by Matthieu Brucher
Yes, obviously, the code has NR parts, so it can't be licensed as BSD
as it is...
Matthieu
Post by Charles R Harris
On Tue, Sep 29, 2015 at 6:48 PM, Chris Barker - NOAA Federal
Post by Chris Barker - NOAA Federal
This sounds pretty cool -- and I've had a use case. So it would be
nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
But maybe a pure Cython non-MP version?
Are we putting Cython in Numpy now? I lost track.
Yes, but we need to be careful of Numerical Recipes.
<snip>
Chuck
Robert Kern
2015-09-30 17:14:00 UTC
On Wed, Sep 30, 2015 at 10:35 AM, Matthieu Brucher <
Post by Matthieu Brucher
Yes, obviously, the code has NR parts, so it can't be licensed as BSD
as it is...
It's not obvious to me, especially after Juha's further clarifications.

--
Robert Kern
Matthieu Brucher
2015-09-30 17:16:19 UTC
After the clarifications, I agree with you.
Post by Robert Kern
On Wed, Sep 30, 2015 at 10:35 AM, Matthieu Brucher
Post by Matthieu Brucher
Yes, obviously, the code has NR parts, so it can't be licensed as BSD
as it is...
It's not obvious to me, especially after Juha's further clarifications.
--
Robert Kern
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Juha Jeronen
2015-09-30 08:20:11 UTC
Hi,
Post by Chris Barker - NOAA Federal
This sounds pretty cool -- and I've had a use case. So it would be
nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
It is indeed a good point not to add new dependencies for a small
feature such as this. From a software engineering point of view, I fully
agree.

Note however that with the current version of the code, not having
OpenMP will severely limit the performance, especially on quad-core
machines, since an important factor in the speed comes from the parallel
processing of the independent problem instances.

But I suppose there may be another technical option to support multiple
cores (doesn't NumPy already do this in some of its routines?). OpenMP
was just the most convenient solution from the implementation viewpoint.
Post by Chris Barker - NOAA Federal
But maybe a pure Cython non-MP version?
That is of course possible. IIRC, changing the processing to occur on a
single core should only require replacing Cython.parallel.prange() with
range() in the array processing loops.

(Note range(), not xrange(), even for Python 2.x - Cython compiles a
loop using range() into an efficient C loop if some simple compile-time
conditions are fulfilled.)
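
(Schematically - solve_one() below is a hypothetical per-problem kernel,
not the actual polysolve2 code:)

---8<---8<---8<---

from cython.parallel import prange

cdef void solve_one(Py_ssize_t j) nogil:
    pass   # placeholder for the per-problem arithmetic

def run_all(Py_ssize_t n, bint parallel):
    cdef Py_ssize_t j
    if parallel:
        # OpenMP version: iterations are distributed over the threads
        for j in prange(n, nogil=True):
            solve_one(j)
    else:
        # single-core version: range() compiles into a plain C loop
        with nogil:
            for j in range(n):
                solve_one(j)

---8<---8<---8<---
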
Post by Chris Barker - NOAA Federal
Are we putting Cython in Numpy now? I lost track.
I thought so? But then again, I'm just a user :)


-J

-------------------------------------------------
Juha Jeronen, Ph.D.
***@jyu.fi
University of Jyväskylä
Department of Mathematical Information Technology
-------------------------------------------------
Daπid
2015-09-30 09:27:58 UTC
Post by Chris Barker - NOAA Federal
Are we putting Cython in Numpy now? I lost track.
I thought so? But then again, I'm just a user :)
As of this moment, there are three Cython modules in Numpy, so yes.
Releases ship the generated C modules, so Cython is not a dependency.

./numpy/distutils/tests/pyrex_ext/primes.pyx
./numpy/random/mtrand/mtrand.pyx
./tools/allocation_tracking/alloc_hook.pyx

In any case, we should always include a single-threaded version, because
sometimes the computations are already parallelised at a higher level.

Is there a nice way to ship both versions? After all, most implementations
of BLAS and friends do spawn OpenMP threads, so I don't think it would be
outrageous to take advantage of it in more places, provided there is a nice
way to fall back to a serial version when it is not available.
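
(One possible shape for such a fallback - the module and function names
here are purely hypothetical:)

---8<---8<---8<---

try:
    from numpy.polysolve._omp import quartic      # OpenMP build, if it was compiled
except ImportError:
    from numpy.polysolve._serial import quartic   # pure-Cython serial fallback

---8<---8<---8<---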


/David.
Nathaniel Smith
2015-09-30 16:20:43 UTC
On Sep 30, 2015 2:28 AM, "Daπid" <***@gmail.com> wrote:
[...]
Post by Daπid
Is there a nice way to ship both versions? After all, most
implementations of BLAS and friends do spawn OpenMP threads, so I don't
think it would be outrageous to take advantage of it in more places;
provided there is a nice way to fallback to a serial version when it is not
available.

This is incorrect -- the only common implementation of BLAS that uses
*OpenMP* threads is OpenBLAS, and even then it's not the default -- it only
happens if you run it in a special non-default configuration.

The challenges to providing transparent multithreading in numpy generally
are:

- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to
fix this but they still haven't applied it; alternatively there's a
workaround you can use in multiprocessing (not using fork mode), but this
requires every user update their code and the workaround has other
limitations. We're unlikely to use OpenMP while this is the case.

- parallel code in general is not very composable. If someone is calling a
numpy operation from one thread, great, transparently using multiple
threads internally is a win. If they're exploiting some higher-level
structure in their problem to break it into pieces and process each in
parallel, and then using numpy on each piece, then numpy spawning threads
internally will probably destroy performance. And numpy is too low-level to
know which case it's in. This problem exists to some extent already with
multi-threaded BLAS, so people use various BLAS-specific knobs to manage it
in ad hoc ways, but this doesn't scale.
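
(The knobs in question are typically environment variables that have to be
set before the library spins up its threads, e.g.:)

---8<---8<---8<---

import os

# Must be set before NumPy / the BLAS is first imported:
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS's internal thread pool
os.environ["MKL_NUM_THREADS"] = "1"        # MKL-specific override

---8<---8<---8<---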

(Ironically OpenMP is more composable than most approaches to threading,
but only if everyone is using it and, as per above, not everyone is and we
currently can't.)

-n
Sturla Molden
2015-09-30 22:07:33 UTC
Post by Nathaniel Smith
This is incorrect -- the only common implementation of BLAS that uses
*OpenMP* threads is OpenBLAS,
MKL and ACML also use OpenMP.

Sturla
Juha Jeronen
2015-10-01 00:20:13 UTC
Post by Nathaniel Smith
The challenges to providing transparent multithreading in numpy
- gcc + OpenMP on linux still breaks multiprocessing. There's a patch
to fix this but they still haven't applied it; alternatively there's a
workaround you can use in multiprocessing (not using fork mode), but
this requires every user update their code and the workaround has
other limitations. We're unlikely to use OpenMP while this is the case.
Ah, I didn't know this. Thanks.
Post by Nathaniel Smith
- parallel code in general is not very composable. If someone is
calling a numpy operation from one thread, great, transparently using
multiple threads internally is a win. If they're exploiting some
higher-level structure in their problem to break it into pieces and
process each in parallel, and then using numpy on each piece, then
numpy spawning threads internally will probably destroy performance.
And numpy is too low-level to know which case it's in. This problem
exists to some extent already with multi-threaded BLAS, so people use
various BLAS-specific knobs to manage it in ad hoc ways, but this
doesn't scale.
Very good point. I've had both kinds of use cases myself.

It would be nice if there were some way to tell NumPy whether or not to
use additional threads, but that adds complexity. It's also not a
complete solution, considering that any higher-level code building on
NumPy, if it is designed to be at all reusable, may find *itself* in
either role. Only the code that happens to form the top level of a
software project at any particular point in its development has the
required context...

Then again, the matter is further complicated by considering codes that
run on a single machine, versus codes that run on a cluster. Threads
being local to each node in a cluster, it may make sense in a solver
targeted for a cluster to split the problem at the process level,
distribute the processes across the network, and use the threading
capability to accelerate computation on each node.
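
(For instance, using mpi4py - just one possible MPI binding, picked here
for illustration - the process-level split might look like:)

---8<---8<---8<---

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_problems = 1000000
chunk = (n_problems + size - 1) // size
start = rank * chunk
stop = min(start + chunk, n_problems)
# ...each rank then runs the node-local (threaded) solver on its own
# slice of problems, indices start..stop...

---8<---8<---8<---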

A complex issue with probably no easy solutions :)


-J
Sturla Molden
2015-10-01 00:32:49 UTC
Post by Juha Jeronen
Then again, the matter is further complicated by considering codes that
run on a single machine, versus codes that run on a cluster. Threads
being local to each node in a cluster,
You can run MPI programs on a single machine and you get OpenMP
implementations for clusters. Just pick an API and stick with it.

Sturla
Juha Jeronen
2015-10-02 09:47:30 UTC
Post by Sturla Molden
Post by Juha Jeronen
Then again, the matter is further complicated by considering codes that
run on a single machine, versus codes that run on a cluster. Threads
being local to each node in a cluster,
You can run MPI programs on a single machine and you get OpenMP
implementations for clusters. Just pick an API and stick with it.
Mm. I've quite often run MPI locally (it's nice for multicore scientific
computing on Python), but I had no idea that OpenMP had cluster
implementations. Thanks for the tip.

I've got the impression that the way these APIs market themselves is
that MPI is for processes, while OpenMP is for threads, even if this is
not completely true across all implementations.

(If I wanted maximal control over what each process/thread is doing, I'd
go for ZeroMQ :) )


-J
Sturla Molden
2015-10-02 19:41:27 UTC
Post by Juha Jeronen
Mm. I've quite often run MPI locally (it's nice for multicore scientific
computing on Python), but I had no idea that OpenMP had cluster
implementations. Thanks for the tip.
Intel has been selling one, I think there are others too.

OpenMP has a flush pragma for synchronizing shared variables. This means
that OpenMP is not restricted to shared memory hardware. A "pragma omp
flush" can just as well invoke some IPC mechanism, even network
communication.

Sturla
Sturla Molden
2015-10-02 20:00:23 UTC
Post by Sturla Molden
OpenMP has a flush pragma for synchronizing shared variables. This means
that OpenMP is not restricted to shared memory hardware. A "pragma omp
flush" can just as well invoke some IPC mechanism, even network
communication.
By the way, while this is the case for C and Fortran, it is certainly not
the case for Cython. In a Cython prange block, a shared variable is
accessed by dereferencing its address. This requires shared memory. Pure
OpenMP in C does not, because shared variables are not accessed through
pointers, but are rather normal variables that are synchronized with a
pragma.

Cython actually requires that there is a shared address space, and it
invokes something that strictly speaking has undefined behavior under the
OpenMP standard. So thus, a prange block in Cython is expected to work
correctly on a laptop with a multicore processor, but it is not expected to
work correctly on a cluster.

IIRC, Intel's cluster OpenMP is based on MPI, which means the compiler will
internally translate code with OpenMP pragmas into equivalent code that
calls MPI functions. A program written for OpenMP can then run on any
cluster that provides an MPI implementation.
Sturla Molden
2015-10-02 20:15:05 UTC
Post by Sturla Molden
Cython actually requires that there is a shared address space, and it
invokes something that strictly speaking has undefined behavior under the
OpenMP standard. So thus, a prange block in Cython is expected to work
correctly on a laptop with a multicore processor, but it is not expected to
work correctly on a cluster.
OpenMP does not guarantee that dereferencing a pointer in a parallel block
will dereference the same object across all processors. It only guarantees
that the value of a shared object can be synchronized. There are many who
use OpenMP and think only in terms of threads who get this wrong -
Cython is actually among them.

S.M.
Daπid
2015-10-01 06:54:14 UTC
Post by Nathaniel Smith
[...]
Post by Daπid
Is there a nice way to ship both versions? After all, most
implementations of BLAS and friends do spawn OpenMP threads, so I don't
think it would be outrageous to take advantage of it in more places;
provided there is a nice way to fallback to a serial version when it is not
available.
This is incorrect -- the only common implementation of BLAS that uses
*OpenMP* threads is OpenBLAS, and even then it's not the default -- it only
happens if you run it in a special non-default configuration.
Right, sorry. I wanted to say they spawn parallel threads. What do you mean
by a non-default configuration? Setting the OMP_NUM_THREADS?
Post by Nathaniel Smith
The challenges to providing transparent multithreading in numpy generally
- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to
fix this but they still haven't applied it; alternatively there's a
workaround you can use in multiprocessing (not using fork mode), but this
requires every user update their code and the workaround has other
limitations. We're unlikely to use OpenMP while this is the case.
Any idea when this is going to be released?

As I understand it, OpenBLAS doesn't have this problem, am I right?
Post by Nathaniel Smith
- parallel code in general is not very composable. If someone is calling a
numpy operation from one thread, great, transparently using multiple
threads internally is a win. If they're exploiting some higher-level
structure in their problem to break it into pieces and process each in
parallel, and then using numpy on each piece, then numpy spawning threads
internally will probably destroy performance. And numpy is too low-level to
know which case it's in. This problem exists to some extent already with
multi-threaded BLAS, so people use various BLAS-specific knobs to manage it
in ad hoc ways, but this doesn't scale.
(Ironically OpenMP is more composable than most approaches to threading,
but only if everyone is using it and, as per above, not everyone is and we
currently can't.)
That is what I meant by also providing a single-threaded version. The user
can choose whether they want the parallel or the serial one, depending on
the case.
Nathaniel Smith
2015-10-01 07:05:22 UTC
Post by Daπid
Post by Nathaniel Smith
[...]
Post by Daπid
Is there a nice way to ship both versions? After all, most
implementations of BLAS and friends do spawn OpenMP threads, so I don't
think it would be outrageous to take advantage of it in more places;
provided there is a nice way to fallback to a serial version when it is not
available.
This is incorrect -- the only common implementation of BLAS that uses
*OpenMP* threads is OpenBLAS, and even then it's not the default -- it only
happens if you run it in a special non-default configuration.
Right, sorry. I wanted to say they spawn parallel threads. What do you mean
by a non default configuration? Setting he OMP_NUM_THREADS?
I don't remember the details -- I think it might be a special setting
you have to enable when you build OpenBLAS.
Post by Daπid
Post by Nathaniel Smith
The challenges to providing transparent multithreading in numpy generally
- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to
fix this but they still haven't applied it; alternatively there's a
workaround you can use in multiprocessing (not using fork mode), but this
requires every user update their code and the workaround has other
limitations. We're unlikely to use OpenMP while this is the case.
Any idea when is this going to be released?
Which? The gcc patch? I spent 2 full release cycles nagging them and
they still can't be bothered to make a decision either way, so :-(. If
anyone has some ideas for how to get traction in gcc-land then I'm
happy to pass on details...
Post by Daπid
As I understand it, OpenBLAS doesn't have this problem, am I right?
Right, in the default configuration then OpenBLAS will use its own
internal thread pool code, and that code has the fixes needed to work
with fork-based multiprocessing. Of course if you configure OpenBLAS
to use OpenMP instead of its internal thread code then this no longer
applies...

-n
--
Nathaniel J. Smith -- http://vorpus.org
Daπid
2015-10-02 11:05:30 UTC
Post by Nathaniel Smith
Post by Daπid
Post by Nathaniel Smith
- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to
fix this but they still haven't applied it; alternatively there's a
workaround you can use in multiprocessing (not using fork mode), but
this
Post by Daπid
Post by Nathaniel Smith
requires every user update their code and the workaround has other
limitations. We're unlikely to use OpenMP while this is the case.
Any idea when is this going to be released?
Which? The gcc patch? I spent 2 full release cycles nagging them and
they still can't be bothered to make a decision either way, so :-(. If
anyone has some ideas for how to get traction in gcc-land then I'm
happy to pass on details...
:(

Have you tried asking Python-dev for help with this? Hopefully they would
have some weight there.
Sturla Molden
2015-10-05 21:52:23 UTC
Post by Daπid
Have you tried asking Python-dev for help with this? Hopefully they
would have some weight there.
It seems both the GCC devs and Apple (for GCD and Accelerate) have taken a
similar stance on this. There is a tiny set of functions the POSIX
standard demands should work on both sides of a fork without exec, but
OpenMP, GCD, BLAS and LAPACK are not included. As long as there is no
bug, it is hard to convince them to follow Intel and allow fork-based
multiprocessing.

As it stands now, using Intel compilers and MKL is the only way to make
this work, but Intel's development tools are not freeware.

Sturla
Nathaniel Smith
2015-10-05 22:05:33 UTC
Post by Daπid
Have you tried asking Python-dev for help with this? Hopefully they
would have some weight there.
It seems both the GCC devs and Apple (for GCD and Accelerate) have taken a similar
stance on this. There is a tiny set of functions the POSIX standard demands
should work on both sides of a fork without exec, but OpenMP, GCD, BLAS and
LAPACK are not included. As long as there is no bug, it is hard to convince
them to follow Intel and allow fork-based multiprocessing.
To be clear, the GCC devs are open to supporting fork+OpenMP in
principle, they just aren't willing to do it in a way that risks
breaking strict POSIX or OpenMP compatibility. But that isn't even the
problem -- we have a patch that is strictly compatible with POSIX and
OpenMP. The problem is that with the patch, the cases that would
formerly have deadlocked instead leak some memory. This is not a big
deal IMO for a variety of reasons (mostly that a one time leak per
child process is tiny, esp. compared to the current situation with
deadlocks), but it means the patch needs someone serious in the GCC
community to take a look at it carefully, understand what the
tradeoffs actually are, and make a judgement call. And so far we
haven't convinced anyone to do this.

-n
--
Nathaniel J. Smith -- http://vorpus.org
Nathaniel Smith
2015-10-05 22:13:30 UTC
Post by Nathaniel Smith
Post by Daπid
Have you tried asking Python-dev for help with this? Hopefully they
would have some weight there.
It seems both the GCC devs and Apple (for GCD and Accelerate) have taken a similar
stance on this. There is a tiny set of functions the POSIX standard demands
should work on both sides of a fork without exec, but OpenMP, GCD, BLAS and
LAPACK are not included. As long as there is no bug, it is hard to convince
them to follow Intel and allow fork-based multiprocessing.
To be clear, the GCC devs are open to supporting fork+OpenMP in
principle, they just aren't willing to do it in a way that risks
breaking strict POSIX or OpenMP compatibility. But that isn't even the
problem -- we have a patch that is strictly compatible with POSIX and
OpenMP. The problem is that with the patch, the cases that would
formerly have deadlocked instead leak some memory. This is not a big
deal IMO for a variety of reasons (mostly that a one time leak per
child process is tiny, esp. compared to the current situation with
deadlocks), but it means the patch needs someone serious in the GCC
community to take a look at it carefully, understand what the
tradeoffs actually are, and make a judgement call. And so far we
haven't convinced anyone to do this.
Since this discussion's come around again, I finally got curious
enough to check the Intel OpenMP runtime's new(ish) open source
releases, and it turns out that they leak memory in exactly the same
way as the gcc patch :-).
--
Nathaniel J. Smith -- http://vorpus.org
Daπid
2015-10-06 08:14:15 UTC
Post by Nathaniel Smith
- parallel code in general is not very composable. If someone is calling a
numpy operation from one thread, great, transparently using multiple
threads internally is a win. If they're exploiting some higher-level
structure in their problem to break it into pieces and process each in
parallel, and then using numpy on each piece, then numpy spawning threads
internally will probably destroy performance. And numpy is too low-level to
know which case it's in. This problem exists to some extent already with
multi-threaded BLAS, so people use various BLAS-specific knobs to manage it
in ad hoc ways, but this doesn't scale.
One idea: what about creating a "parallel numpy"? There are a few
algorithms that can benefit from parallelisation. This library would mimic
Numpy's signatures, and the user would be responsible for choosing the
single-threaded or the parallel one by just changing np.function(x, y) to
pnp.function(x, y).

If that were deemed a good idea, what would be the best parallelisation
scheme? OpenMP? Threads?
Stephan Hoyer
2015-10-06 08:20:18 UTC
Post by Daπid
One idea: what about creating a "parallel numpy"? There are a few
algorithms that can benefit from parallelisation. This library would mimic
Numpy's signature, and the user would be responsible for choosing the
single threaded or the parallel one by just changing np.function(x, y) to
pnp.function(x, y)
I would recommend taking a look at dask.array [1], which in many cases
works exactly like a parallel NumPy, though it also does lazy and
out-of-core computation. It's a new project, but it's remarkably mature --
we use it as an alternative array backend (to numpy) in xray, and it's also
being used by scikit-image.

[1] http://dask.pydata.org/en/latest/array.html
Post by Daπid
If that were deemed a good one, what would be the best parallelisation
scheme? OpenMP? Threads?
Dask uses threads. That works pretty well as long as all the hard work is
calling into something that releases the GIL (which includes NumPy, of
course).
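
(A minimal illustration of the NumPy-like API, assuming dask.array is
installed:)

---8<---8<---8<---

import dask.array as da

x = da.random.random((100000, 100), chunks=(10000, 100))  # lazy, chunked array
y = (x ** 2).sum(axis=1)   # builds a task graph using NumPy-like syntax
result = y.compute()       # runs the graph, by default on a thread pool

---8<---8<---8<---
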
Sturla Molden
2015-09-30 22:04:50 UTC
Post by Daπid
Is there a nice way to ship both versions? After all, most
implementations of BLAS and friends do spawn OpenMP threads, so I don't
think it would be outrageous to take advantage of it in more places;
Some do, others don't.

ACML uses OpenMP.

GotoBLAS uses OpenMP.

Intel MKL uses Intel TBB and OpenMP (both of them).

OpenBLAS will by default use an internal threadpool. It can be
configured to use OpenMP instead.

ATLAS uses its own threadpool.

Apple Accelerate Framework uses a kernel thread-pool called the Grand
Central Dispatch (GCD). A library called libdispatch uses kqueue to
access the GCD.



There are two principal problems with using OpenMP in NumPy:

One is that the GNU OpenMP threadpool is not fork-safe, and can break
multiprocessing on some platforms (e.g. when using Python 2.7 on Linux).
Anything that uses GCD has this nasty effect on Apple and FreeBSD as
well. Note that the problem is actually in multiprocessing. It is not
present on Windows (there is no fork system call) and it is avoidable
even on Linux with Python 3.4 or later. Also the default builds of NumPy
and SciPy on MacOSX use GCD by default.

The other problem is that NumPy with its iterator objects and gufuncs is
not particularly fit for multithreading. There will be a lot of
contention for the iterator object. Without a major redesign, one thread
would do useful work while most of the others would be busy-waiting on a
spinlock and fighting for iterator access. Not to mention that the
iterator object would be falsely shared between the processors, which
would trash the performance if you have more than one CPU, even when
compared to using a single thread. This means that for multithreading to
be useful in NumPy, the loops will have to be redesigned so that the
work sharing between the threads is taken care of ahead of creating the
iterator, i.e. allowing us to use one iterator per thread. Each iterator
would then iterate over a slice of the original array. This is ok, but
we cannot simply do this by adding OpenMP pragmas in the C code.
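
(Expressed at the Python level, that work-sharing pattern looks roughly
like this, with np.sqrt standing in for any GIL-releasing kernel:)

---8<---8<---8<---

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    return np.sqrt(chunk)   # placeholder for a GIL-releasing kernel

a = np.random.rand(1000000)
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(work, np.array_split(a, 4)))  # one slice per worker
out = np.concatenate(parts)

---8<---8<---8<---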

Given the two problems mentioned above, it would likely be better to use
a fork-safe threadpool instead of OpenMP.


Sturla
Juha Jeronen
2015-10-01 00:32:31 UTC
Post by Sturla Molden
One is that the GNU OpenMP threadpool is not fork-safe, and can break
multiprocessing on some platforms (e.g. when using Python 2.7 on
Linux). Anything that uses GCD has this nasty effect on Apple and
FreeBSD as well. Note that the problem is actually in multiprocessing.
It is not present on Windows (there is no fork system call) and it is
avoidable even on Linux with Python 3.4 or later. Also the default
builds of NumPy and SciPy on MacOSX uses GCD by default.
The other problem is that NumPy with its iterator objects and gufuncs
is not particularly fit for multithreading. There will be a lot of
contention for the iterator object. Without a major redesign, one
thread would do useful work while most of the others would be
busy-waiting on a spinlock and fighting for iterator access. Not to
mention that the iterator object would be false shared between the
processors, which would trash the performance if you have more than
one CPU, even when compared to using a single thread. This means that
for multithreading to be useful in NumPy, the loops will have to be
redesigned so that the work sharing between the threads is taken care
of ahead of creating the iterator, i.e. allowing us to use one
iterator per thread. Each iterator would then iterate over a slice of
the original array. This is ok, but we cannot simply do this by adding
OpenMP pragmas in the C code.
Thanks for the details.
Post by Sturla Molden
Given the two problems mentioned above, it would likely be better to
use a fork-safe threadpool instead of OpenMP.
Sounds good. Out of curiosity, are there any standard fork-safe
threadpools, or would this imply rolling our own?


Also, after Chris's, Nathaniel's and your replies, I'm no longer sure an
attempt at transparent multithreading belongs in the polynomial solver.
Although the additional performance would be (very) nice when running it
from a single-threaded program - after all, quad-cores are common these
days - it is not just this specific solver that encounters the issue,
but instead the same considerations apply to basically all NumPy operations.

So maybe it would be better, at least at first, to make a pure-Cython
version with no attempt at multithreading?


-J
Sturla Molden
2015-10-01 00:52:12 UTC
Post by Juha Jeronen
Sounds good. Out of curiosity, are there any standard fork-safe
threadpools, or would this imply rolling our own?
You have to roll your own.

Basically use pthread_atfork to register a callback that shuts down the
threadpool before a fork and another callback that restarts it. Python's
threading module does not expose the pthread_atfork function, so you
must call it from Cython.

I believe there is also a tiny atfork module in PyPI.
Post by Juha Jeronen
So maybe it would be better, at least at first, to make a pure-Cython
version with no attempt at multithreading?
I would start by making a pure Cython version that works correctly. The
next step would be to ensure that it releases the GIL. After that you
can worry about parallel processing, or just tell the user to use
threads or joblib.
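
(For example, with joblib's threading backend - quartic() below is only a
stub standing in for the single-threaded, GIL-releasing solver:)

---8<---8<---8<---

import numpy as np
from joblib import Parallel, delayed

def quartic(c):
    # stub: the real solver would return the roots of each row of c
    return np.zeros((len(c), 4), dtype=complex)

coeffs = np.random.rand(1000000, 5)
chunks = np.array_split(coeffs, 8)
results = Parallel(n_jobs=8, backend="threading")(
    delayed(quartic)(c) for c in chunks)
roots = np.concatenate(results)

---8<---8<---8<---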


Sturla
Juha Jeronen
2015-10-02 09:58:49 UTC
Post by Sturla Molden
Post by Juha Jeronen
Sounds good. Out of curiosity, are there any standard fork-safe
threadpools, or would this imply rolling our own?
You have to roll your own.
Basically use pthreads_atfork to register a callback that shuts down
the threadpool before a fork and another callback that restarts it.
Python's threading module does not expose the pthreads_atfork
function, so you must call it from Cython.
I believe there is also a tiny atfork module in PyPI.
Ok. Thanks. This approach fixes the issue of the threads not being there
for the child process.

I think it still leaves open the issue of creating the correct number of
threads in the pools for each of the processes when the pool is
restarted (so that in total there will be as many threads as cores
(physical or virtual, whichever the user desires)). But this is again
something that requires context...
Post by Sturla Molden
Post by Juha Jeronen
So maybe it would be better, at least at first, to make a pure-Cython
version with no attempt at multithreading?
I would start by making a pure Cython version that works correctly.
The next step would be to ensure that it releases the GIL. After that
you can worry about parallel processing, or just tell the user to use
threads or joblib.
First version done and uploaded:

https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysolve_for_numpy

OpenMP support removed; this version uses only Cython.

The example program has been renamed to main.py, and setup.py has been
cleaned, removing the irrelevant module.

This folder contains only the files for the polynomial solver.


As I suspected, removing OpenMP support only required changing a few
lines, and dropping the import for Cython.parallel. The "prange"s have
been replaced with "with nogil" and "range".

Note that both the original version and this version release the GIL
when running the processing loops.


It may be better to leave this single-threaded for now. Using Python
threads isn't that difficult and joblib sounds nice, too.

What's the next step?

-J
Daπid
2015-10-02 10:07:10 UTC
Post by Juha Jeronen
https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysolve_for_numpy
Small comment: now you are checking if the input is a scalar or a ndarray,
but it should also accept any array-like. If I pass a list, I expect it to
work, internally converting it into an array.
Juha Jeronen
2015-10-02 10:31:47 UTC
Post by Juha Jeronen
https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysolve_for_numpy
Small comment: now you are checking if the input is a scalar or a
ndarray, but it should also accept any array-like. If I pass a list, I
expect it to work, internally converting it into an array.
Good catch.

Is there an official way to test for array-likes? Or should I always
convert with asarray()? Or something else?


-J
Nathan Goldbaum
2015-09-30 13:57:37 UTC
Hi,
Post by Chris Barker - NOAA Federal
This sounds pretty cool -- and I've had a use case. So it would be
nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
It is indeed a good point not to add new dependencies for a small feature
such as this. From a software engineering point of view, I fully agree.
Note however that with the current version of the code, not having OpenMP
will severely limit the performance, especially on quad-core machines,
since an important factor in the speed comes from the parallel processing
of the independent problem instances.
But I suppose there may be another technical option to support multiple
cores (doesn't NumPy already do this in some of its routines?). OpenMP was
just the most convenient solution from the implementation viewpoint.
But maybe a pure Cython non-MP version?
That is of course possible. IIRC, changing the processing to occur on a
single core should only require replacing Cython.parallel.prange() with
range() in the array processing loops.
(Note range(), not xrange(), even for Python 2.x - Cython compiles a loop
using range() into an efficient C loop if some simple compile-time
conditions are fulfilled.)
Are we putting Cython in Numpy now? I lost track.
I thought so? But then again, I'm just a user :)
For what it's worth (no idea what the order of magnitude of the technical
risk is for something like this in a library like numpy), it's actually
quite simple to dynamically test for OpenMP support at install time.

Here's what we do in yt:

https://bitbucket.org/yt_analysis/yt/src/f8934e13abaf349f58c52c076320d44ee4f61a7f/yt/utilities/lib/setup.py?at=yt&fileviewer=file-view-default#setup.py-8

Basically, just try to compile a simple OpenMP test program in a
subprocess. If that succeeds, then great, we can add -fopenmp as a
compilation flag. If not, don't do that.
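
(In outline - this is a paraphrase of the idea rather than yt's actual
setup.py, and it assumes a GCC-style -fopenmp flag:)

---8<---8<---8<---

import os, shutil, subprocess, tempfile

def have_openmp(cc="cc"):
    code = "#include <omp.h>\nint main(void) { return omp_get_num_threads(); }\n"
    tmpdir = tempfile.mkdtemp()
    try:
        src = os.path.join(tmpdir, "check_openmp.c")
        with open(src, "w") as f:
            f.write(code)
        exe = os.path.join(tmpdir, "check_openmp")
        return subprocess.call([cc, "-fopenmp", src, "-o", exe]) == 0
    except OSError:   # compiler not found
        return False
    finally:
        shutil.rmtree(tmpdir)

---8<---8<---8<---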

This was implemented when Apple switched from GCC 4.2 to LLVM/Clang in OS X
10.7. We've had exactly one bug report about this, and it wasn't even
related to OpenMP; instead, it was an issue with the way we were using
subprocess, which broke when someone had set "CC = ccache gcc".

Hope that's helpful,

Nathan Goldbaum
-J
-------------------------------------------------
Juha Jeronen, Ph.D.
University of Jyväskylä
Department of Mathematical Information Technology
-------------------------------------------------
Chris Barker
2015-09-30 16:20:34 UTC
Note however that with the current version of the code, not having OpenMP
will severely limit the performance, especially on quad-core machines,
since an important factor in the speed comes from the parallel processing
of the independent problem instances.
It seems you got much more than 4 times performance improvement -- which is
the most you'll get from four cores, presumably :-). Not that another 4
times or so isn't a good thing.

But I suppose there may be another technical option to support multiple
cores
python threads with nogil?
For what it's worth (no idea what the order of magnitude of the technical
risk is for something like this in a library like numpy), it's actually
quite simple to dynamically test for OpenMP support at install time.
Basically, just try to compile a simple OpenMP test program in a
subprocess. If that succeeds, then great, we can add -fopenmp as a
compilation flag. If not, don't do that.
This sounds like a compile-time test, not an install-time one. And
install time isn't really helpful either; we really want plain old wheels,
etc. to work.

We'd need a run-time check.

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Sturla Molden
2015-09-30 22:30:42 UTC
Post by Chris Barker
We'd need a run-time check.
We need to amend the compiler classes in numpy.distutils with
OpenMP-relevant information (compiler flags and libraries).

The OpenMP support libraries must be statically linked.


Sturla
Sturla Molden
2015-09-30 23:02:02 UTC
Post by Chris Barker
python threads with nogil?
This is often the easiest, particularly if we construct a fork-safe
threadpool.


Sturla
Juha Jeronen
2015-09-30 23:53:52 UTC
On Wed, Sep 30, 2015 at 6:57 AM, Nathan Goldbaum
Note however that with the current version of the code, not
having OpenMP will severely limit the performance, especially
on quad-core machines, since an important factor in the speed
comes from the parallel processing of the independent problem
instances.
It seems you got much more than 4 times performance improvement
True, good point. That's the power of a short explicit algorithm for a
specific problem :)

(And maybe some Cython, too.)
-- which is the most you'll get from four cores, presumably :-). not
that another 4 times or so isn't a good thing.
True in general :)

To be precise, in this case, HyperThreading seems to offer some extra
performance.

And there seems to be something wonky with the power management on my
laptop. The number I posted (1.2e7 quartics per second) seemed slightly
low compared to earlier informal benchmarking I had done, and indeed,
benchmarking with 8 threads again I'm now getting 1.6e7 quartics per
second (i7-4710MQ at 2.5 GHz, RAM at 1333 MHz, running on Linux Mint 17.2).

Here are some more complete benchmark results. Because the polynomials
to be solved are randomly generated during each run of the program,
unless otherwise noted, I've rerun the benchmark three times for each
number of threads and picked the lowest-performance result, rounded down.

Number of threads vs. quartics solved per second:

1: 3.0e6
2: 6.1e6
4: 1.1e7
8: 1.6e7 (note: higher than previously!)

Threads vs. cubics:

1: 8.4e6
2: 1.6e7
4: 3.0e7
8: 4.6e7

Threads vs. quadratics:

1: 2.5e7
2: 4.8e7
4: 8.5e7 (typical across 9 benchmarks; lowest 5.7e7)
8: 9.5e7

From these, in general the scaling across threads seems pretty good, and
for quartics and cubics, HyperThreading offers some extra performance,
namely about 50% over the 4-thread result in both cases. The total
speedup for quartics is 5.3x (3.6x without HT), and for cubics, 5.4x
(3.5x without HT).

"About 4x, as expected" is still a good characterization, though :)


For quadratics, 4 threads seems to be the useful maximum; improvement
with HT is only 11%, with a total speedup of 3.8x (vs. 3.4x without HT).
These require much fewer arithmetic operations than the higher-order
cases, so a possible explanation is that this case is memory-bound.
Running the numbers on a simplistic estimate, the total of input and
output is far from hitting the DDR3 bandwidth limit, but of course
that's ignoring (at least) latencies and the pattern of data accesses.

Testing the completely silly "linear" case that is there for documenting
code structure only, I get:

1: 1.9e8
2: 3.1e8
4: 3.6e8
8: 2.9e8

which should give a reasonable first approximation for the maximum
practical throughput on this machine, when performing almost no arithmetic.

Anyway, returning from the sidetrack...
But I suppose there may be another technical option to support
multiple cores
python threads with nogil?
...maybe?

I hadn't thought about this option. I suppose that in pseudocode, the
code structure would need to be something like this?

---8<---8<---8<---

from threading import Thread

cdef fast_solver(...) nogil:
    # ...do stuff without touching Python objects...

def my_thread_entry(j_start, j_end):
    cdef unsigned int j
    with nogil:
        for j in range(j_start, j_end):
            fast_solver(...)  # process local slice

t1 = Thread( target=my_thread_entry, args=(0, j_split) )
t2 = Thread( target=my_thread_entry, args=(j_split, j_end_global) )
t1.start()
t2.start()
t1.join()
t2.join()

---8<---8<---8<---


-J
Sturla Molden
2015-09-30 22:50:08 UTC
Post by Nathan Goldbaum
Basically, just try to compile a simple OpenMP test program in a
subprocess. If that succeeds, then great, we can add -fopenmp as a
compilation flag. If not, don't do that.
Unless you use GCC on Linux, it will be more complex than that. I.e. do
you need -fopenmp, -openmp or /openmp? And which libraries do you need
to link, if any? Link GOMP? Link some other OpenMP runtime libraries?
Link pthreads? There is three different pthread implementations to
choose between on Windows, depending on GCC version (MinGW, MinGW-w64,
and Cygwin need different pthread libs for OpenMP). It gets messy. The
only clean way is to add the correct flags and libraries into the
compiler classes in numpy.distutils, and then let distutils and bento
query those.


Sturla
Pauli Virtanen
2015-09-30 10:21:03 UTC
Post by Juha Jeronen
I recently developed a Cython-based, OpenMP-accelerated quartic (and
cubic, quadratic) polynomial solver to address a personal research need
for quickly solving a very large number of independent low-degree
polynomial equations for both real and complex coefficients.
My 2c in this context would be to also think about how this best fits
in with how collections of polynomials are represented in Numpy.

AFAICS, Numpy's polynomial module supports evaluation of polynomial
collections (cf. numpy.polynomial.polynomial.polyval),
but the corresponding root finding routine
(numpy.polynomial.polynomial.polyroots) only deals with one
polynomial at a time.

The present code in principle could be used to speed up the latter
routine after it is generalized to multiple polynomials. The general
case is probably doable just by coding up the companion matrix
approach using low-level routines (or maybe with the new vectorized
np.linalg routines).
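
(Roughly, assuming coefficients in numpy.polynomial's low-to-high order
and a nonzero leading coefficient in every row:)

---8<---8<---8<---

import numpy as np

def polyroots_many(c):
    # c has shape (n_polynomials, degree + 1)
    c = np.asarray(c, dtype=complex)
    deg = c.shape[1] - 1
    comp = np.zeros(c.shape[:1] + (deg, deg), dtype=complex)
    comp[:, 1:, :-1] = np.eye(deg - 1)         # ones on the subdiagonal
    comp[:, :, -1] = -c[:, :-1] / c[:, -1:]    # last column from the monic coefficients
    return np.linalg.eigvals(comp)             # gufunc, vectorized over the first axis

---8<---8<---8<---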

Some relevant code elsewhere for similar purposes can be
found in scipy.interpolate.PPoly/BPoly (and in the future BSpline).

https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.PPoly.html

However, since it's piecewise, there's purposefully support only
for real-valued roots.
--
Pauli Virtanen
Juha Jeronen
2015-09-30 23:56:15 UTC
Post by Pauli Virtanen
Post by Juha Jeronen
I recently developed a Cython-based, OpenMP-accelerated quartic (and
cubic, quadratic) polynomial solver to address a personal research need
for quickly solving a very large number of independent low-degree
polynomial equations for both real and complex coefficients.
My 2c in this context would be to also think how this best fits
in how collections of polynomials are represented in Numpy.
AFAICS, Numpy's polynomial module supports evaluation of polynomial
collections (cf. numpy.polynomial.polynomial.polyval),
but the corresponding root finding routine
(numpy.polynomial.polynomial.polyroots) only deals with one
polynomial at a time.
The present code in principle could be used to speed up the latter
routine after it is generalized to multiple polynomials. The general
case is probably doable just by coding up the companion matrix
approach using low-level routines (or maybe with the new vectorized
np.linalg routines).
Some relevant code elsewhere for similar purposes can be
found in scipy.inteprolate.PPoly/BPoly (and in the future BSpline).
https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.PPoly.html
However, since it's piecewise, there's purposefully support only
for real-valued roots.
Thanks.

It sounds like numpy.polynomial.polynomial.polyroots() is a good
candidate place for this if, as you suggest, it is first generalized to
handle multiple polynomials.

-J
Slavin, Jonathan
2015-10-02 12:52:01 UTC
​Personally I like atleast_1d, which will convert a scalar into a 1d array
but will leave arrays untouched (i.e. won't change the dimensions). Not
sure what the advantages/disadvantages are relative to asarray.

Jon​
Date: Fri, 2 Oct 2015 13:31:47 +0300
Subject: Re: [Numpy-discussion] Cython-based OpenMP-accelerated quartic
polynomial solver
Post by Juha Jeronen
https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysolve_for_numpy
Small comment: now you are checking if the input is a scalar or a ndarray,
but it should also accept any array-like. If I pass a list, I expect it to
work, internally converting it into an array.
Good catch.
Is there an official way to test for array-likes? Or should I always
convert with asarray()? Or something else?
-J
--
________________________________________________________
Jonathan D. Slavin Harvard-Smithsonian CfA
***@cfa.harvard.edu 60 Garden Street, MS 83
phone: (617) 496-7981 Cambridge, MA 02138-1516
cell: (781) 363-0035 USA
________________________________________________________
Ryan May
2015-10-02 14:18:16 UTC
numpy.asanyarray() would be my preferred goto, as it will leave subclasses
of ndarray untouched; asarray() and atleast_1d() force ndarray. It's nice
to do that whenever possible.
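
(A quick illustration, using np.matrix as a stand-in for "some subclass of
ndarray":)

---8<---8<---8<---

import numpy as np

m = np.matrix([[1.0, 2.0]])
type(np.asarray(m))       # -> numpy.ndarray       (subclass dropped)
type(np.asanyarray(m))    # -> the matrix subclass (left untouched)
np.atleast_1d(2.0)        # -> array([ 2.])        (scalar promoted to 1-d)
np.asarray([1.0, 2.0])    # -> array([ 1.,  2.])   (list accepted as array-like)

---8<---8<---8<---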

Ryan
Post by Slavin, Jonathan
​Personally I like atleast_1d, which will convert a scalar into a 1d array
but will leave arrays untouched (i.e. won't change the dimensions. Not
sure what the advantages/disadvantages are relative to asarray.
Jon​
Date: Fri, 2 Oct 2015 13:31:47 +0300
Subject: Re: [Numpy-discussion] Cython-based OpenMP-accelerated quartic
polynomial solver
Post by Juha Jeronen
https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysolve_for_numpy
Small comment: now you are checking if the input is a scalar or a
ndarray, but it should also accept any array-like. If I pass a list, I
expect it to work, internally converting it into an array.
Good catch.
Is there an official way to test for array-likes? Or should I always
convert with asarray()? Or something else?
-J
--
________________________________________________________
Jonathan D. Slavin Harvard-Smithsonian CfA
phone: (617) 496-7981 Cambridge, MA 02138-1516
cell: (781) 363-0035 USA
________________________________________________________
--
Ryan May
Juha Jeronen
2015-10-06 11:06:42 UTC
Hi all,

Thanks Jon and Ryan for the suggestions. Both asanyarray() or
atleast_1d() sound good.

There's the technical detail that Cython needs to know the object type
(e.g. in the parameters to quartic_z_array()), so I think atleast_1d()
may be safer. I've taken this option for now.

This simplified the code somewhat. The *_scalar() functions have been
removed, as they are no longer needed. The *_array() versions have been
renamed, removing the _array suffix.

The return values have stayed as they were - if there is only one
problem to solve, the singleton dimension is dropped, and otherwise a 2D
array is returned.

(The exception is linear(), which does not need the second dimension,
since the solution for each problem is unique. It will return a scalar
in the case of a single problem, or a 1D array in the case of multiple
problems.)


I've pushed the new version. It's available from the same repository:

https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysolve_for_numpy


Other comments? The next step?

-J


P.S.:

I'm not sure how exact the object type must be - i.e. whether Cython
wants to know that the object stores its data somewhat like an ndarray,
or that its C API exactly matches that of an ndarray.

In Cython there are some surprising details regarding this kind of
thing, such as the ctypedef'ing of scalar types. For example, see the
comments about complex support near the beginning of polysolve2.pyx, and
the commented-out SSE2 intrinsics experiment in
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods/tworods.pyx
.

In the latter, it was somewhat tricky to get Cython to recognize __m128d
- turns out it's close enough for Cython to know that it behaves like a
double, although it actually contains two doubles.

(Note that these ctypedefs never end up in the generated C code; Cython
uses them as extra context information when mapping the Python code into C.)

(And no need to worry, I'm not planning to put any SSE into polysolve2 :) )
Post by Ryan May
numpy.asanyarray() would be my preferred goto, as it will leave
subclasses of ndarray untouched; asarray() and atleast_1d() force
ndarray. It's nice to do the whenever possible.
Ryan
On Fri, Oct 2, 2015 at 6:52 AM, Slavin, Jonathan
​ Personally I like atleast_1d, which will convert a scalar into a
1d array but will leave arrays untouched (i.e. won't change the
dimensions. Not sure what the advantages/disadvantages are
relative to asarray.
Jon​
On Fri, Oct 2, 2015 at 7:05 AM,
Date: Fri, 2 Oct 2015 13:31:47 +0300
Subject: Re: [Numpy-discussion] Cython-based
OpenMP-accelerated quartic polynomial solver
Post by Juha Jeronen
https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysolve_for_numpy
Small comment: now you are checking if the input is a scalar
or a ndarray, but it should also accept any array-like. If I
pass a list, I expect it to work, internally converting it
into an array.
Good catch.
Is there an official way to test for array-likes? Or should I
always convert with asarray()? Or something else?
-J
--
________________________________________________________
Jonathan D. Slavin Harvard-Smithsonian CfA
Garden Street, MS 83
phone: (617) 496-7981 Cambridge,
MA 02138-1516
cell: (781) 363-0035 USA
________________________________________________________
--
Ryan May