[Numpy-discussion] Design feedback solicitation

Discussion:

Pavlyk, Oleksandr

2016-06-17 15:08:19 UTC

Hi,

I am new to this list, so I will start with an introduction. My name is Oleksandr Pavlyk. I now work at Intel Corp. on the Intel Distribution for Python, and previously worked at Wolfram Research for 12 years. My latest project was to write a mirror of numpy.random, named numpy.random_intel. The module uses MKL to sample from different distributions for efficiency. It supports several underlying algorithms for basic pseudo-random number generation: in addition to MT19937, it also provides SFMT19937, MT2203, etc.
I recently published a blog post about it:
https://software.intel.com/en-us/blogs/2016/06/15/faster-random-number-generation-in-intel-distribution-for-python

I originally attempted to simply replace numpy.random in the Intel Distribution for Python with the new module, but because the new module is not backwards compatible for a fixed seed, this results in numerous test failures in numpy, scipy, pandas and other modules.

Unlike numpy.random, the new module generates a vector of random numbers at a time, which can be done faster than repeatedly generating the same number of variates one at a time.

The source code for the new module is not upstreamed yet, and this email is meant to solicit early community feedback to allow for faster acceptance of the proposed changes.

Thank you,
Oleksandr

Robert Kern

2016-06-17 15:22:45 UTC

On Fri, Jun 17, 2016 at 4:08 PM, Pavlyk, Oleksandr <

Post by Pavlyk, Oleksandr
Hi,
I am new to this list, so I will start with an introduction. My name is

Oleksandr Pavlyk. I now work at Intel Corp. on the Intel Distribution for
Python, and previously worked at Wolfram Research for 12 years. My latest
project was to write a mirror to numpy.random, named numpy.random_intel.
The module uses MKL to sample from different distributions for efficiency.
It provides support for different underlying algorithms for basic
pseudo-random number generation, i.e. in addition to MT19937, it also
provides SFMT19937, MT2203, etc.
https://software.intel.com/en-us/blogs/2016/06/15/faster-random-number-generation-in-intel-distribution-for-python

Post by Pavlyk, Oleksandr
I originally attempted to simply replace numpy.random in the Intel

Distribution for Python with the new module, but due to fixed seed
backwards incompatibility this results in numerous test failures in numpy,
scipy, pandas and other modules.

Post by Pavlyk, Oleksandr
Unlike numpy.random, the new module generates a vector of random numbers

at a time, which can be done faster than repeatedly generating the same
number of variates one at a time.

Post by Pavlyk, Oleksandr
The source code for the new module is not upstreamed yet, and this email

is meant to solicit early community feedback to allow for faster acceptance
of the proposed changes.

Cool! You can find pertinent discussion here:

https://github.com/numpy/numpy/issues/6967

And the current effort for adding new core PRNGs here:

https://github.com/bashtage/ng-numpy-randomstate

--
Robert Kern

Charles R Harris

2016-06-17 20:41:57 UTC

Post by Robert Kern
On Fri, Jun 17, 2016 at 4:08 PM, Pavlyk, Oleksandr <

Post by Pavlyk, Oleksandr
Hi,
I am new to this list, so I will start with an introduction. My name is

Post by Pavlyk, Oleksandr
I originally attempted to simply replace numpy.random in the Intel

Distribution for Python with the new module, but due to fixed seed
backwards incompatibility this results in numerous test failures in numpy,
scipy, pandas and other modules.

Post by Pavlyk, Oleksandr
Unlike numpy.random, the new module generates a vector of random numbers

at a time, which can be done faster than repeatedly generating the same
number of variates one at a time.

Post by Pavlyk, Oleksandr
The source code for the new module is not upstreamed yet, and this email

is meant to solicit early community feedback to allow for faster acceptance
of the proposed changes.
https://github.com/numpy/numpy/issues/6967
https://github.com/bashtage/ng-numpy-randomstate

I wonder if the easiest thing to do at this point might be to implement a
new redesigned random module and keep the old one around for backward
compatibility? Not that that would make everything easy, but at least folks
could choose to use the new functions for speed and versatility if they
needed them. The current random module is pretty stable so maintenance
should not be too onerous.

Chuck

Pavlyk, Oleksandr

2016-07-15 01:53:19 UTC

Hi Robert,

Thank you for the pointers.

I think numpy.random should have a mechanism to choose dynamically, at run time, between methods for generating the underlying randomness, as well as an extensible framework to which developers could add more methods. The default would be MT19937 for backwards compatibility. It is important to be able to do this at run time, as it would allow one to use different algorithms in different threads (such as different members of the parallel Mersenne Twister family of generators; see MT2203).
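
A minimal sketch of what run-time selection could look like, in plain Python. The registry, `register_brng`, and `make_stream` are hypothetical names used only for illustration; they are not part of numpy.random or MKL. The standard library's `random.Random` (an MT19937 implementation in CPython) stands in for the default, and a toy linear congruential generator stands in for an alternative algorithm.

```python
import random

# Hypothetical registry mapping an algorithm name to a factory taking a seed.
_BRNG_REGISTRY = {}


def register_brng(name, factory):
    """Register a basic generator under a name (hypothetical API)."""
    _BRNG_REGISTRY[name] = factory


class ToyLCG:
    """Toy 32-bit linear congruential generator, for illustration only."""

    def __init__(self, seed):
        self.state = seed & 0xFFFFFFFF

    def random(self):
        # Numerical Recipes LCG constants; not suitable for real use.
        self.state = (1664525 * self.state + 1013904223) & 0xFFFFFFFF
        return self.state / 2**32


def make_stream(seed, brng="MT19937"):
    """Pick the underlying algorithm by name at run time."""
    try:
        factory = _BRNG_REGISTRY[brng]
    except KeyError:
        known = sorted(_BRNG_REGISTRY)
        raise ValueError(f"unknown generator {brng!r}; known: {known}")
    return factory(seed)


# CPython's random.Random is an MT19937 implementation, so it is the default.
register_brng("MT19937", random.Random)
register_brng("ToyLCG", ToyLCG)

# Different algorithms could be chosen per thread or per task at run time.
streams = [make_stream(1234), make_stream(1234, brng="ToyLCG")]
values = [s.random() for s in streams]
```

Keeping the default name fixed at "MT19937" is what would preserve existing seeded results, while new algorithms only need a call to the registration function.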

The framework should allow randomness to be defined as a bit stream, a stream of fixed-size integers, or a stream of uniform reals (32 or 64 bits). This is a lot like MKL's abstract method for basic pseudo-random number generation.

https://software.intel.com/en-us/node/590373

Each method should provide routines to sample from uniform distributions over reals (in floats and doubles), as well as over integers.

All remaining non-uniform distributions build on top of these uniform streams.
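
The layering described above can be sketched roughly as follows. `BitStream` and `MT19937Stream` are hypothetical classes, not MKL's or numpy's interface: a concrete generator supplies only raw 32-bit words, and integers, uniform doubles, and one non-uniform distribution (exponential, by inversion) are built on top.

```python
import math
import random


class BitStream:
    """Hypothetical base class: a generator only supplies raw 32-bit words."""

    def raw32(self, n):
        raise NotImplementedError

    def integers(self, n, high):
        """n integers uniform on [0, high), high <= 2**32, by rejection."""
        limit = (2**32 // high) * high  # largest multiple of high that fits
        out = []
        while len(out) < n:
            words = self.raw32(n - len(out))
            out.extend(w % high for w in words if w < limit)
        return out

    def uniform_doubles(self, n):
        """n doubles uniform on [0, 1), each built from 53 random bits."""
        w = self.raw32(2 * n)
        return [((w[2 * i] >> 5) * 2**26 + (w[2 * i + 1] >> 6)) / 2**53
                for i in range(n)]

    def exponential(self, n, scale=1.0):
        """Non-uniform variates layered on the uniform stream (inversion)."""
        return [-scale * math.log1p(-u) for u in self.uniform_doubles(n)]


class MT19937Stream(BitStream):
    """Concrete stream backed by the standard library's Mersenne Twister."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def raw32(self, n):
        return [self._rng.getrandbits(32) for _ in range(n)]
```

With this split, adding another basic generator means implementing `raw32` only; every distribution written against the base class works with it unchanged.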

I think it is pretty important to refactor numpy.random so that the underlying generators can produce a given number of independent variates at a time. There could be convenience wrapper functions that return one variate for backwards compatibility, but this change in design would allow for better efficiency, as sampling a vector of random variates at once is often faster than repeatedly sampling one at a time, due to set-up cost, vectorization, etc.
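
As an illustration of that design, here is a small sketch in which the batch call is the primitive and the single-variate call is a thin wrapper around it. `VectorGenerator` and its method names are hypothetical. In this toy version both paths consume the same underlying stream; a real vectorized backend would not necessarily give that guarantee.

```python
import random


class VectorGenerator:
    """Hypothetical generator whose primitive operation fills a whole batch."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def uniform(self, n):
        # Any per-call set-up cost is paid once per batch rather than once
        # per variate; a native backend could also vectorize this loop.
        draw = self._rng.random
        return [draw() for _ in range(n)]

    def uniform_one(self):
        # Convenience wrapper kept for callers that want a single variate.
        return self.uniform(1)[0]


batch = VectorGenerator(2016).uniform(5)
gen = VectorGenerator(2016)
one_by_one = [gen.uniform_one() for _ in range(5)]
```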

Finally, the methods that sample from a particular distribution should uniformly support a method keyword argument. Because method names vary from distribution to distribution, it should ideally be programmatically discoverable which methods are supported for a given distribution. For instance, the standard normal distribution could support method="Inversion", method="Box-Muller", method="Ziggurat", method="Box-Muller-Marsaglia" (the one used in numpy.random right now), as well as a bunch of unnamed methods based on the transformed rejection method (see http://statistik.wu-wien.ac.at/anuran/ ).
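
A rough sketch of a method keyword with programmatic discovery, using only the Python standard library. The function names and the `_METHODS` table are hypothetical, and only two of the methods named above are implemented here (Inversion and the classic Box-Muller transform).

```python
import math
import random
from statistics import NormalDist


def _normal_inversion(rng, n):
    # Inverse-CDF transform of uniforms; resample the (rare) exact 0.0,
    # where the inverse CDF is undefined.
    inv = NormalDist().inv_cdf
    out = []
    while len(out) < n:
        u = rng.random()
        if u > 0.0:
            out.append(inv(u))
    return out


def _normal_box_muller(rng, n):
    # Classic Box-Muller: each pair of uniforms yields a pair of normals.
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()  # in (0, 1], keeps log() finite
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        out.append(r * math.cos(2.0 * math.pi * u2))
        out.append(r * math.sin(2.0 * math.pi * u2))
    return out[:n]


# Hypothetical per-distribution table of named sampling methods.
_METHODS = {
    "standard_normal": {
        "Inversion": _normal_inversion,
        "Box-Muller": _normal_box_muller,
    },
}


def supported_methods(distribution):
    """Programmatic discovery of the methods a distribution accepts."""
    return sorted(_METHODS[distribution])


def standard_normal(rng, n, method="Box-Muller"):
    try:
        sampler = _METHODS["standard_normal"][method]
    except KeyError:
        known = supported_methods("standard_normal")
        raise ValueError(f"method must be one of {known}")
    return sampler(rng, n)
```

A caller could then write `standard_normal(random.Random(0), 10, method="Inversion")` and use `supported_methods("standard_normal")` to list the valid names.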

It would also be good if one could dynamically register a new method to sample from a non-uniform distribution. This would make it possible, for instance, to automatically add methods that sample certain non-uniform distributions by calling directly into MKL (or another library) when available, instead of building them from uniforms (which may remain a fall-through method).

The linked project is a good start, but the choice of the underlying algorithm needs to be made at run time, as far as I understood, and the only provided interface to query random variates is one at a time, just as is currently the case in numpy.random.

Oleksandr

Robert Kern

2016-07-15 02:14:53 UTC

On Fri, Jul 15, 2016 at 2:53 AM, Pavlyk, Oleksandr <

Post by Pavlyk, Oleksandr
Hi Robert,
Thank you for the pointers.
I think numpy.random should have a mechanism to choose between methods

for generating the underlying randomness dynamically, at a run-time, as
well as an extensible framework, where developers could add more methods.
The default would be MT19937 for backwards compatibility. It is important
to be able to do this at a run-time, as it would allow one to use different
algorithms in different threads (like different members of the parallel
Mersenne twister family of generators, see MT2203).

Post by Pavlyk, Oleksandr
The framework should allow to define randomness as a bit stream, a stream

of fixed size integers, or a stream of uniform reals (32 or 64 bits). This
is a lot like MKL's abstract method for basic pseudo-random number
generation.

Post by Pavlyk, Oleksandr
Each method should provide routines to sample from uniform distributions

over reals (in floats and doubles), as well as over integers.

Post by Pavlyk, Oleksandr
All remaining non-uniform distributions build on top of these uniform streams.

ng-numpy-randomstate does all of these.

Post by Pavlyk, Oleksandr
I think it is pretty important to refactor numpy.random to allow the

underlying generators to produce a given number of independent variates at
a time. There could be convenience wrapper functions to allow to get one
variate for backwards compatibility, but this change in design would allow
for better efficiency, as sampling a vector of random variates at once is
often faster than repeated sampling of one at a time due to set-up cost,
vectorization, etc.

The underlying C implementation is an implementation detail, so the
refactoring that you suggest has no backwards compatibility constraints.

Post by Pavlyk, Oleksandr
Finally, methods to sample particular distribution should uniformly

support method keyword argument. Because method names vary from
distribution to distribution, it should ideally be programmatically
discoverable which methods are supported for a given distribution. For
instance, the standard normal distribution could support
method="Inversion", method="Box-Muller", method="Ziggurat",
method="Box-Muller-Marsaglia" (the one used in numpy.random right now), as
well as bunch of non-named methods based on transformed rejection method
(see http://statistik.wu-wien.ac.at/anuran/ )

That is one of the items under discussion. I personally prefer that one
simply exposes named methods for each different scheme (e.g.
ziggurat_normal(), etc.).

Post by Pavlyk, Oleksandr
It would also be good if one could dynamically register a new method to

sample from a non-uniform distribution. This would allow, for instance, to
automatically add methods to sample certain non-uniform distribution by
directly calling into MKL (or other library), when available, instead of
building them from uniforms (which may remain a fall-through method).

Post by Pavlyk, Oleksandr
The linked project is a good start, but the choice of the underlying

algorithm needs to be made at a run-time,

That's what happens. You instantiate the RandomState class that you want.

Post by Pavlyk, Oleksandr
as far as I understood, and the only provided interface to query random

variates is one at a time, just like it is currently the case

Post by Pavlyk, Oleksandr
in numpy.random.

--
Robert Kern