Discussion:
[Numpy-discussion] Proposal: numpy.random.random_seed
Stephan Hoyer
2016-05-17 03:54:29 UTC
Permalink
I have recently encountered several use cases for randomly generating
random number seeds:

1. When writing a library of stochastic functions that take a seed as an
input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].

2. When a library needs to produce results that are reproducible after
calling numpy.random.seed, but does not want to use the functions in
numpy.random directly. This came up recently in a pandas pull request [2],
because we want to allow using RandomState objects as an alternative to
global state in numpy.random. A major advantage of this approach is that it
provides an obvious alternative to reusing the private numpy.random._mtrand
[3].

The implementation of this function (and the corresponding method on
RandomState) is almost trivial, and I've already written such a utility for
my code:

def random_seed():
    # numpy.random uses uint32 seeds
    return np.random.randint(2 ** 32)

The advantage of adding a new method is that it avoids the need for
explanation by making the intent of code using this pattern obvious. So I
think it is a good candidate for inclusion in numpy.random.

Any opinions?

[1]
https://github.com/dask/dask/blob/e0b246221957c4bd618e57246f3a7ccc8863c494/dask/utils.py#L336
[2] https://github.com/pydata/pandas/pull/13161
[3] On a side note, if there's no longer a good reason to keep this object
private, perhaps we should expose it in our public API. It would certainly
be useful -- scikit-learn is already using it (see links in the pandas PR
above).
Stephan Hoyer
2016-05-17 04:32:35 UTC
Permalink
Looking at the dask helper function again reminds me of an important caveat
to this approach, which was pointed out to me by Clark Fitzgerald.

If you generate a moderately large number of random seeds in this fashion,
you are quite likely to have collisions due to the Birthday Paradox. For
example, you have a 50% chance of encountering at least one collision if
you generate only 77,000 seeds:
https://en.wikipedia.org/wiki/Birthday_attack
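As a sanity check, the birthday approximation can be evaluated directly (a
sketch; the function name is mine, and the 77,000 figure follows from 32-bit
seeds):

```python
import math

def collision_probability(n_seeds, seed_space=2 ** 32):
    # Birthday approximation: P(collision) ~= 1 - exp(-n^2 / (2 * space))
    return 1 - math.exp(-n_seeds ** 2 / (2 * seed_space))

print(collision_probability(77_000))  # roughly 0.50
```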

The docstring for this function should document this limitation of the
approach, which is still appropriate for a small number of seeds. Our
implementation can also encourage creating these seeds in a single
vectorized call to random_seed, which can significantly reduce the
likelihood of collisions among the seeds generated in that call, with
something like the following:

def random_seed(size):
    base = np.random.randint(2 ** 32)
    offset = np.arange(size)
    return (base + offset) % (2 ** 32)

In principle, I believe this could generate the full 2 ** 32 unique seeds
without any collisions.
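A quick check of the sketch above: any single call produces distinct seeds,
because the offsets are consecutive (the assertion at the end is my check,
not part of the proposal):

```python
import numpy as np

def random_seed(size):
    # One random base plus consecutive offsets: seeds within a single
    # call cannot collide (for size <= 2 ** 32).
    base = np.random.randint(2 ** 32)
    offset = np.arange(size)
    return (base + offset) % (2 ** 32)

seeds = random_seed(10_000)
assert len(np.unique(seeds)) == 10_000
```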

Cryptography experts, please speak up if I'm mistaken here.
Robert Kern
2016-05-17 07:18:52 UTC
Permalink
Post by Stephan Hoyer
I have recently encountered several use cases for randomly generating
random number seeds:
1. When writing a library of stochastic functions that take a seed as an
input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].

Can you clarify the use case here? I don't really know what you are doing
here, but I'm pretty sure this is not the right approach.
Post by Stephan Hoyer
2. When a library needs to produce results that are reproducible after
calling numpy.random.seed, but does not want to use the functions in
numpy.random directly. This came up recently in a pandas pull request [2],
because we want to allow using RandomState objects as an alternative to
global state in numpy.random. A major advantage of this approach is that it
provides an obvious alternative to reusing the private numpy.random._mtrand
[3].

It's only pseudo-private. This is an authorized use of it.

However, for this case, I usually just pass around the numpy.random
module itself and let duck-typing take care of the rest.
Post by Stephan Hoyer
[3] On a side note, if there's no longer a good reason to keep this
object private, perhaps we should expose it in our public API. It would
certainly be useful -- scikit-learn is already using it (see links in the
pandas PR above).

Adding a public get_global_random_state() function might be in order.
Originally, I wanted there to be *some* barrier to entry, but just grabbing
it to use as a default RandomState object is definitely an intended use of
it. It's not going to disappear.

--
Robert Kern
Stephan Hoyer
2016-05-17 08:09:28 UTC
Permalink
Post by Robert Kern
Post by Stephan Hoyer
1. When writing a library of stochastic functions that take a seed as an
input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].
Can you clarify the use case here? I don't really know what you are doing
here, but I'm pretty sure this is not the right approach.
Here's a contrived example. Suppose I've written a simulator for cars that
consists of a number of loosely connected components (e.g., an engine,
brakes, etc.). The behavior of each component of our simulator is
stochastic, but we want everything to be fully reproducible, so we need to
use seeds or RandomState objects.

We might write our simulate_car function like the following:

def simulate_car(engine_config, brakes_config, seed=None):
    rs = np.random.RandomState(seed)
    engine = simulate_engine(engine_config, seed=rs.random_seed())
    brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
    ...

The problem with passing the same RandomState object (either explicitly or
dropping the seed argument entirely and using the global state) to both
simulate_engine and simulate_brakes is that it breaks encapsulation -- if I
change what I do inside simulate_engine, it also affects the brakes.

The dask use case is actually pretty different -- the intent is to create
many random numbers in parallel using multiple threads or processes
(possibly in a distributed fashion). I know that skipping ahead is the
standard way to get independent number streams for parallel sampling, but
that isn't exposed in numpy.random, and setting distinct seeds seems like a
reasonable alternative for scientific computing use cases.
Post by Robert Kern
It's only pseudo-private. This is an authorized use of it.
However, for this case, I usually just pass around the numpy.random
module itself and let duck-typing take care of the rest.
I like the duck-typing approach. That's very elegant.

If this is an authorized use of the global RandomState object, let's
document it! Otherwise cautious library maintainers like myself will
discourage using it :).
Post by Robert Kern
Post by Stephan Hoyer
[3] On a side note, if there's no longer a good reason to keep this
object private, perhaps we should expose it in our public API. It would
certainly be useful -- scikit-learn is already using it (see links in the
pandas PR above).
Adding a public get_global_random_state() function might be in order.
Yes, possibly.
Robert Kern
2016-05-17 08:49:45 UTC
Permalink
Post by Stephan Hoyer
Post by Robert Kern
Post by Stephan Hoyer
1. When writing a library of stochastic functions that take a seed as
an input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].
Can you clarify the use case here? I don't really know what you are
doing here, but I'm pretty sure this is not the right approach.
Here's a contrived example. Suppose I've written a simulator for cars
that consists of a number of loosely connected components (e.g., an engine,
brakes, etc.). The behavior of each component of our simulator is
stochastic, but we want everything to be fully reproducible, so we need to
use seeds or RandomState objects.
def simulate_car(engine_config, brakes_config, seed=None):
    rs = np.random.RandomState(seed)
    engine = simulate_engine(engine_config, seed=rs.random_seed())
    brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
    ...
The problem with passing the same RandomState object (either explicitly
or dropping the seed argument entirely and using the global state) to both
simulate_engine and simulate_brakes is that it breaks encapsulation -- if I
change what I do inside simulate_engine, it also affects the brakes.

That's a little too contrived, IMO. In most such simulations, the different
components interact with each other in the normal course of the simulation;
that's why they are both joined together in the same simulation instead of
being two separate runs. Unless the components are being run across a
process or thread boundary (a la dask below) where true nondeterminism
comes into play, then I don't think you want these semi-independent
streams. This seems to be the advice du jour from the agent-based modeling
community.
Post by Stephan Hoyer
The dask use case is actually pretty different -- the intent is to create
many random numbers in parallel using multiple threads or processes
(possibly in a distributed fashion). I know that skipping ahead is the
standard way to get independent number streams for parallel sampling, but
that isn't exposed in numpy.random, and setting distinct seeds seems like a
reasonable alternative for scientific computing use cases.

Forget about integer seeds. Those are for human convenience. If you're not
jotting them down in your lab notebook in pen, you don't want an integer
seed.

What you want is a function that returns many RandomState objects that are
hopefully spread around the MT19937 space enough that they are essentially
independent (in the absence of true jumpahead). The better implementation
of such a function would look something like this:

def spread_out_prngs(n, root_prng=None):
    if root_prng is None:
        root_prng = np.random
    elif not isinstance(root_prng, np.random.RandomState):
        root_prng = np.random.RandomState(root_prng)
    sprouted_prngs = []
    for i in range(n):
        seed_array = root_prng.randint(1 << 32, size=624)  # dtype=np.uint32 under 1.11
        sprouted_prngs.append(np.random.RandomState(seed_array))
    return sprouted_prngs

Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will
at least make the chance of collision tiny. And it can be easily rewritten
to take advantage of one of the newer PRNGs that have true independent
streams:

https://github.com/bashtage/ng-numpy-randomstate
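For illustration, here is how the sketch above might be used (the function is
restated so the snippet is self-contained, with the 1.11 dtype note dropped;
the usage lines at the bottom are mine):

```python
import numpy as np

def spread_out_prngs(n, root_prng=None):
    if root_prng is None:
        root_prng = np.random
    elif not isinstance(root_prng, np.random.RandomState):
        root_prng = np.random.RandomState(root_prng)
    sprouted_prngs = []
    for i in range(n):
        # 624-word seed arrays reach much more of the MT19937 state space
        # than a single 32-bit integer seed.
        seed_array = root_prng.randint(2 ** 32, size=624)
        sprouted_prngs.append(np.random.RandomState(seed_array))
    return sprouted_prngs

# Reproducible: the same root seed sprouts the same child streams.
streams = spread_out_prngs(3, root_prng=12345)
draws = [prng.standard_normal(4) for prng in streams]
```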

--
Robert Kern
j***@gmail.com
2016-05-17 12:34:26 UTC
Permalink
Post by Robert Kern
Post by Stephan Hoyer
Post by Robert Kern
Post by Stephan Hoyer
1. When writing a library of stochastic functions that take a seed as
an input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].
Can you clarify the use case here? I don't really know what you are
doing here, but I'm pretty sure this is not the right approach.
Here's a contrived example. Suppose I've written a simulator for cars
that consists of a number of loosely connected components (e.g., an engine,
brakes, etc.). The behavior of each component of our simulator is
stochastic, but we want everything to be fully reproducible, so we need to
use seeds or RandomState objects.
def simulate_car(engine_config, brakes_config, seed=None):
    rs = np.random.RandomState(seed)
    engine = simulate_engine(engine_config, seed=rs.random_seed())
    brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
    ...
The problem with passing the same RandomState object (either explicitly
or dropping the seed argument entirely and using the global state) to both
simulate_engine and simulate_brakes is that it breaks encapsulation -- if I
change what I do inside simulate_engine, it also affects the brakes.
That's a little too contrived, IMO. In most such simulations, the
different components interact with each other in the normal course of the
simulation; that's why they are both joined together in the same simulation
instead of being two separate runs. Unless the components are being run
across a process or thread boundary (a la dask below) where true
nondeterminism comes into play, then I don't think you want these
semi-independent streams. This seems to be the advice du jour from the
agent-based modeling community.
A similar use case, where I had to switch to using several RandomStates:

In a Monte Carlo experiment with increasing sample size, I want two random
variables, x and y, to have the same draws in the common initial
observations.

If I draw x and y sequentially and then increase the number of
observations for the simulation, it completely changes the draws for the
second variable if they use a common RandomState.

With separate random states, increasing from 1000 to 1200 observations
leaves the first 1000 draws unchanged.
(This reduces the Monte Carlo noise for example when calculating the power
of a hypothesis test as function of the sample size.)
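A minimal sketch of that effect (seeds, sizes, and names here are arbitrary):
with a separate RandomState per variable, growing the sample leaves the
common initial draws unchanged.

```python
import numpy as np

def draw(n):
    # One RandomState per variable, each with its own fixed seed.
    rs_x, rs_y = np.random.RandomState(1), np.random.RandomState(2)
    return rs_x.standard_normal(n), rs_y.standard_normal(n)

x_small, y_small = draw(1000)
x_big, y_big = draw(1200)

# The first 1000 draws of each variable are unchanged by the larger run.
assert np.array_equal(x_small, x_big[:1000])
assert np.array_equal(y_small, y_big[:1000])
```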

Josef
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Nathaniel Smith
2016-05-17 17:24:05 UTC
Permalink
On May 17, 2016 1:50 AM, "Robert Kern" <***@gmail.com> wrote:
[...]
Post by Robert Kern
What you want is a function that returns many RandomState objects that
are hopefully spread around the MT19937 space enough that they are
essentially independent (in the absence of true jumpahead). The better
implementation of such a function would look something like this:
[...]
Maybe a nice way to encapsulate this in the RandomState interface would be
a method RandomState.random_state() that generates and returns a new child
RandomState.
Post by Robert Kern
Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will
at least make the chance of collision tiny. And it can be easily rewritten
to take advantage of one of the newer PRNGs that have true independent
streams:
https://github.com/bashtage/ng-numpy-randomstate
... But unfortunately I'm not sure how to make my interface suggestion
above work on top of one of these RNGs, because for
RandomState.random_state you really want a tree of independent RNGs and the
fancy new PRNGs only provide a single flat namespace :-/. And even more
annoyingly, the tree API is actually a nicer API, because with a flat
namespace you have to know up front about all possible RNGs your code will
use, which is an unfortunate global coupling that makes it difficult to
compose programs out of independent pieces, while the
RandomState.random_state approach composes beautifully. Maybe there's some
clever way to allocate a 64-bit namespace to make it look tree-like? I'm
not sure 64 bits is really enough...

-n
Robert Kern
2016-05-17 17:41:31 UTC
Permalink
Post by Nathaniel Smith
[...]
Post by Robert Kern
What you want is a function that returns many RandomState objects that
are hopefully spread around the MT19937 space enough that they are
essentially independent (in the absence of true jumpahead). The better
implementation of such a function would look something like this:
[...]
Post by Nathaniel Smith
Maybe a nice way to encapsulate this in the RandomState interface would
be a method RandomState.random_state() that generates and returns a new
child RandomState.

I disagree. This is a workaround in the absence of proper jumpahead or
guaranteed-independent streams. I would not encourage it.
Post by Nathaniel Smith
Post by Robert Kern
Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will
at least make the chance of collision tiny. And it can be easily rewritten
to take advantage of one of the newer PRNGs that have true independent
streams:
https://github.com/bashtage/ng-numpy-randomstate
Post by Nathaniel Smith
... But unfortunately I'm not sure how to make my interface suggestion
above work on top of one of these RNGs, because for
RandomState.random_state you really want a tree of independent RNGs and the
fancy new PRNGs only provide a single flat namespace :-/. And even more
annoyingly, the tree API is actually a nicer API, because with a flat
namespace you have to know up front about all possible RNGs your code will
use, which is an unfortunate global coupling that makes it difficult to
compose programs out of independent pieces, while the
RandomState.random_state approach composes beautifully. Maybe there's some
clever way to allocate a 64-bit namespace to make it look tree-like? I'm
not sure 64 bits is really enough...

MT19937 doesn't have a "tree" any more than the others. It's the same flat
state space. You are just getting the illusion of a tree by hoping that you
never collide. You ought to think about precisely the same global coupling
issues with MT19937 as you do with guaranteed-independent streams.
Hope-and-prayer isn't really a substitute for properly engineering your
problem. It's just a moral hazard to promote this method to the main API.

--
Robert Kern
Nathaniel Smith
2016-05-18 00:14:33 UTC
Permalink
Post by Robert Kern
Post by Nathaniel Smith
[...]
Post by Robert Kern
What you want is a function that returns many RandomState objects that
are hopefully spread around the MT19937 space enough that they are
essentially independent (in the absence of true jumpahead). The better
implementation of such a function would look something like this:
[...]
Post by Nathaniel Smith
Maybe a nice way to encapsulate this in the RandomState interface would be
a method RandomState.random_state() that generates and returns a new child
RandomState.
I disagree. This is a workaround in the absence of proper jumpahead or
guaranteed-independent streams. I would not encourage it.
Post by Nathaniel Smith
Post by Robert Kern
Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will at
least make the chance of collision tiny. And it can be easily rewritten to
take advantage of one of the newer PRNGs that have true independent streams:
https://github.com/bashtage/ng-numpy-randomstate
... But unfortunately I'm not sure how to make my interface suggestion
above work on top of one of these RNGs, because for RandomState.random_state
you really want a tree of independent RNGs and the fancy new PRNGs only
provide a single flat namespace :-/. And even more annoyingly, the tree API
is actually a nicer API, because with a flat namespace you have to know up
front about all possible RNGs your code will use, which is an unfortunate
global coupling that makes it difficult to compose programs out of
independent pieces, while the RandomState.random_state approach composes
beautifully. Maybe there's some clever way to allocate a 64-bit namespace to
make it look tree-like? I'm not sure 64 bits is really enough...
MT19937 doesn't have a "tree" any more than the others. It's the same flat
state space. You are just getting the illusion of a tree by hoping that you
never collide. You ought to think about precisely the same global coupling
issues with MT19937 as you do with guaranteed-independent streams.
Hope-and-prayer isn't really a substitute for properly engineering your
problem. It's just a moral hazard to promote this method to the main API.
Nonsense.

If your definition of "hope and prayer" includes assuming that we
won't encounter a random collision in a 2**19937 state space, then
literally all engineering is hope-and-prayer. A collision could
happen, but if it does it's overwhelmingly more likely to happen
because of a flaw in the mathematical analysis, or a bug in the
implementation, or because random quantum fluctuations caused you and
your program to suddenly be transported to a parallel world where 1 +
1 = 1, than that you just got unlucky with your random state. And all
of these hazards apply equally to both MT19937 and more modern PRNGs.

...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API, so whether or not it
turns out to be possible I think we should at least be allowed to have
a discussion about whether there's some way to give it to them. It's
not even 100% out of the question that we conclude that existing PRNGs
are buggy because they don't take this use case into account -- it
would be far from the first time that numpy found itself going beyond
the limits of older numerical tools that weren't designed to build the
kind of large composable systems that numpy gets used for.

MT19937's state space is large enough that you could explicitly encode
a "tree seed" into it, even if you don't trust the laws of probability
-- e.g., you start with a RandomState with id [], then its children
have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
1], ..., [1, 0], ..., and you write these into the state (probably
after sending them through some bit-mixing permutation), to guarantee
non-collision.
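One hypothetical way to realize that tree-id scheme with today's RandomState
is to hash the path into a full 624-word seed array. Everything below is my
sketch: the function name is made up, and the SHA-256 expansion is a
stand-in for the "bit-mixing permutation", not a vetted construction.

```python
import hashlib
import numpy as np

def tree_seed(path):
    # Expand a tree id like (0, 2, 1) into 624 well-mixed uint32 words;
    # distinct paths hash to distinct seed arrays (up to SHA-256 collisions).
    data = repr(tuple(path)).encode()
    words = []
    counter = 0
    while len(words) < 624:
        digest = hashlib.sha256(data + counter.to_bytes(4, "little")).digest()
        words.extend(np.frombuffer(digest, dtype=np.uint32))
        counter += 1
    return np.array(words[:624], dtype=np.uint32)

child = np.random.RandomState(tree_seed([0, 1]))
```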

Or if you do trust the laws of probability, then the
randomly-sample-a-PRNG approach is not 100% out of the question even
for more modern PRNGs. For example, let's take a PRNG with a 64-bit
stream id and a 64-bit state space per stream. Suppose that we know
that in our application each PRNG will be used to draw 2**48 64-bit
samples (= 2 pebibytes of random data), and that we will use 2**32
PRNGs (= total of 8 yobibytes of random data). If we randomly
initialize each PRNG with a 128-bit (stream id, state) pair, then the
chance that two of our streams will overlap is about the chance of
getting a collision in an 80-bit state space (128 - 48) on 2**32
draws, which can be approximated[1] as

1 - exp(-(2**32)**2 / (2 * 2**80))
= 1 - exp(-2**64 / 2**81)
= 1 - exp(-1/2**17)
≈ 7.6 chances in a million
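That arithmetic checks out when evaluated directly (a sketch of the same
approximation):

```python
import math

# P(overlap) for 2**32 streams in an effective 80-bit (128 - 48) space:
p = 1 - math.exp(-((2 ** 32) ** 2) / (2 * 2 ** 80))
print(p)  # roughly 7.6e-06
```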

To put this number in some kind of context, Sandia says that on 128
pebibyte clusters (which I assume is the kind of system you use for a
simulation that needs multiple yobibytes of random data!) they see a
double-bit (!) DRAM fault ~every 4 minutes [2]. Also, by the time
we've drawn 2**48 samples from a single 64-bit stream, we've
already violated independence, because in 2**48 draws you should see a
repeated value with probability ~1.0 (by the same birthday
approximation as above), but we see them with probability 0.

Alternatively, for a PRNG with a 256 bit state space, or 128-bit
stream id + 128-bit state space, then one can draw 2**64 64-bit
samples without violating independence, *and* do this for, say, 2**70
randomly initialized streams, and the probability of collision would
still be on the order of 10**-16.

Unless I totally messed up these calculations, I'd conclude that
almost everyone would be totally fine with a .random_state() method
combined with any reasonable generator, and for some reasonable
generators (those with more than, like, ~192 bits of state?) it's just
unconditionally safe for all physically realistic problem sizes.

-n

[1] https://en.wikipedia.org/wiki/Birthday_problem#Approximations
[2] http://www.fiala.me/pubs/papers/sc12-redmpi.pdf
--
Nathaniel J. Smith -- https://vorpus.org
Robert Kern
2016-05-18 12:07:30 UTC
Permalink
Post by Nathaniel Smith
Post by Robert Kern
Post by Nathaniel Smith
[...]
Post by Robert Kern
What you want is a function that returns many RandomState objects that
are hopefully spread around the MT19937 space enough that they are
essentially independent (in the absence of true jumpahead). The better
implementation of such a function would look something like this:
[...]
Post by Nathaniel Smith
Maybe a nice way to encapsulate this in the RandomState interface would be
a method RandomState.random_state() that generates and returns a new child
RandomState.
I disagree. This is a workaround in the absence of proper jumpahead or
guaranteed-independent streams. I would not encourage it.
Post by Nathaniel Smith
Post by Robert Kern
Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will at
least make the chance of collision tiny. And it can be easily rewritten to
take advantage of one of the newer PRNGs that have true independent streams:
https://github.com/bashtage/ng-numpy-randomstate
Post by Nathaniel Smith
... But unfortunately I'm not sure how to make my interface suggestion
above work on top of one of these RNGs, because for RandomState.random_state
you really want a tree of independent RNGs and the fancy new PRNGs only
provide a single flat namespace :-/. And even more annoyingly, the tree API
is actually a nicer API, because with a flat namespace you have to know up
front about all possible RNGs your code will use, which is an unfortunate
global coupling that makes it difficult to compose programs out of
independent pieces, while the RandomState.random_state approach composes
beautifully. Maybe there's some clever way to allocate a 64-bit namespace to
make it look tree-like? I'm not sure 64 bits is really enough...
MT19937 doesn't have a "tree" any more than the others. It's the same flat
state space. You are just getting the illusion of a tree by hoping that you
never collide. You ought to think about precisely the same global coupling
issues with MT19937 as you do with guaranteed-independent streams.
Hope-and-prayer isn't really a substitute for properly engineering your
problem. It's just a moral hazard to promote this method to the main API.
Nonsense.
If your definition of "hope and prayer" includes assuming that we
won't encounter a random collision in a 2**19937 state space, then
literally all engineering is hope-and-prayer. A collision could
happen, but if it does it's overwhelmingly more likely to happen
because of a flaw in the mathematical analysis, or a bug in the
implementation, or because random quantum fluctuations caused you and
your program to suddenly be transported to a parallel world where 1 +
1 = 1, than that you just got unlucky with your random state. And all
of these hazards apply equally to both MT19937 and more modern PRNGs.
Granted.
Post by Nathaniel Smith
...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API,
I remain unconvinced on this mark. Grumpily.
Post by Nathaniel Smith
so whether or not it
turns out to be possible I think we should at least be allowed to have
a discussion about whether there's some way to give it to them.
I'm not shutting down discussion of the option. I *implemented* the option.
I think that discussing whether it should be part of the main API is
premature. There probably ought to be a paper or three out there supporting
its safety and utility first. Let the utility function version flourish
first.
Post by Nathaniel Smith
It's
not even 100% out of the question that we conclude that existing PRNGs
are buggy because they don't take this use case into account -- it
would be far from the first time that numpy found itself going beyond
the limits of older numerical tools that weren't designed to build the
kind of large composable systems that numpy gets used for.
MT19937's state space is large enough that you could explicitly encode
a "tree seed" into it, even if you don't trust the laws of probability
-- e.g., you start with a RandomState with id [], then its children
have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
1], ..., [1, 0], ..., and you write these into the state (probably
after sending them through some bit-mixing permutation), to guarantee
non-collision.
Sure. Not entirely sure if that can be done without preallocating the
branching factor or depth, but I'm sure there's some fancy combinatoric
representation of an unbounded tree that could be exploited here. It seems
likely to me that such a thing could be done with the stream IDs of PRNGs
that support that.
Post by Nathaniel Smith
Or if you do trust the laws of probability, then the
randomly-sample-a-PRNG approach is not 100% out of the question even
for more modern PRNGs.
Tread carefully here. PRNGs are not RNGs. Using the root PRNG to generate N
new seeds for the same PRNG does not necessarily generate N good,
uncorrelated streams with high probability. Overlap is not the only factor
in play. Long-term correlations happen even in the case where the N streams
do not overlap, and the structure of the root PRNG could well generate N
such correlated seeds.

http://www.adv-radio-sci.net/12/75/2014/

See section 4.4. It does look like one can get good MT19937 streams with
sequential integer seeds, when expanded by the standard sub-PRNG that we
use for that purpose (yay!). However, if you use a different (but otherwise
good-quality and general purpose) sub-PRNG to expand the key, you get
increasing numbers of failures as you increase the number of parallel
streams. It is not explored in the paper exactly why this is the case, but
I suspect that the similarity of the second sub-PRNG algorithm to that of
the MT itself is part of the problem.

It is *likely* that the scheme I implemented will be okay (our array-seed
expansion algorithm is similar to the integer-seed expansion and *probably*
insulates the sprouted MTs from their MT generating source), but that
remains to be demonstrated.

--
Robert Kern
Chris Barker
2016-05-18 15:50:25 UTC
Permalink
Post by Nathaniel Smith
...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API,
Honestly, I am lost in the math -- but like any good engineer, I want to
accomplish something anyway :-) I trust you guys to get this right -- or at
least document what's "wrong" with it.

But, if I'm reading the use case that started all this correctly, it
closely matches my use-case. That is, I have a complex model with multiple
independent "random" processes. And we want to be able to reproduce
simulations EXACTLY -- our users get confused when the results are
"different" even if in a statistically insignificant way.

At the moment we are using one RNG, with one seed for everything. So we get
reproducible results, but if one thing is changed, then the entire
simulation is different -- which is OK, but it would be nicer to have each
process using its own RNG stream with its own seed. However, it matters
not one whit if those seeds are independent -- the processes are different,
you'd never notice if they were using the same PRN stream -- because they
are used differently. So a "fairly low probability of a clash" would be
totally fine.

Granted, in a Monte Carlo simulation, it could be disastrous... :-)

I guess the point is -- do something reasonable, and document its
limitations, and we're all fine :-)

And thanks for giving your attention to this.

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Robert Kern
2016-05-18 16:01:44 UTC
Permalink
Post by Chris Barker
Post by Nathaniel Smith
...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API,
Honestly, I am lost in the math -- but like any good engineer, I want to
accomplish something anyway :-) I trust you guys to get this right -- or at
least document what's "wrong" with it.
Post by Chris Barker
But, if I'm reading the use case that started all this correctly, it
closely matches my use-case. That is, I have a complex model with multiple
independent "random" processes. And we want to be able to reproduce
simulations EXACTLY -- our users get confused when the results are
"different" even if in a statistically insignificant way.
Post by Chris Barker
At the moment we are using one RNG, with one seed for everything. So we
get reproducible results, but if one thing is changed, then the entire
simulation is different -- which is OK, but it would be nicer to have each
process using its own RNG stream with its own seed. However, it matters
not one whit if those seeds are independent -- the processes are different,
you'd never notice if they were using the same PRN stream -- because they
are used differently. So a "fairly low probability of a clash" would be
totally fine.

Well, the main question is: do you need to be able to spawn dependent
streams at arbitrary points to an arbitrary depth without coordination
between processes? The necessity for multiple independent streams per se is
not contentious.

--
Robert Kern
j***@gmail.com
2016-05-18 17:20:57 UTC
Permalink
Post by Chris Barker
Post by Chris Barker
Post by Nathaniel Smith
...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API,
Honestly, I am lost in the math -- but like any good engineer, I want to
accomplish something anyway :-) I trust you guys to get this right -- or at
least document what's "wrong" with it.
Post by Chris Barker
But, if I'm reading the use case that started all this correctly, it
closely matches my use-case. That is, I have a complex model with multiple
independent "random" processes. And we want to be able to reproduce
simulations EXACTLY -- our users get confused when the results are
"different" even if in a statistically insignificant way.
Post by Chris Barker
At the moment we are using one RNG, with one seed for everything. So we
get reproducible results, but if one thing is changed, then the entire
simulation is different -- which is OK, but it would be nicer to have each
process using its own RNG stream with its own seed. However, it matters
not one whit if those seeds are independent -- the processes are different,
you'd never notice if they were using the same PRN stream -- because they
are used differently. So a "fairly low probability of a clash" would be
totally fine.
Well, the main question is: do you need to be able to spawn dependent
streams at arbitrary points to an arbitrary depth without coordination
between processes? The necessity for multiple independent streams per se is
not contentious.
I'm similar to Chris, and didn't try to figure out the details of what you
are talking about.

However, if there are functions getting into numpy that help in using a
best practice even if it's not bullet proof, then it's still better than
home made approaches.
If it gets in soon, then we can use it in a few years (given dependency
lag). At that point there should be more distributed, nested simulation
based algorithms where we don't know in advance how far we have to go to
get reliable numbers or convergence.

(But I don't see anything like that right now.)

Josef
Post by Chris Barker
--
Robert Kern
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Robert Kern
2016-05-18 17:56:50 UTC
Permalink
Post by j***@gmail.com
Post by Robert Kern
Post by Chris Barker
Post by Nathaniel Smith
...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API,
Honestly, I am lost in the math -- but like any good engineer, I want
to accomplish something anyway :-) I trust you guys to get this right -- or
at least document what's "wrong" with it.
Post by j***@gmail.com
Post by Robert Kern
Post by Chris Barker
But, if I'm reading the use case that started all this correctly, it
closely matches my use-case. That is, I have a complex model with multiple
independent "random" processes. And we want to be able to reproduce
EXACTLY simulations -- our users get confused when the results are
"different" even if in a statistically insignificant way.
Post by j***@gmail.com
Post by Robert Kern
Post by Chris Barker
At the moment we are using one RNG, with one seed for everything. So
we get reproducible results, but if one thing is changed, then the entire
simulation is different -- which is OK, but it would be nicer to have each
process using its own RNG stream with its own seed. However, it matters
not one whit if those seeds are independent -- the processes are different,
you'd never notice if they were using the same PRN stream -- because they
are used differently. So a "fairly low probability of a clash" would be
totally fine.
Post by j***@gmail.com
Post by Robert Kern
Well, the main question is: do you need to be able to spawn dependent
streams at arbitrary points to an arbitrary depth without coordination
between processes? The necessity for multiple independent streams per se is
not contentious.
Post by j***@gmail.com
I'm similar to Chris, and didn't try to figure out the details of what
you are talking about.
Post by j***@gmail.com
However, if there are functions getting into numpy that help in using a
best practice even if it's not bullet proof, then it's still better than
home made approaches.
Post by j***@gmail.com
If it gets in soon, then we can use it in a few years (given dependency
lag). At that point there should be more distributed, nested simulation
based algorithms where we don't know in advance how far we have to go to
get reliable numbers or convergence.
Post by j***@gmail.com
(But I don't see anything like that right now.)
Current best practice is to use PRNGs with settable streams (or fixed
jumpahead for those PRNGs cursed to not have settable streams but blessed
to have super-long periods). The way to get those into numpy is to help
Kevin Sheppard finish:

https://github.com/bashtage/ng-numpy-randomstate

He's done nearly all of the hard work already.

--
Robert Kern
Nathaniel Smith
2016-05-18 18:56:17 UTC
Permalink
Post by Robert Kern
Post by Nathaniel Smith
Post by Robert Kern
Post by Nathaniel Smith
[...]
Post by Robert Kern
What you want is a function that returns many RandomState objects that
are hopefully spread around the MT19937 space enough that they are
essentially independent (in the absence of true jumpahead). The better
way to do that is something like:

def sprout_prngs(n, root_prng=None):
    if root_prng is None:
        root_prng = np.random
    elif not isinstance(root_prng, np.random.RandomState):
        root_prng = np.random.RandomState(root_prng)
    sprouted_prngs = []
    for _ in range(n):
        seed_array = root_prng.randint(1 << 32, size=624)  # dtype=np.uint32 under 1.11
        sprouted_prngs.append(np.random.RandomState(seed_array))
    return sprouted_prngs
Maybe a nice way to encapsulate this in the RandomState interface would be
a method RandomState.random_state() that generates and returns a new child
RandomState.
I disagree. This is a workaround in the absence of proper jumpahead or
guaranteed-independent streams. I would not encourage it.
Post by Nathaniel Smith
Post by Robert Kern
Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will at
least make the chance of collision tiny. And it can be easily rewritten to
https://github.com/bashtage/ng-numpy-randomstate
... But unfortunately I'm not sure how to make my interface suggestion
above work on top of one of these RNGs, because for
RandomState.random_state
you really want a tree of independent RNGs and the fancy new PRNGs only
provide a single flat namespace :-/. And even more annoyingly, the tree API
is actually a nicer API, because with a flat namespace you have to know up
front about all possible RNGs your code will use, which is an unfortunate
global coupling that makes it difficult to compose programs out of
independent pieces, while the RandomState.random_state approach composes
beautifully. Maybe there's some clever way to allocate a 64-bit namespace to
make it look tree-like? I'm not sure 64 bits is really enough...
MT19937 doesn't have a "tree" any more than the others. It's the same flat
state space. You are just getting the illusion of a tree by hoping that you
never collide. You ought to think about precisely the same global coupling
issues with MT19937 as you do with guaranteed-independent streams.
Hope-and-prayer isn't really a substitute for properly engineering your
problem. It's just a moral hazard to promote this method to the main API.
Nonsense.
If your definition of "hope and prayer" includes assuming that we
won't encounter a random collision in a 2**19937 state space, then
literally all engineering is hope-and-prayer. A collision could
happen, but if it does it's overwhelmingly more likely to happen
because of a flaw in the mathematical analysis, or a bug in the
implementation, or because random quantum fluctuations caused you and
your program to suddenly be transported to a parallel world where 1 +
1 = 1, than that you just got unlucky with your random state. And all
of these hazards apply equally to both MT19937 and more modern PRNGs.
Granted.
Post by Nathaniel Smith
...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API,
I remain unconvinced on this mark. Grumpily.
Sorry for getting grumpy :-). The engineering reasons seem pretty
obvious to me though? If you have any use case for independent streams
at all, and you're writing code that's intended to live inside a
library's abstraction barrier, then you need some way to choose your
streams to avoid colliding with arbitrary other code that the end-user
might assemble alongside yours as part of their final program. So
AFAICT you have two options: either you need a "tree-style" API for
allocating these streams, or else you need to add some explicit API to
your library that lets the end-user control in detail which streams
you use. Both are possible, but the latter is obviously undesirable
if you can avoid it, since it breaks the abstraction barrier, making
your library more complicated to use and harder to evolve.
Post by Robert Kern
Post by Nathaniel Smith
so whether or not it
turns out to be possible I think we should at least be allowed to have
a discussion about whether there's some way to give it to them.
I'm not shutting down discussion of the option. I *implemented* the option.
I think that discussing whether it should be part of the main API is
premature. There probably ought to be a paper or three out there supporting
its safety and utility first. Let the utility function version flourish
first.
OK -- I guess this particularly makes sense given how
extra-tightly-constrained we currently are in fixing mistakes in
np.random. But I feel like in the end the right place for this really
is inside the RandomState interface, because the person implementing
RandomState is the one best placed to understand (a) the gnarly
technical details here, and (b) how those change depending on the
particular PRNG in use. I don't want to end up with a bunch of
subtly-buggy utility functions in non-specialist libraries like dask
-- so we should be trying to help downstream users figure out how to
actually get this into np.random?
Post by Robert Kern
Post by Nathaniel Smith
It's
not even 100% out of the question that we conclude that existing PRNGs
are buggy because they don't take this use case into account -- it
would be far from the first time that numpy found itself going beyond
the limits of older numerical tools that weren't designed to build the
kind of large composable systems that numpy gets used for.
MT19937's state space is large enough that you could explicitly encode
a "tree seed" into it, even if you don't trust the laws of probability
-- e.g., you start with a RandomState with id [], then its children
have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
1], ..., [1, 0], ..., and you write these into the state (probably
after sending them through some bit-mixing permutation), to guarantee
non-collision.
Sure. Not entirely sure if that can be done without preallocating the
branching factor or depth, but I'm sure there's some fancy combinatoric
representation of an unbounded tree that could be exploited here. It seems
likely to me that such a thing could be done with the stream IDs of PRNGs
that support that.
I'm pretty sure you do have to preallocate both the branching factor
and depth, since getting the collision-free guarantee requires that
across the universe of possible tree addresses, each state id gets
used at most once -- and there are finitely many state ids. But as a
practical matter, saying "you can only sprout up to 2**32 new states
out of any given state, and you can only nest them 600 deep, and
exceeding these bounds is an error" would still be "composable enough"
for ~all practical purposes.
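As a concrete illustration, the bounded-tree scheme just described might be encoded like this (a sketch only: the function names, the 600-deep/2**32-wide bounds, and the use of SHA-256 as the bit-mixing permutation are illustrative assumptions, not an existing numpy API):

```python
import hashlib

import numpy as np

MAX_DEPTH = 600    # assumed nesting bound from the discussion above
BRANCH_BYTES = 4   # up to 2**32 children sprouted from any node

def tree_state_id(path):
    """Encode a tree address like [0], [3, 1], ... as a unique byte string.

    Fixed-width levels guarantee that distinct paths never encode to the
    same id, which is what rules out collisions by construction.
    """
    if len(path) > MAX_DEPTH:
        raise ValueError("tree too deep for the preallocated encoding")
    return b"".join(int(level).to_bytes(BRANCH_BYTES, "big") for level in path)

def seed_for_path(path):
    """Expand the unique id into a full 624-word MT19937 seed array,
    using SHA-256 in counter mode as the bit-mixing step."""
    root = hashlib.sha256(tree_state_id(path)).digest()
    words = []
    counter = 0
    while len(words) < 624:
        block = hashlib.sha256(root + counter.to_bytes(4, "big")).digest()
        words.extend(int.from_bytes(block[i:i + 4], "big")
                     for i in range(0, len(block), 4))
        counter += 1
    return np.array(words[:624], dtype=np.uint32)

# Children of the root [] are [0], [1], ...; grandchildren [0, 0], etc.
child = np.random.RandomState(seed_for_path([0]))
grandchild = np.random.RandomState(seed_for_path([0, 1]))
```

Strictly, the guarantee-by-construction applies to the ids themselves; pushing them through a hash trades it for a cryptographically negligible collision probability.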
Post by Robert Kern
Post by Nathaniel Smith
Or if you do trust the laws of probability, then the
randomly-sample-a-PRNG approach is not 100% out of the question even
for more modern PRNGs.
Tread carefully here. PRNGs are not RNGs. Using the root PRNG to generate N
new seeds for the same PRNG does not necessarily generate N good,
uncorrelated streams with high probability. Overlap is not the only factor
in play. Long-term correlations happen even in the case where the N streams
do not overlap, and the structure of the root PRNG could well generate N
such correlated seeds.
http://www.adv-radio-sci.net/12/75/2014/
See section 4.4. It does look like one can get good MT19937 streams with
sequential integer seeds, when expanded by the standard sub-PRNG that we use
for that purpose (yay!). However, if you use a different (but otherwise
good-quality and general purpose) sub-PRNG to expand the key, you get
increasing numbers of failures as you increase the number of parallel
streams. It is not explored in the paper exactly why this is the case, but I
suspect that the similarity of the second sub-PRNG algorithm to that of the
MT itself is part of the problem.
It is *likely* that the scheme I implemented will be okay (our array-seed
expansion algorithm is similar to the integer-seed expansion and *probably*
insulates the sprouted MTs from their MT generating source), but that
remains to be demonstrated.
Right... I think the way to think about this is to split it into two
pieces. First, there's the issue that correlated streams exist at all.
That they do for MT is annoying and closely related to why we prefer
other PRNGs these days. When doing the analysis of the collision
probability of randomly sprouted generators, this effectively reduces
the size of the space and increases the risk of collision. For MT this
by itself isn't a big deal because the space is so ludicrously large,
but for the more modern PRNGs with smaller state spaces you definitely
want these correlated streams not to exist at all. Fortunately I
believe this is a design criterion that modern PRNGs take into
account, but yeah, one needs to check this. (This kind of issue is
also why I have a fondness for PRNGs designed on the "get a bigger
hammer" principle, like AES-CTR :-).)

Second, there's the issue that that paper talks about, where *if*
correlated streams exist, then the structure of your PRNG might be
such that when sampling a new seed from an old PRNG, you're
unreasonably likely to hit one of them. (The most obvious example of
this would be if you use an archaic PRNG design that works by
outputting its internal state and then mixing it -- if you use the
output of this PRNG to seed a new PRNG then the two will be identical
modulo some lag!) If your sprouting procedure is subject to this kind
of phenomenon, then it totally invalidates the use of simple
probability theory to analyze the chance of collisions.

But, I ignored this in my analysis because it's definitely solvable
:-). All we need is some deterministic function that we can apply to
the output of our original PRNG that will give us back something
"effectively random" to use as our sprouted seed, totally destroying
all correlations. And this problem of destroying correlations is
well-studied in the crypto world (basically destroying these
correlations *is* the entire problem of cryptographic primitive
design), so even if we have doubts about the specific initialization
functions usually used with MT, we have available to us lots of
primitives that we're *very* confident will do a *very* thorough job,
and sprouting isn't a performance sensitive operations so it doesn't
matter if it takes a little longer using some crypto hash instead of
the classic MT seeding algorithm. And then the simple birthday
calculations are appropriate again.
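A minimal sketch of that last idea, using SHA-256 as the correlation-destroying mixer when sprouting children from a parent RandomState (the function name is invented, and this is an illustration of the approach rather than a vetted scheme):

```python
import hashlib

import numpy as np

def sprout_via_hash(parent, n):
    """Sprout n child RandomStates, pushing the parent's raw output
    through SHA-256 so any structure in the parent's stream is destroyed
    before it becomes a child seed."""
    children = []
    for _ in range(n):
        raw = parent.bytes(64)                 # 512 raw bits from the parent
        digest = hashlib.sha256(raw).digest()  # "effectively random" mixing
        seed = np.frombuffer(digest, dtype=np.uint32)  # 8-word seed array
        children.append(np.random.RandomState(seed))
    return children

parent = np.random.RandomState(12345)
children = sprout_via_hash(parent, 4)
```

Because sprouting happens rarely, the extra cost of the hash is irrelevant; the point is only that the simple birthday calculation applies again.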

-n
--
Nathaniel J. Smith -- https://vorpus.org
Robert Kern
2016-05-22 09:35:50 UTC
Permalink
Post by Nathaniel Smith
Post by Robert Kern
Post by Nathaniel Smith
...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API,
I remain unconvinced on this mark. Grumpily.
Sorry for getting grumpy :-).
And my apologies for some unwarranted hyperbole. I think we're both
converging on a reasonable approach, though.
Post by Nathaniel Smith
The engineering reasons seem pretty
obvious to me though?
Well, I mean, engineers want lots of things. I suspect that most engineers
*really* just want to call `numpy.random.seed(8675309)` at the start and
never explicitly pass around separate streams. There's an upside to that in
terms of code simplicity. There are also significant limitations and
constraints. Ultimately, the upside against the alternative of passing
around RandomState objects is usually outweighed by the limitations, so
best practice is to pass around RandomState objects.

I acknowledge that there exists an upside to the splitting API, but I don't
think it's a groundbreaking improvement over the alternative current best
practice. It's also unclear to me how often situations that really
demonstrate the upside come into play; in my experience a lot of these
situations are already structured such that preallocating N streams is the
natural thing to do. The limitations and constraints are currently
underexplored, IMO; and in this conservative field, pessimism is warranted.
Post by Nathaniel Smith
If you have any use case for independent streams
at all, and you're writing code that's intended to live inside a
library's abstraction barrier, then you need some way to choose your
streams to avoid colliding with arbitrary other code that the end-user
might assemble alongside yours as part of their final program. So
AFAICT you have two options: either you need a "tree-style" API for
allocating these streams, or else you need to add some explicit API to
your library that lets the end-user control in detail which streams
you use. Both are possible, but the latter is obviously undesirable
if you can avoid it, since it breaks the abstraction barrier, making
your library more complicated to use and harder to evolve.
ACK
Post by Nathaniel Smith
Post by Robert Kern
Post by Nathaniel Smith
so whether or not it
turns out to be possible I think we should at least be allowed to have
a discussion about whether there's some way to give it to them.
I'm not shutting down discussion of the option. I *implemented* the option.
I think that discussing whether it should be part of the main API is
premature. There probably ought to be a paper or three out there supporting
its safety and utility first. Let the utility function version flourish
first.
OK -- I guess this particularly makes sense given how
extra-tightly-constrained we currently are in fixing mistakes in
np.random. But I feel like in the end the right place for this really
is inside the RandomState interface, because the person implementing
RandomState is the one best placed to understand (a) the gnarly
technical details here, and (b) how those change depending on the
particular PRNG in use. I don't want to end up with a bunch of
subtly-buggy utility functions in non-specialist libraries like dask
-- so we should be trying to help downstream users figure out how to
actually get this into np.random?
I think this is an open research area. An enterprising grad student could
milk this for a couple of papers analyzing how to do this safely for a
variety of PRNGs. I don't think we can hash this out in an email thread or
PR. So yeah, eventually there might be an API on RandomState for this, but
it's way too premature to do so right now, IMO. Maybe start with a
specialized subclass of RandomState that adds this experimental API. In
ng-numpy-randomstate. ;-)

But if someone has spare time to work on numpy.random, for God's sake, use
Post by Nathaniel Smith
Post by Robert Kern
Post by Nathaniel Smith
It's
not even 100% out of the question that we conclude that existing PRNGs
are buggy because they don't take this use case into account -- it
would be far from the first time that numpy found itself going beyond
the limits of older numerical tools that weren't designed to build the
kind of large composable systems that numpy gets used for.
MT19937's state space is large enough that you could explicitly encode
a "tree seed" into it, even if you don't trust the laws of probability
-- e.g., you start with a RandomState with id [], then its children
have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
1], ..., [1, 0], ..., and you write these into the state (probably
after sending them through some bit-mixing permutation), to guarantee
non-collision.
Sure. Not entirely sure if that can be done without preallocating the
branching factor or depth, but I'm sure there's some fancy combinatoric
representation of an unbounded tree that could be exploited here. It seems
likely to me that such a thing could be done with the stream IDs of PRNGs
that support that.
I'm pretty sure you do have to preallocate both the branching factor
and depth, since getting the collision-free guarantee requires that
across the universe of possible tree addresses, each state id gets
used at most once -- and there are finitely many state ids. But as a
practical matter, saying "you can only sprout up to 2**32 new states
out of any given state, and you can only nest them 600 deep, and
exceeding these bounds is an error" would still be "composable enough"
for ~all practical purposes.
To be honest, I'd even be satisfied with taking the treepath breadcrumbs
and hashing them with a suitable cryptographic hash to get a stream ID.
For, say, PCG64 which has 2**127 settable streams, do
sha256(treepath.asbytes()) % (2**127). I'm okay with spray-and-pray if it's
using well-analyzed cryptographic primitives and enough settable streams.
Subject to actual testing, of course.
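For concreteness, the treepath-to-stream-id hashing could look like this (a sketch; `stream_id_for_path` is an invented name, and the constructor in the comment stands in for whatever settable-stream API ng-numpy-randomstate exposes):

```python
import hashlib

def stream_id_for_path(treepath, id_bits=127):
    """Hash tree-path breadcrumbs like (2, 0, 5) down to a settable
    stream id. Fixed-width encoding keeps distinct paths distinct
    before the hash; SHA-256 does the spraying."""
    encoded = b"".join(int(level).to_bytes(8, "big") for level in treepath)
    digest = hashlib.sha256(encoded).digest()
    return int.from_bytes(digest, "big") % (2 ** id_bits)

# Hand the result to a PRNG with settable streams, e.g. (hypothetical API):
#   rs = pcg64.RandomState(seed=12345, stream=stream_id_for_path((2, 0, 5)))
```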

And that last point concerns me. PRNG quality testing is a dodgy subject as
it is, and the state of the art for testing parallel streams is even worse.
To be honest, making it more convenient to make new streams at a low
library level makes me leery. I do think this is something that needs to be
considered and designed at a high level, irrespective of the available APIs.

--
Robert Kern
Chris Barker
2016-05-23 16:41:11 UTC
Permalink
Post by Robert Kern
Well, I mean, engineers want lots of things. I suspect that most engineers
*really* just want to call `numpy.random.seed(8675309)` at the start and
never explicitly pass around separate streams. There's an upside to that in
terms of code simplicity. There are also significant limitations and
constraints. Ultimately, the upside against the alternative of passing
around RandomState objects is usually outweighed by the limitations, so
best practice is to pass around RandomState objects.
Could we do something like the logging module, and have numpy.random
"manage" a bunch of stream objects for you -- so you could get the default
single stream easily, and also get access to specific streams without
needing to pass around the objects?

That would allow folks to start off writing code the "easy" way -- then
make it easier to refactor when they realize they need multiple independent
streams.

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Robert Kern
2016-05-23 16:54:02 UTC
Permalink
Post by Chris Barker
Post by Robert Kern
Well, I mean, engineers want lots of things. I suspect that most
engineers *really* just want to call `numpy.random.seed(8675309)` at the
start and never explicitly pass around separate streams. There's an upside
to that in terms of code simplicity. There are also significant limitations
and constraints. Ultimately, the upside against the alternative of passing
around RandomState objects is usually outweighed by the limitations, so
best practice is to pass around RandomState objects.
Post by Chris Barker
Could we do something like the logging module, and have numpy.random
"manage" a bunch of stream objects for you -- so you could get the default
single stream easily, and also get access to specific streams without
needing to pass around the objects?

No, I don't think so. The logging module's namespacing doesn't really have
an equivalent use case for PRNGs. We would just be making a much more
complicated global state to manage.

--
Robert Kern

Sturla Molden
2016-05-17 13:40:56 UTC
Permalink
Post by Stephan Hoyer
I have recently encountered several use cases for randomly generate random
1. When writing a library of stochastic functions that take a seed as an
input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].
2. When a library needs to produce results that are reproducible after
calling numpy.random.seed, but that do not want to use the functions in
numpy.random directly. This came up recently in a pandas pull request [2],
because we want to allow using RandomState objects as an alternative to
global state in numpy.random. A major advantage of this approach is that it
provides an obvious alternative to reusing the private numpy.random._mtrand
[3].
What about making numpy.random a finite state machine, and keeping a stack
of RandomState seeds? That is, something similar to what OpenGL does for
its matrices? Then we get two functions, numpy.random.push_seed and
numpy.random.pop_seed.


Sturla
Robert Kern
2016-05-17 13:48:45 UTC
Permalink
Post by Sturla Molden
Post by Stephan Hoyer
I have recently encountered several use cases for randomly generate random
1. When writing a library of stochastic functions that take a seed as an
input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].
2. When a library needs to produce results that are reproducible after
calling numpy.random.seed, but that do not want to use the functions in
numpy.random directly. This came up recently in a pandas pull request [2],
because we want to allow using RandomState objects as an alternative to
global state in numpy.random. A major advantage of this approach is that it
provides an obvious alternative to reusing the private
numpy.random._mtrand
Post by Sturla Molden
Post by Stephan Hoyer
[3].
What about making numpy.random a finite state machine, and keeping a stack
of RandomState seeds? That is, something similar to what OpenGL does for
its matrices? Then we get two functions, numpy.random.push_seed and
numpy.random.pop_seed.
I don't think that addresses the issues brought up here. It's just more
global state to worry about.

--
Robert Kern
Eric Moore
2016-05-17 13:50:41 UTC
Permalink
Post by Stephan Hoyer
Post by Stephan Hoyer
I have recently encountered several use cases for randomly generate
random
Post by Stephan Hoyer
1. When writing a library of stochastic functions that take a seed as an
input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].
2. When a library needs to produce results that are reproducible after
calling numpy.random.seed, but that do not want to use the functions in
numpy.random directly. This came up recently in a pandas pull request
[2],
Post by Stephan Hoyer
because we want to allow using RandomState objects as an alternative to
global state in numpy.random. A major advantage of this approach is that
it
Post by Stephan Hoyer
provides an obvious alternative to reusing the private
numpy.random._mtrand
Post by Stephan Hoyer
[3].
What about making numpy.random a finite state machine, and keeping a stack
of RandomState seeds? That is, something similar to what OpenGL does for
its matrices? Then we get two functions, numpy.random.push_seed and
numpy.random.pop_seed.
I don't like the idea of adding this kind of internal state. Having it
built into the module means that it is shared by all callers, libraries,
user code, etc. That's not the right choice when a stack of seeds could be
easily built around the RandomState object if that is really what someone
needs.
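For illustration, such a caller-held stack is only a few lines (a sketch; the `SeedStack` name and API are invented here):

```python
import numpy as np

class SeedStack:
    """A caller-held stack of RandomState objects: the push/pop
    convenience of the proposed push_seed/pop_seed, without adding
    any global state to numpy.random."""

    def __init__(self, seed=None):
        self._stack = [np.random.RandomState(seed)]

    @property
    def current(self):
        return self._stack[-1]

    def push(self, seed):
        self._stack.append(np.random.RandomState(seed))

    def pop(self):
        if len(self._stack) == 1:
            raise IndexError("cannot pop the base RandomState")
        return self._stack.pop()

stack = SeedStack(12345)
a = stack.current.uniform()   # draws from the base stream
stack.push(8675309)
b = stack.current.uniform()   # draws from the pushed stream
stack.pop()
c = stack.current.uniform()   # base stream resumes where it left off
```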

Eric