Post by Robert KernPost by Nathaniel SmithPost by Robert KernPost by Nathaniel Smith[...]
Post by Robert KernWhat you want is a function that returns many RandomState objects that
are hopefully spread around the MT19937 space enough that they are
essentially independent (in the absence of true jumpahead). The better
root_prng = np.random
root_prng = np.random.RandomState(root_prng)
sprouted_prngs = []
seed_array = root_prng.randint(1<<32, size=624) #
dtype=np.uint32 under 1.11
sprouted_prngs.append(np.random.RandomState(seed_array))
return spourted_prngs
Maybe a nice way to encapsulate this in the RandomState interface would be
a method RandomState.random_state() that generates and returns a new child
RandomState.
I disagree. This is a workaround in the absence of proper jumpahead or
guaranteed-independent streams. I would not encourage it.
Post by Nathaniel SmithPost by Robert KernInternally, this generates seed arrays of about the size of the MT19937
state so make sure that you can access more of the state space. That will at
least make the chance of collision tiny. And it can be easily rewritten to
https://github.com/bashtage/ng-numpy-randomstate
... But unfortunately I'm not sure how to make my interface suggestion
above work on top of one of these RNGs, because for
RandomState.random_state
you really want a tree of independent RNGs and the fancy new PRNGs only
provide a single flat namespace :-/. And even more annoyingly, the tree API
is actually a nicer API, because with a flat namespace you have to know up
front about all possible RNGs your code will use, which is an unfortunate
global coupling that makes it difficult to compose programs out of
independent pieces, while the RandomState.random_state approach composes
beautifully. Maybe there's some clever way to allocate a 64-bit namespace to
make it look tree-like? I'm not sure 64 bits is really enough...
MT19937 doesn't have a "tree" any more than the others. It's the same flat
state space. You are just getting the illusion of a tree by hoping that you
never collide. You ought to think about precisely the same global coupling
issues with MT19937 as you do with guaranteed-independent streams.
Hope-and-prayer isn't really a substitute for properly engineering your
problem. It's just a moral hazard to promote this method to the main API.
Nonsense.
If your definition of "hope and prayer" includes assuming that we
won't encounter a random collision in a 2**19937 state space, then
literally all engineering is hope-and-prayer. A collision could
happen, but if it does it's overwhelmingly more likely to happen
because of a flaw in the mathematical analysis, or a bug in the
implementation, or because random quantum fluctuations caused you and
your program to suddenly be transported to a parallel world where 1 +
1 = 1, than that you just got unlucky with your random state. And all
of these hazards apply equally to both MT19937 and more modern PRNGs.
Granted.
Post by Nathaniel Smith...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API,
I remain unconvinced on this mark. Grumpily.
Sorry for getting grumpy :-). The engineering reasons seem pretty
obvious to me though? If you have any use case for independent streams
at all, and you're writing code that's intended to live inside a
library's abstraction barrier, then you need some way to choose your
streams to avoid colliding with arbitrary other code that the end-user
might assemble alongside yours as part of their final program. So
AFAICT you have two options: either you need a "tree-style" API for
allocating these streams, or else you need to add some explicit API to
your library that lets the end-user control in detail which streams
you use. Both are possible, but the latter is obviously undesireable
if you can avoid it, since it breaks the abstraction barrier, making
your library more complicated to use and harder to evolve.
Post by Robert KernPost by Nathaniel Smithso whether or not it
turns out to be possible I think we should at least be allowed to have
a discussion about whether there's some way to give it to them.
I'm not shutting down discussion of the option. I *implemented* the option.
I think that discussing whether it should be part of the main API is
premature. There probably ought to be a paper or three out there supporting
its safety and utility first. Let the utility function version flourish
first.
OK -- I guess this particularly makes sense given how
extra-tightly-constrained we currently are in fixing mistakes in
np.random. But I feel like in the end the right place for this really
is inside the RandomState interface, because the person implementing
RandomState is the one best placed to understand (a) the gnarly
technical details here, and (b) how those change depending on the
particular PRNG in use. I don't want to end up with a bunch of
subtly-buggy utility functions in non-specialist libraries like dask
-- so we should be trying to help downstream users figure out how to
actually get this into np.random?
Post by Robert KernPost by Nathaniel SmithIt's
not even 100% out of the question that we conclude that existing PRNGs
are buggy because they don't take this use case into account -- it
would be far from the first time that numpy found itself going beyond
the limits of older numerical tools that weren't designed to build the
kind of large composable systems that numpy gets used for.
MT19937's state space is large enough that you could explicitly encode
a "tree seed" into it, even if you don't trust the laws of probability
-- e.g., you start with a RandomState with id [], then its children
have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
1], ..., [1, 0], ..., and you write these into the state (probably
after sending them through some bit-mixing permutation), to guarantee
non-collision.
Sure. Not entirely sure if that can be done without preallocating the
branching factor or depth, but I'm sure there's some fancy combinatoric
representation of an unbounded tree that could be exploited here. It seems
likely to me that such a thing could be done with the stream IDs of PRNGs
that support that.
I'm pretty sure you do have to preallocate both the branching factor
and depth, since getting the collision-free guarantee requires that
across the universe of possible tree addresses, each state id gets
used at most once -- and there are finitely many state ids. But as a
practical matter, saying "you can only sprout up to 2**32 new states
out of any given state, and you can only nest them 600 deep, and
exceeding these bounds is an error" would still be "composable enough"
for ~all practical purposes.
Post by Robert KernPost by Nathaniel SmithOr if you do trust the laws of probability, then the
randomly-sample-a-PRNG approach is not 100% out of the question even
for more modern PRNGs.
Tread carefully here. PRNGs are not RNGs. Using the root PRNG to generate N
new seeds for the same PRNG does not necessarily generate N good,
uncorrelated streams with high probability. Overlap is not the only factor
in play. Long-term correlations happen even in the case where the N streams
do not overlap, and the structure of the root PRNG could well generate N
such correlated seeds.
http://www.adv-radio-sci.net/12/75/2014/
See section 4.4. It does look like one can get good MT19937 streams with
sequential integer seeds, when expanded by the standard sub-PRNG that we use
for that purpose (yay!). However, if you use a different (but otherwise
good-quality and general purpose) sub-PRNG to expand the key, you get
increasing numbers of failures as you increase the number of parallel
streams. It is not explored in the paper exactly why this is the case, but I
suspect that the similarity of the second sub-PRNG algorithm to that of the
MT itself is part of the problem.
It is *likely* that the scheme I implemented will be okay (our array-seed
expansion algorithm is similar to the integer-seed expansion and *probably*
insulates the sprouted MTs from their MT generating source), but that
remains to be demonstrated.
Right... I think the way to think about this is to split it into two
pieces. First, there's the issue that correlated streams exist at all.
That they do for MT is annoying and closely related to why we prefer
other PRNGs these days. When doing the analysis of the collision
probability of randomly sprouted generators, this effectively reduces
the size of the space and increases the risk of collision. For MT this
by itself isn't a big deal because the space is so ludicrously large,
but for the more modern PRNGs with smaller state spaces you definitely
want these correlated streams not to exist at all. Fortunately I
believe this is a design criterion that modern PRNGs take into
account, but yeah, one needs to check this. (This kind of issue is
also why I have a fondness for PRNGs designed on the "get a bigger
hammer" principle, like AES-CTR :-).)
Second, there's the issue that that paper talks about, where *if*
correlated streams exist, then the structure of your PRNG might be
such that when sampling a new seed from an old PRNG, you're
unreasonably likely to hit one of them. (The most obvious example of
this would be if you use an archaic PRNG design that works by
outputting its internal state and then mixing it -- if you use the
output of this PRNG to seed a new PRNG then the two will be identical
modulo some lag!) If your sprouting procedure is subject to this kind
of phenomenon, then it totally invalidates the use of simple
probability theory to analyze the chance of collisions.
But, I ignored this in my analysis because it's definitely solveable
:-). All we need is some deterministic function that we can apply to
the output of our original PRNG that will give us back something
"effectively random" to use as our sprouted seed, totally destroying
all correlations. And this problem of destroying correlations is
well-studied in the crypto world (basically destroying these
correlations *is* the entire problem of cryptographic primitive
design), so even if we have doubts about the specific initialization
functions usually used with MT, we have available to us lots of
primitives that we're *very* confident will do a *very* thorough job,
and sprouting isn't a performance sensitive operations so it doesn't
matter if it takes a little longer using some crypto hash instead of
the classic MT seeding algorithm. And then the simple birthday
calculations are appropriate again.
-n
--
Nathaniel J. Smith -- https://vorpus.org