Discussion:
[Numpy-discussion] Backwards-incompatible improvements to numpy.random.RandomState
Antony Lee
2015-05-24 08:22:21 UTC
Hi,

As mentioned in

#1450: Patch with Ziggurat method for Normal distribution
#5158: ENH: More efficient algorithm for unweighted random choice without
replacement
#5299: using `random.choice` to sample integers in a large range
#5851: Bug in np.random.dirichlet for small alpha parameters

some methods on np.random.RandomState are implemented either non-optimally
(#1450, #5158, #5299) or have outright bugs (#5851), but cannot be easily
changed due to backwards compatibility concerns. While some have suggested
new methods deprecating the old ones (see e.g. #5872), some consensus has
formed around the following ideas (see #5299 for original discussion,
followed by private discussions with @njsmith):

- Backwards compatibility should only be provided to those who were
explicitly instantiating a seeded RandomState object or reseeding a
RandomState object to a given value, and drawing variates from it: using
the global methods (or a None-seeded RandomState) was already
non-reproducible anyways as e.g. other libraries could be drawing variates
from the global RandomState (of which the free functions in np.random are
actually methods). Thus, the global RandomState object should use the
latest implementation of the methods.

- "RandomState(seed)" and "r = RandomState(...); r.seed(seed)" should offer
backwards-compatibility guarantees (see e.g.
https://docs.python.org/3.4/library/random.html#notes-on-reproducibility).

As such, we propose the following improvements to the API:

- RandomState gains a (keyword-only) parameter, "version", also accessible
as a read-only attribute. This indicates the version of the methods on the
object. The current version of RandomState is retroactively assigned
version 0. The latest version is exposed as
np.random.LATEST_VERSION. Backwards-incompatible improvements to
RandomState methods can be introduced, but each must increment LATEST_VERSION.

- The global RandomState is instantiated as
RandomState(version=LATEST_VERSION).

- RandomState() and rs.seed() set the version to LATEST_VERSION.

- RandomState(seed[!=None]) and rs.seed(seed[!=None]) set the version to 0.
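
The version rules in the list above can be sketched as a toy class. This is only an illustration of the proposed semantics, not the code in #5911: the class name `VersionedRandomState`, the value of `LATEST_VERSION`, and the use of the standard library's `random.Random` as the underlying generator are all assumptions made for the example.

```python
import random

LATEST_VERSION = 1  # hypothetical value; the proposal only says it grows over time


class VersionedRandomState:
    """Toy stand-in for the proposed RandomState; not NumPy's implementation."""

    def __init__(self, seed=None, *, version=None):
        # The constructor simply delegates to seed(), with the same signature.
        self.seed(seed, version=version)

    @property
    def version(self):
        # Read-only: there is no setter, so the version changes only via seed().
        return self._version

    def seed(self, seed=None, *, version=None):
        if version is None:
            # Unseeded objects get the newest methods; seeded ones keep the
            # original (version 0) behaviour for backwards compatibility.
            version = LATEST_VERSION if seed is None else 0
        if not 0 <= version <= LATEST_VERSION:
            raise ValueError("unknown version: %r" % (version,))
        self._version = version
        self._bits = random.Random(seed)  # stand-in for the underlying generator
```

With this sketch, `VersionedRandomState().version` is `LATEST_VERSION`, `VersionedRandomState(1234).version` is 0, and a seeded object can still opt in to newer methods with `VersionedRandomState(1234, version=1)`.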

A proof-of-concept implementation, still missing tests, is tracked as
#5911. It includes the patch proposed in #5158 as an example of how to
include an improved version of random.choice.

Comments, and help for writing tests (in particular to make sure backwards
compatibility is maintained) are welcome.

Antony Lee
Ralf Gommers
2015-05-24 08:59:49 UTC
Post by Antony Lee
Hi,
As mentioned in
#1450: Patch with Ziggurat method for Normal distribution
#5158: ENH: More efficient algorithm for unweighted random choice without
replacement
#5299: using `random.choice` to sample integers in a large range
#5851: Bug in np.random.dirichlet for small alpha parameters
some methods on np.random.RandomState are implemented either non-optimally
(#1450, #5158, #5299) or have outright bugs (#5851), but cannot be easily
changed due to backwards compatibility concerns. While some have suggested
new methods deprecating the old ones (see e.g. #5872), some consensus has
formed around the following ideas (see #5299 for original discussion,
- Backwards compatibility should only be provided to those who were
explicitly instantiating a seeded RandomState object or reseeding a
RandomState object to a given value, and drawing variates from it: using
the global methods (or a None-seeded RandomState) was already
non-reproducible anyways as e.g. other libraries could be drawing variates
from the global RandomState (of which the free functions in np.random are
actually methods). Thus, the global RandomState object should use the
latest implementation of the methods.
The rest of the proposal looks good to me, but the reasoning on this point
is shaky. np.random.seed() is *very* widely used, and works fine for a test
suite where each test that needs random numbers calls seed(...) and is run
with nose. Can you explain why you need to touch the behavior of the global
methods in order to make RandomState(version=) work?

Ralf


_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Nathaniel Smith
2015-05-24 09:30:31 UTC
Post by Ralf Gommers
Post by Antony Lee
Hi,
As mentioned in
#1450: Patch with Ziggurat method for Normal distribution
#5158: ENH: More efficient algorithm for unweighted random choice
without replacement
#5299: using `random.choice` to sample integers in a large range
#5851: Bug in np.random.dirichlet for small alpha parameters
some methods on np.random.RandomState are implemented either
non-optimally (#1450, #5158, #5299) or have outright bugs (#5851), but
cannot be easily changed due to backwards compatibility concerns. While
some have suggested new methods deprecating the old ones (see e.g. #5872),
some consensus has formed around the following ideas (see #5299 for
- Backwards compatibility should only be provided to those who were
explicitly instantiating a seeded RandomState object or reseeding a
RandomState object to a given value, and drawing variates from it: using
the global methods (or a None-seeded RandomState) was already
non-reproducible anyways as e.g. other libraries could be drawing variates
from the global RandomState (of which the free functions in np.random are
actually methods). Thus, the global RandomState object should use the
latest implementation of the methods.
Post by Ralf Gommers
The rest of the proposal looks good to me, but the reasoning on this
point is shaky. np.random.seed() is *very* widely used, and works fine for
a test suite where each test that needs random numbers calls seed(...) and
is run with nose. Can you explain why you need to touch the behavior of the
global methods in order to make RandomState(version=) work?

You're absolutely right about it being important to preserve the behavior
of the global functions when seeded, but I think this is just a bug in the
description of the proposal here, not in the proposal itself :-).

If you look at the PR, there's no change to how the global functions work
-- they're still just a transparently thin wrapper around a hidden, global
RandomState object, and thus IIUC changes to RandomState will automatically
apply to the global functions as well.

So with this proposal, an unseeded RandomState uses the latest version ->
therefore the global functions, which start out unseeded, start out using
the latest version. If you call .seed() on an existing RandomState object
and pass in a seed but no version= argument, the version gets reset to 0 ->
therefore if you call the global seed() function and pass in a seed but no
version= argument, the global RandomState gets reset to version 0 (at least
until the next time seed() is called), and backcompat is preserved.

-n
Ralf Gommers
2015-05-24 09:54:24 UTC
Post by Nathaniel Smith
So with this proposal, an unseeded RandomState uses the latest version ->
therefore the global functions, which start out unseeded, start out using
the latest version. If you call .seed() on an existing RandomState object
and pass in a seed but no version= argument, the version gets reset to 0 ->
therefore if you call the global seed() function and pass in a seed but no
version= argument, the global RandomState gets reset to version 0 (at least
until the next time seed() is called), and backcompat is preserved.
Thanks for the clarification. Then +1 from me for this proposal.
Ralf
Alan G Isaac
2015-05-24 12:41:18 UTC
I echo Ralf's question.
For those who need replicability, the proposed upgrade path seems quite radical.

Also, I would prefer to have the new functionality introduced beside the existing
implementation of RandomState, with an announcement that RandomState
will change in the next major numpy version number. This will allow everyone
who wants to to change now, without requiring that users attend to minor
numpy version numbers if they want replicability.

I think this is what is required by semantic versioning.

Alan Isaac
Ralf Gommers
2015-05-24 12:47:34 UTC
Post by Alan G Isaac
I echo Ralf's question.
For those who need replicability, the proposed upgrade path seems quite radical.
It's not radical, and my question was already answered. Nothing changes if
you are doing:

np.random.seed(1234)
np.random.any_random_sample_generator_func()

Values only change if you leave out the call to seed(), which you should
never do if you care about replicability.

Ralf
Alan G Isaac
2015-05-24 13:08:12 UTC
Post by Ralf Gommers
Values only change if you leave out the call to seed()
OK, but this claim seems to conflict with the following language:
"the global RandomState object should use the latest implementation of the methods".
I take it that this is what Nathan meant by
"I think this is just a bug in the description of the proposal here, not in the proposal itself".

So, is the correct phrasing
"the global RandomState object should use the latest implementation of the methods, unless explicitly seeded"?

Thanks,
Alan
j***@gmail.com
2015-05-24 15:04:11 UTC
Post by Alan G Isaac
Post by Ralf Gommers
Values only change if you leave out the call to seed()
"the global RandomState object should use the latest implementation of the methods".
I take it that this is what Nathan meant by
"I think this is just a bug in the description of the proposal here, not
in the proposal itself".
So, is the correct phrasing
"the global RandomState object should use the latest implementation of the
methods, unless explicitly seeded"?
that's how I understand it.

I don't see any problems with the clarified proposal for the use cases that
I know of.

Can we choose the version also for the global random state, for example to
fix both version and seed in unit tests, with version > 0?


BTW: I would expect that bug fixes are still exempt from backwards
compatibility.

fixing #5851 should be independent of the version, (without having looked
at the issue)

(If you need to replicate bugs, then use an old version of a package.)

Josef
Anne Archibald
2015-05-24 15:13:49 UTC
Do we want a deprecation-like approach, so that eventually people who want
replicability will specify versions, and everyone else gets bug fixes and
improvements? This would presumably take several major versions, but it
might avoid people getting unintentionally trapped on this version.

Incidentally, bug fixes are complicated: if a bug fix uses more or fewer
raw random numbers, it breaks repeatability not just for the call that got
fixed but for all successive random number generations.
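
The knock-on effect can be seen with a small sketch. Everything here is hypothetical: `draw_v0` and `draw_v1` stand for a method before and after a fix that happens to consume one more raw number, and the standard library's `random.Random` stands in for the shared stream.

```python
import random


def draw_v0(bits):
    # Original implementation: consumes one raw draw.
    return bits.random()


def draw_v1(bits):
    # Hypothetical "fixed" implementation: consumes two raw draws.
    bits.random()  # extra draw, e.g. for an added rejection step
    return bits.random()


def next_after(draw, seed=1234):
    """Return the first unrelated draw made after calling `draw` once."""
    bits = random.Random(seed)
    draw(bits)            # the call whose implementation changed
    return bits.random()  # a later draw from the same stream


# Same seed, but the later draw differs because the stream position shifted.
print(next_after(draw_v0) == next_after(draw_v1))
```

The later draw never touches the changed function, yet its value depends on how many raw numbers that function consumed.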

Anne
j***@gmail.com
2015-05-24 15:40:06 UTC
Post by Anne Archibald
Do we want a deprecation-like approach, so that eventually people who want
replicability will specify versions, and everyone else gets bug fixes and
improvements? This would presumably take several major versions, but it
might avoid people getting unintentionally trapped on this version.
Incidentally, bug fixes are complicated: if a bug fix uses more or fewer
raw random numbers, it breaks repeatability not just for the call that got
fixed but for all successive random number generations.
Reminder: we are bottom or inline posting
Post by Anne Archibald
Anne
Post by j***@gmail.com
Post by Alan G Isaac
Post by Ralf Gommers
Values only change if you leave out the call to seed()
"the global RandomState object should use the latest implementation of the methods".
I take it that this is what Nathan meant by
"I think this is just a bug in the description of the proposal here, not
in the proposal itself".
So, is the correct phrasing
"the global RandomState object should use the latest implementation of
the methods, unless explicitly seeded"?
that's how I understand it.
I don't see any problems with the clarified proposal for the use cases
that I know of.
Can we choose the version also for the global random state, for example
to fix both version and seed in unit tests, with version > 0?
BTW: I would expect that bug fixes are still exempt from backwards
compatibility.
fixing #5851 should be independent of the version, (without having
looked at the issue)
I skimmed the issue.
In a strict sense it's not really a bug, the user doesn't get wrong
numbers, he or she gets Not A Number.

So there are no current usages that use the function in that range.

Josef
Nathaniel Smith
2015-05-24 17:49:22 UTC
Post by j***@gmail.com
Reminder: we are bottom or inline posting
Can we stop hassling people about this? Inline replies are a great tool to
have in your toolkit for complicated technical discussions, but I feel like
our weird insistence on them has turned into a pointless and exclusionary
thing. It's not like bottom replying is even any better -- the traditional
mailing list rule is you trim quotes to just the part you're replying to
(like this message); quoting the whole thing and replying underneath just
to give people a bit of exercise for their scrolling finger would totally
have gotten you flamed too.

But email etiquette has moved on since the 90s, even regular posters to
this list violate this "rule" all the time, it's time to let it go.

-n
j***@gmail.com
2015-05-24 18:01:28 UTC
Post by Nathaniel Smith
Post by j***@gmail.com
Reminder: we are bottom or inline posting
Can we stop hassling people about this? Inline replies are a great tool to
have in your toolkit for complicated technical discussions, but I feel like
our weird insistence on them has turned into a pointless and exclusionary
thing. It's not like bottom replying is even any better -- the traditional
mailing list rule is you trim quotes to just the part you're replying to
(like this message); quoting the whole thing and replying underneath just
to give people a bit of exercise for their scrolling finger would totally
have gotten you flamed too.
But email etiquette has moved on since the 90s, even regular posters to
this list violate this "rule" all the time, it's time to let it go.
It's not a 90's thing and I learned about it around 2009 when I started in
here.
I find it very annoying trying to catch up with a longer thread and the
replies are all over the place.


Anne is a few years older than I in terms of numpy and scipy participation
and this was just intended to be a friendly reminder.

And as BTW: I'm glad Anne is back with scipy.


Josef
Nathaniel Smith
2015-05-24 20:39:52 UTC
Post by j***@gmail.com
Post by Nathaniel Smith
Post by j***@gmail.com
Reminder: we are bottom or inline posting
Can we stop hassling people about this? Inline replies are a great tool
to have in your toolkit for complicated technical discussions, but I feel
like our weird insistence on them has turned into a pointless and
exclusionary thing. It's not like bottom replying is even any better -- the
traditional mailing list rule is you trim quotes to just the part you're
replying to (like this message); quoting the whole thing and replying
underneath just to give people a bit of exercise for their scrolling finger
would totally have gotten you flamed too.
Post by j***@gmail.com
Post by Nathaniel Smith
But email etiquette has moved on since the 90s, even regular posters to
this list violate this "rule" all the time, it's time to let it go.
Post by j***@gmail.com
It's not a 90's thing and I learned about it around 2009 when I started
in here.
I find it very annoying trying to catch up with a longer thread and the
replies are all over the place.
Anne is a few years older than I in terms of numpy and scipy
participation and this was just intended to be a friendly reminder.

And while I know you didn't mean it this way, I'm guessing that being
immediately greeted by criticism for failing to follow some arbitrary and
inconsistently-applied rule was indeed a strong reminder of what an
unpleasant place FOSS mailing lists can sometimes be, and why someone might
disappear from them for a few years. I think we can do better.

This is pretty off-topic for this thread, though, so let's let it lie
here. If anyone desperately needs to comment further please email me
off-list.

-n
Nathaniel Smith
2015-05-24 18:04:39 UTC
Post by Anne Archibald
Do we want a deprecation-like approach, so that eventually people who
want replicability will specify versions, and everyone else gets bug fixes
and improvements? This would presumably take several major versions, but it
might avoid people getting unintentionally trapped on this version.

I'm not sure what you're envisioning as needing a deprecation cycle? The
neat thing about random is that we already have a way for users to say that
they want replicability -- the use of an explicit seed -- so we can just
immediately go to the world you describe, where people who seed get to pick
their version (or default to version 0 for backcompat), and everyone else
gets the improvements automatically. Or is this different from what you
meant somehow?

Fortunately we haven't yet run into any really serious bugs in random, like
"oops we're sampling from the wrong distribution" type bugs. Mostly it's
more like "oops this is really inefficient" or "oops this crashes in this
edge case", so there's no real harm in letting people use old versions. If
we did run into a case where we were giving flat out wrong results, then I
guess we'd still want to keep the code around because reproducibility is
still important, but perhaps with a requirement that you pass an extra
argument like I_know_its_broken=True or something so that people couldn't
end up running the broken code accidentally? I guess we'll cross that
bridge when we come to it.
Post by Anne Archibald
Incidentally, bug fixes are complicated: if a bug fix uses more or fewer
raw random numbers, it breaks repeatability not just for the call that got
fixed but for all successive random number generations.

Yep. This is why we mostly haven't been able to change behavior at *all*
except in cases where there was a clear error so we know no-one was using
something.

-n
Sturla Molden
2015-05-24 18:56:17 UTC
Post by Nathaniel Smith
I'm not sure what you're envisioning as needing a deprecation cycle? The
neat thing about random is that we already have a way for users to say
that they want replicability -- the use of an explicit seed --
No, this is not sufficient for random numbers. Random sampling and
ziggurat generators are examples. If we introduce a change (e.g. a
bugfix) that will affect the number of calls to the entropy source, just
setting the seed will in general not be enough to ensure backwards
compatibility. That is e.g. the case with using ziggurat samplers
instead of the current transcendental transforms for normal, exponential
and gamma distributions. While ziggurat is faster and (to my knowledge)
more accurate, it will also make a different number of calls to the
entropy source, and hence the whole sequence will be affected, even if
you do set a random seed.
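
To make the point about the number of calls concrete, here is a rough sketch that counts raw draws for a fixed-cost transform and for a rejection sampler. Neither function is NumPy's code, and the rejection sampler is the Marsaglia polar method, chosen only because it is short; it is not a ziggurat. `CountingBits` is a hypothetical wrapper around the standard library's `random.Random`.

```python
import math
import random


class CountingBits:
    """Wraps random.Random and counts raw uniform draws."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.calls = 0

    def uniform(self):
        self.calls += 1
        return self._rng.random()


def normal_transform(bits):
    # Box-Muller: always exactly two uniforms per variate returned here.
    u1 = 1.0 - bits.uniform()  # shift to (0, 1] so log() is defined
    u2 = bits.uniform()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def normal_rejection(bits):
    # Marsaglia polar method: a rejection loop, so the number of uniforms
    # consumed depends on the draws themselves.
    while True:
        x = 2.0 * bits.uniform() - 1.0
        y = 2.0 * bits.uniform() - 1.0
        s = x * x + y * y
        if 0.0 < s < 1.0:
            return x * math.sqrt(-2.0 * math.log(s) / s)


def raw_draws(sampler, n, seed=1234):
    """Count the raw uniforms consumed while drawing n variates."""
    bits = CountingBits(seed)
    for _ in range(n):
        sampler(bits)
    return bits.calls
```

Comparing `raw_draws(normal_transform, 1000)` with `raw_draws(normal_rejection, 1000)` shows the two samplers leaving the stream at different positions for the same seed, which is why a seed alone cannot pin the output once the algorithm changes.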


Sturla
Robert Kern
2015-05-24 19:25:39 UTC
Post by Sturla Molden
Post by Nathaniel Smith
I'm not sure what you're envisioning as needing a deprecation cycle? The
neat thing about random is that we already have a way for users to say
that they want replicability -- the use of an explicit seed --
No, this is not sufficient for random numbers. Random sampling and
ziggurat generators are examples. If we introduce a change (e.g. a
bugfix) that will affect the number of calls to the entropy source, just
setting the seed will in general not be enough to ensure backwards
compatibility. That is e.g. the case with using ziggurat samplers
instead of the current transcendental transforms for normal, exponential
and gamma distributions. While ziggurat is faster and (to my knowledge)
more accurate, it will also make a different number of calls to the
entropy source, and hence the whole sequence will be affected, even if
you do set a random seed.
Please reread the proposal at the top of the thread.

--
Robert Kern
Antony Lee
2015-05-24 20:15:04 UTC
Thanks to Nathaniel who has indeed clarified my intent, i.e. "the global
RandomState should use the latest implementation, unless explicitly
seeded". More generally, the `RandomState` constructor is just a thin
wrapper around `seed` with the same signature, so one can swap the version
of the global functions with a call to `np.random.seed(version=...)`.
Sturla Molden
2015-05-24 18:46:50 UTC
Post by Anne Archibald
Do we want a deprecation-like approach, so that eventually people who
want replicability will specify versions, and everyone else gets bug
fixes and improvements? This would presumably take several major
versions, but it might avoid people getting unintentionally trapped on
this version.
Incidentally, bug fixes are complicated: if a bug fix uses more or fewer
raw random numbers, it breaks repeatability not just for the call that
got fixed but for all successive random number generations.
If a function has a bug, changing it will change the output of the
function. This is not special for random numbers. If not retaining the
old erroneous output means we break backwards compatibility, then no
bugs can ever be fixed, anywhere in NumPy. I think we need to clarify
what we mean by backwards compatibility for random numbers. What
guarantees should we make from one version to another?


Sturla
Robert Kern
2015-05-24 19:22:32 UTC
Post by Sturla Molden
Post by Anne Archibald
Do we want a deprecation-like approach, so that eventually people who
want replicability will specify versions, and everyone else gets bug
fixes and improvements? This would presumably take several major
versions, but it might avoid people getting unintentionally trapped on
this version.
Incidentally, bug fixes are complicated: if a bug fix uses more or fewer
raw random numbers, it breaks repeatability not just for the call that
got fixed but for all successive random number generations.
If a function has a bug, changing it will change the output of the
function. This is not special for random numbers. If not retaining the
old erroneous output means we break backwards compatibility, then no
bugs can ever be fixed, anywhere in NumPy. I think we need to clarify
what we mean by backwards compatibility for random numbers. What
guarantees should we make from one version to another?
The policy thus far has been that we will fix bugs in the distributions and
make changes that allow a strictly wider domain of distribution parameters
(e.g. allowing b==0 where before we only allowed b>0), but we will not make
other enhancements that would change existing good output.
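
As a rough illustration of what "a strictly wider domain" means in practice, the sketch below compares a hypothetical sampler before and after b==0 is allowed. `draw_old`, `draw_new`, and the scaled-uniform distribution are invented for the example; the point is only that every previously valid parameter still yields the same stream.

```python
import random


def draw_old(bits, b):
    # Before: only b > 0 accepted.
    if b <= 0:
        raise ValueError("b must be > 0")
    return b * bits.random()


def draw_new(bits, b):
    # After: b == 0 is also accepted (a degenerate distribution at 0).
    if b < 0:
        raise ValueError("b must be >= 0")
    return b * bits.random()


def same_stream(b, seed=1234, n=5):
    """True if both versions give identical output for parameter b."""
    old, new = random.Random(seed), random.Random(seed)
    return all(draw_old(old, b) == draw_new(new, b) for _ in range(n))
```

A check like `same_stream(2.5)` passing for the old domain is the property that lets such a change through without a version bump.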

--
Robert Kern
Sturla Molden
2015-05-24 20:30:59 UTC
Post by Antony Lee
Comments, and help for writing tests (in particular to make sure
backwards compatibility is maintained) are welcome.
I have one comment, and that is what makes random numbers so special?
This applies to the rest of NumPy too, fixing a bug can sometimes change
the output of a function.

Personally I think we should only make guarantees about the data types,
array shapes, and things like that, but not about the values. Those who
need a particular version of NumPy for exact reproducibility should
install the version of Python and NumPy they need. That is why virtual
environments exist.

I am sure a lot will disagree with me on this. So please don't take this
as flamebait.


Sturla
Antony Lee
2015-05-24 21:09:43 UTC
Post by Sturla Molden
Post by Antony Lee
Comments, and help for writing tests (in particular to make sure
backwards compatibility is maintained) are welcome.
I have one comment, and that is what makes random numbers so special?
This applies to the rest of NumPy too, fixing a bug can sometimes change
the output of a function.
Personally I think we should only make guarantees about the data types,
array shapes, and things like that, but not about the values. Those who
need a particular version of NumPy for exact reproducibility should
install the version of Python and NumPy they need. That is why virtual
environments exist.
I personally agree with this point of view (see original discussion in
#5299, for example); if it was only up to me at least I'd make
RandomState(seed) default to the latest version rather than the original
one (whether to keep the old versions around is another question). On the
other hand, I see that this long-standing debate has prevented obvious
improvements from being added sometimes for years (e.g. a patch for
Ziggurat normal variates has been lying around since 2010), or led to
potential API duplication in order to fix some clearly undesirable behavior
(dirichlet returning "nan" being described as "in a strict sense not really
a bug"(!)), so I'm willing to compromise to get this moving forward.

Antony
j***@gmail.com
2015-05-24 21:49:17 UTC
Post by Antony Lee
Post by Sturla Molden
Post by Antony Lee
Comments, and help for writing tests (in particular to make sure
backwards compatibility is maintained) are welcome.
I have one comment, and that is what makes random numbers so special?
This applies to the rest of NumPy too, fixing a bug can sometimes change
the output of a function.
Personally I think we should only make guarantees about the data types,
array shapes, and things like that, but not about the values. Those who
need a particular version of NumPy for exact reproducibility should
install the version of Python and NumPy they need. That is why virtual
environments exist.
I personally agree with this point of view (see original discussion in
#5299, for example); if it was only up to me at least I'd make
RandomState(seed) default to the latest version rather than the original
one (whether to keep the old versions around is another question). On the
other hand, I see that this long-standing debate has prevented obvious
improvements from being added sometimes for years (e.g. a patch for
Ziggurat normal variates has been lying around since 2010), or led to
potential API duplication in order to fix some clearly undesirable behavior
(dirichlet returning "nan" being described as "in a strict sense not really
a bug"(!)), so I'm willing to compromise to get this moving forward.
It's clearly a different kind of "bug" than some of the ones we fixed in
the past without backwards compatibility discussion where the distribution
was wrong, i.e. some values shifted so parts have more weight and parts
have less weight.

As I mentioned, I don't see any real problem with the proposal.

Josef
Daπid
2015-05-25 11:14:08 UTC
Post by Sturla Molden
Personally I think we should only make guarantees about the data types,
array shapes, and things like that, but not about the values. Those who
need a particular version of NumPy for exact reproducibility should
install the version of Python and NumPy they need. That is why virtual
environments exist.
But there is a lot of legacy code out there that doesn't specify the
version required; and in most cases the original author cannot even be
asked.

Tests are a particularly annoying case. For example, when testing an
algorithm, it is usually good practice to record the number of iterations as
well as the result; consider it an early warning that we have changed
something we possibly didn't mean to, even if the result is correct. If we
want to support several NumPy versions, and the algorithm has any
randomness, we would have to either duplicate the tests or find a seed that
gives the exact same results. Thus, keeping different versions lets us
compare the results against the old API, without needing to duplicate the
tests. Far fewer people will get annoyed.


/David.
Antony Lee
2015-05-29 21:06:39 UTC
Post by Antony Lee
A proof-of-concept implementation, still missing tests, is tracked as
#5911. It includes the patch proposed in #5158 as an example of how to
include an improved version of random.choice.
Tests are in now (whether we should bundle in pickles of old versions to
make sure they are still unpickled correctly and outputs of old random
streams to make sure they are still reproduced is a good question, though).
Comments welcome.
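
One possible shape for such a check, sketched with the standard library's `random.Random` as a stand-in for RandomState: pickle a generator mid-stream, unpickle it, and compare what it produces against a reference stream. The helper names are hypothetical. A real compatibility test would load pickle bytes and expected outputs that were stored by an older release; this sketch generates both in-process, so it only exercises the round trip.

```python
import pickle
import random


def snapshot(seed, n_skip):
    """Pickle a generator after it has produced n_skip draws."""
    rng = random.Random(seed)
    for _ in range(n_skip):
        rng.random()
    return pickle.dumps(rng)


def continue_stream(blob, n):
    """Unpickle a generator and return its next n draws."""
    rng = pickle.loads(blob)
    return [rng.random() for _ in range(n)]


def reference_stream(seed, n_skip, n):
    """The same draws produced without any pickling, for comparison."""
    rng = random.Random(seed)
    for _ in range(n_skip):
        rng.random()
    return [rng.random() for _ in range(n)]
```

The test then reduces to `continue_stream(snapshot(1234, 10), 5) == reference_stream(1234, 10, 5)`, with the left-hand inputs replaced by stored files once old-release fixtures exist.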

Antony
Antony Lee
2015-06-09 17:07:59 UTC
Post by Antony Lee
A proof-of-concept implementation, still missing tests, is tracked as
#5911. It includes the patch proposed in #5158 as an example of how to
include an improved version of random.choice.
Tests are in now (whether we should bundle in pickles of old versions to
make sure they are still unpickled correctly and outputs of old random
streams to make sure they are still reproduced is a good question, though).
Comments welcome.
Kindly bumping the issue.

Antony
