Discussion:
[Numpy-discussion] Question about numpy.random.choice with probabilities
Nadav Har'El
2017-01-17 09:55:58 UTC
Hi, I'm looking for a way to find a random sample of C different items out
of N items, with some desired probability Pi for each item i.

I saw that numpy has a function that supposedly does this,
numpy.random.choice (with replace=False and a probabilities array), but
looking at the algorithm actually implemented, I am wondering in what sense
the probabilities Pi are actually obeyed...

To me, the code doesn't seem to be doing the right thing... Let me explain:

Consider a simple numerical example: We have 3 items, and need to pick 2
different ones randomly. Let's assume the desired probabilities for item 1,
2 and 3 are: 0.2, 0.4 and 0.4.

Working out the equations, there is exactly one solution here: the random
outcome of numpy.random.choice in this case should be [1,2] with probability
0.2, [1,3] with probability 0.2, and [2,3] with probability 0.6. That is
indeed a solution for the desired probabilities, because it yields item 1 in
P([1,2]) + P([1,3]) = 0.2 + 0.2 = 0.4 = 2*P1 of the trials, item 2 in
P([1,2]) + P([2,3]) = 0.2 + 0.6 = 0.8 = 2*P2 of the trials, etc.

However, the algorithm that numpy.random.choice's replace=False actually
implements generates, if I understand correctly, different probabilities for
the outcomes: I believe in this case it generates [1,2] with probability
0.23333, [1,3] also with probability 0.23333, and [2,3] with probability
0.53333.

My question is: how does this result fit the desired probabilities?

If we get [1,2] with probability 0.23333 and [1,3] with probability 0.23333,
then the expected number of "1" results per draw is 0.23333 + 0.23333 =
0.46666; similarly, the expected number of "2" results is 0.76666, and of
"3" results 0.76666. As you can see, the proportions are off: item 2 is NOT
twice as common as item 1, as we originally desired (we asked for
probabilities 0.2, 0.4, 0.4 for the individual items!).
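
A quick simulation illustrates these numbers (a sketch only; indices 0, 1, 2
stand for items 1, 2, 3, and the trial count is arbitrary):

import numpy as np
from collections import Counter

rng = np.random.RandomState(0)
p = [0.2, 0.4, 0.4]

# Tally which (unordered) pair each 2-out-of-3 draw without
# replacement produces.
counts = Counter()
n_trials = 100000
for _ in range(n_trials):
    pair = tuple(sorted(rng.choice(3, size=2, replace=False, p=p)))
    counts[pair] += 1

for pair, count in sorted(counts.items()):
    print(pair, count / n_trials)
# prints roughly:
#   (0, 1) 0.2333
#   (0, 2) 0.2333
#   (1, 2) 0.5333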


--
Nadav Har'El
***@scylladb.com
a***@gmail.com
2017-01-17 17:18:59 UTC
Hi Nadav,

I may be wrong, but I think that the result of the current implementation
is actually the expected one.
Using your example: the probabilities for items 1, 2 and 3 are 0.2, 0.4 and 0.4.

P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2])

Now, P([1]) = 0.2 and P([2]) = 0.4. However:
P([2] | 1st=[1]) = 0.5 (2 and 3 have the same sampling probability)
P([1] | 1st=[2]) = 1/3 (1 and 3 have probability 0.2 and 0.4 that, once
normalised, translate into 1/3 and 2/3 respectively)
Therefore P([1,2]) = 0.7/3 = 0.23333
Similarly, P([1,3]) = 0.23333 and P([2,3]) = 1.6/3 = 0.533333
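
This two-step calculation is easy to enumerate exactly; a minimal sketch of
the computation above:

p = {1: 0.2, 2: 0.4, 3: 0.4}

# Enumerate the draw-then-renormalise procedure: pick a first item
# according to p, then a second item from the renormalised remainder.
pair_prob = {}
for first, p_first in p.items():
    remaining = {k: v for k, v in p.items() if k != first}
    total = sum(remaining.values())
    for second, p_second in remaining.items():
        key = tuple(sorted((first, second)))
        pair_prob[key] = pair_prob.get(key, 0.0) + p_first * p_second / total

print(pair_prob)
# {(1, 2): 0.2333..., (1, 3): 0.2333..., (2, 3): 0.5333...}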

What am I missing?

Alessandro
Nadav Har'El
2017-01-17 21:13:26 UTC
Post by a***@gmail.com
Hi Nadav,
I may be wrong, but I think that the result of the current implementation
is actually the expected one.
Using your example: the probabilities for items 1, 2 and 3 are 0.2, 0.4 and 0.4.
P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2])
Yes, this formula does fit well with the actual algorithm in the code. But,
Post by a***@gmail.com
P([2] | 1st=[1]) = 0.5 (2 and 3 have the same sampling probability)
P([1] | 1st=[2]) = 1/3 (1 and 3 have probability 0.2 and 0.4 that,
once normalised, translate into 1/3 and 2/3 respectively)
Therefore P([1,2]) = 0.7/3 = 0.23333
Similarly, P([1,3]) = 0.23333 and P([2,3]) = 1.6/3 = 0.533333
Right, these are the numbers that the algorithm in the current code, and
the formula above, produce:

P([1,2]) = P([1,3]) = 0.23333
P([2,3]) = 0.53333

What I'm puzzled about is that these probabilities do not really fulfill
the given probability vector 0.2, 0.4, 0.4...
Let me try to explain:

Why did the user choose the probabilities 0.2, 0.4, 0.4 for the three items
in the first place?

One reasonable interpretation is that the user wants his random picks to
contain item 1 half as often as item 2 or item 3.
For example, maybe item 1 costs twice as much as item 2 or 3, so picking it
half as often will result in an equal expenditure on each item.

If the user randomly picks the items individually (a single item at a
time), he indeed gets exactly this distribution: 0.2 of the time item 1,
0.4 of the time item 2, 0.4 of the time item 3.

Now, what happens if he picks not individual items but pairs of different
items, using numpy.random.choice with two items and replace=False?
Suddenly, the distribution of the individual items in the results gets
skewed: if we look at the expected number of times we'll see each item in
one draw of a random pair, we get:

E(1) = P([1,2]) + P([1,3]) = 0.46666
E(2) = P([1,2]) + P([2,3]) = 0.76666
E(3) = P([1,3]) + P([2,3]) = 0.76666

Or renormalizing by dividing by 2:

P(1) = 0.233333
P(2) = 0.383333
P(3) = 0.383333

As you can see, these are not quite the probabilities we wanted (which were
0.2, 0.4, 0.4)! In the random pairs we picked, item 1 appeared a bit more
often than we wanted, and items 2 and 3 a bit less often!

Hence my question of why we consider these numbers right.

In this example, it's actually possible to get the right item distribution,
if we pick the pair outcomes with the following probabilities:

P([1,2]) = 0.2 (not 0.233333 as above)
P([1,3]) = 0.2
P([2,3]) = 0.6 (not 0.533333 as above)

Then, we get exactly the right P(1), P(2), P(3): 0.2, 0.4, 0.4

Interestingly, fixing things as I suggest is not always possible.
Consider a different probability-vector example for three items - 0.99,
0.005, 0.005. Now, no matter which algorithm we use for randomly picking
pairs from these three items, *each* returned pair will inevitably contain
one of the two very-low-probability items, so each of those items will
appear in roughly half the pairs, instead of in a vanishingly small
percentage as we hoped.

But for other choices of probabilities (like the one in my original
example), there is a solution. For 2-out-of-3 sampling we can actually write
down a system of three linear equations in three variables; algebraically it
always has exactly one solution, but if that solution has components that
are not valid probabilities (not in [0,1]), we end up with no usable
solution - as happens in the 0.99, 0.005, 0.005 example.
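
For the 2-out-of-3 case, that system is small enough to write down and
solve directly. A minimal sketch (the helper function and its name are
mine, for illustration):

import numpy as np

def pair_probabilities(p):
    # Each item i appears in exactly two of the three pairs, and each
    # draw contains two items, so the marginals require:
    #   P12 + P13 = 2*p1,  P12 + P23 = 2*p2,  P13 + P23 = 2*p3
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
    x = np.linalg.solve(A, 2.0 * np.asarray(p, dtype=float))
    if np.any(x < 0) or np.any(x > 1):
        return None  # the desired marginals are unattainable
    return dict(zip([(1, 2), (1, 3), (2, 3)], x))

print(pair_probabilities([0.2, 0.4, 0.4]))
# {(1, 2): 0.2, (1, 3): 0.2, (2, 3): 0.6}, up to rounding
print(pair_probabilities([0.99, 0.005, 0.005]))
# None: P([2,3]) would have to be negative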
j***@gmail.com
2017-01-17 22:25:39 UTC


I think the underlying problem is that in the sampling space the events
(1,2), (1,3), (2,3) are correlated, and because of the discreteness an
arbitrary marginal distribution on the individual events 1, 2, 3 is not
possible.

A related aside:
I'm not able (or willing) to spend the time on the math, but I just went
through something similar for survey sampling in a finite population (e.g.
surveying two out of 3 individuals, where 3 is the population size), leading
to the Horvitz–Thompson estimator. The books have chapters on different
sampling schemes and on deriving the marginal and joint probabilities of
being surveyed.

(I gave up on sampling without replacement, and assume we have a large
population where it doesn't make a difference.)

In some of the sampling schemes they pick sequentially and adjust the
probabilities for the remaining individuals. That seems to provide more
flexibility to create a desired or optimal sampling scheme.

Josef
a***@gmail.com
2017-01-17 23:58:05 UTC
Post by Nadav Har'El
Post by a***@gmail.com
Hi Nadav,
I may be wrong, but I think that the result of the current implementation
is actually the expected one.
Using your example: the probabilities for items 1, 2 and 3 are 0.2, 0.4 and 0.4.
P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2])
Yes, this formula does fit well with the actual algorithm in the code.
Just a note: this formula is correct, and it is one of the fundamental laws
of statistics: https://en.wikipedia.org/wiki/Law_of_total_probability +
https://en.wikipedia.org/wiki/Bayes%27_theorem
Thus, the result we get from random.choice IMHO definitely makes sense. Of
course, I think we could always discuss implementing other sampling methods
if they are useful to some application.
Post by Nadav Har'El
As you can see, these are not quite the probabilities we wanted (which were
0.2, 0.4, 0.4)! In the random pairs we picked, item 1 appeared a bit more
often than we wanted, and items 2 and 3 a bit less often!
p is not the probability of the output but that of the source finite
population. I think that if you want to preserve that distribution, as Josef
pointed out, you have to make the extractions independent: that is, either
sample with replacement or approximate an infinite population (which is
basically the same thing). But of course in this case you will also end up
with events [X,X].
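
A quick check of this (a sketch with an arbitrary trial count):

import numpy as np

rng = np.random.RandomState(0)
p = [0.2, 0.4, 0.4]

# With replacement the two draws are independent, so each drawn item
# follows p exactly -- at the cost of duplicate pairs [X,X].
draws = rng.choice(3, size=(100000, 2), replace=True, p=p)
print(np.bincount(draws.ravel()) / draws.size)  # ~ [0.2, 0.4, 0.4]
print(np.mean(draws[:, 0] == draws[:, 1]))      # ~ 0.36 of pairs are [X,X]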
j***@gmail.com
2017-01-18 00:51:25 UTC
With replacement and keeping duplicates, the results might also be similar
in the pattern of the marginal probabilities:
https://onlinecourses.science.psu.edu/stat506/node/17

Another approach in survey sampling is to drop duplicates in
with-replacement sampling, but then the sample size itself is random.
(Again, I didn't try to understand the small print.)

(Another related aside: the problem with a discrete sample space in small
samples also shows up in calculating hypothesis tests, e.g. Fisher's exact
test or similar. Because we only get a few discrete possibilities in the
sample space, it is not possible to construct a test that has exactly the
desired type 1 error.)


Josef
Nadav Har'El
2017-01-18 08:35:30 UTC
Post by a***@gmail.com
Just a note: this formula is correct, and it is one of the fundamental laws
of statistics: https://en.wikipedia.org/wiki/Law_of_total_probability +
https://en.wikipedia.org/wiki/Bayes%27_theorem
Hi,

Yes, of course the formula is correct, but that doesn't rule out that we're
applying it in the wrong context.

I'll be honest here: I came to numpy.random.choice after I had actually
coded a similar algorithm (with the same results) myself, because like you I
thought this was the "obvious" and correct algorithm. Only then did I
realize that its output doesn't actually produce the desired probabilities
specified by the user - even in the cases where that is possible. And I
started wondering if existing libraries - like numpy - do this differently.
It turns out numpy does it (basically) in the same way as my algorithm.
Post by a***@gmail.com
Thus, the result we get from random.choice IMHO definitely makes sense.
Let's look at what the user asked of this function, and what it returns:

User asks: please give me random pairs of the three items, where item 1 has
probability 0.2, item 2 has 0.4, and 3 has 0.4.

Function returns: random pairs where, if you collect many returned results
(as in the law of large numbers) and look at the items they contain, item 1
makes up 0.23333 of the items, item 2 makes up 0.38333, and item 3 makes up
0.38333.
These are not (quite) the probabilities the user asked for...

Can you explain a sense in which the user's requested probabilities (0.2,
0.4, 0.4) are actually adhered to in the results that random.choice returns?

Thanks,
Nadav Har'El.
a***@gmail.com
2017-01-18 09:00:50 UTC
Post by Nadav Har'El
User asks: please give me random pairs of the three items, where item 1
has probability 0.2, item 2 has 0.4, and 3 has 0.4.
Function returns: random pairs where, if you collect many returned results
(as in the law of large numbers) and look at the items they contain, item 1
makes up 0.23333 of the items, item 2 makes up 0.38333, and item 3 makes up
0.38333.
These are not (quite) the probabilities the user asked for...
Can you explain a sense in which the user's requested probabilities (0.2,
0.4, 0.4) are actually adhered to in the results that random.choice returns?
I think that the question the user is asking by specifying p is a slightly
different one:
"please give me random pairs of the three items, extracted from a
population of 3 items where item 1 has a probability of being extracted of
0.2, item 2 has 0.4, and item 3 has 0.4. Also, please remove items once
extracted."
Nadav Har'El
2017-01-18 09:52:45 UTC
Post by a***@gmail.com
I think that the question the user is asking by specifying p is a slightly
different one:
"please give me random pairs of the three items, extracted from a
population of 3 items where item 1 has a probability of being extracted of
0.2, item 2 has 0.4, and item 3 has 0.4. Also, please remove items once
extracted."
You are right, if that is what the user wants, numpy.random.choice does the
right thing.

I'm just wondering whether this is actually what users want, and whether
they understand this is what they are getting.

As I said, I expected it to generate pairs with, empirically, the desired
distribution of individual items. The documentation of numpy.random.choice
seemed to me (wrongly) to imply that that's what it does. So I was
surprised to realize that it does not.

Nadav.
j***@gmail.com
2017-01-18 13:53:24 UTC
Post by Nadav Har'El
You are right, if that is what the user wants, numpy.random.choice does
the right thing.
I'm just wondering whether this is actually what users want, and whether
they understand this is what they are getting.
As I said, I expected it to generate pairs with, empirically, the desired
distribution of individual items. The documentation of numpy.random.choice
seemed to me (wrongly) to imply that that's what it does. So I was
surprised to realize that it does not.
As Alessandro and you showed, the function returns something that makes
sense. If the user wants something different, then they need to look for a
different function - which is, however, difficult if the problem doesn't
have a solution in general.

Sounds to me a bit like a Monty Hall problem. Whether we like it or not, or
find it counterintuitive, it is what it is, given the sampling scheme.

Having more sampling schemes would be useful, but it's not possible to
implement sampling schemes with impossible properties.

Josef
j***@gmail.com
2017-01-18 14:30:48 UTC
BTW: sampling 3 out of 3 without replacement is even worse

No matter what sampling scheme and what selection probabilities we use, we
always have every element with probability 1 in the sample.


(Which in survey statistics implies that the sampling error or standard
deviation of any estimate of a population mean or total is zero. Which I
found weird. How can you do statistics and get an estimate that doesn't
have any uncertainty associated with it?)

Josef
Nadav Har'El
2017-01-18 15:12:57 UTC
Post by j***@gmail.com
Having more sampling schemes would be useful, but it's not possible to
implement sampling schemes with impossible properties.
BTW: sampling 3 out of 3 without replacement is even worse
No matter what sampling scheme and what selection probabilities we use, we
always have every element with probability 1 in the sample.
I agree. The random-sample function of the type I envisioned will be able
to reproduce the desired probabilities in some cases (like the example I
gave) but not in others. Because doing this correctly involves a system of n
linear equations in comb(n,k) variables, it can have no solution or many
solutions, depending on n, k, and the desired probabilities. A function of
this sort could return an error if it can't achieve the desired
probabilities.

But in many cases (the 0.2, 0.4, 0.4 example I gave was just something
random I tried) there will be a way to achieve exactly the desired
distribution.

I guess I'll need to write this new function myself :-) because my use case
definitely requires that the distribution of the random items produced
matches the required probabilities (when possible).
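
For what it's worth, the survey-sampling literature does have schemes that
achieve prescribed inclusion probabilities whenever k*p[i] <= 1 for every
item; one classic example is systematic PPS sampling. A minimal sketch
(with the caveat that the joint pair probabilities depend on the ordering
of the items, so in practice one usually shuffles the population first):

import numpy as np

def systematic_sample(p, k, rng):
    # Systematic PPS sampling: returns k distinct indices such that
    # index i is included with probability exactly k * p[i].
    p = np.asarray(p, dtype=float)
    if np.any(k * p > 1):
        raise ValueError("k*p[i] > 1: these marginals are unattainable")
    cum = np.cumsum(k * p)   # total mass is k
    u = rng.uniform()        # one random start in [0, 1)
    return np.searchsorted(cum, u + np.arange(k), side='right')

rng = np.random.RandomState(0)
pairs = [tuple(systematic_sample([0.2, 0.4, 0.4], 2, rng))
         for _ in range(100000)]
# empirically: (0, 1) ~ 0.2, (0, 2) ~ 0.2, (1, 2) ~ 0.6 -- exactly the
# pair distribution derived earlier for this example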

Thanks,
Nadav.
Anne Archibald
2017-01-23 12:27:43 UTC
It seems to me that the basic problem here is that the numpy.random.choice
docstring fails to explain what the function actually does when called with
weights and without replacement. Clearly there are different expectations;
I think numpy.random.choice chose one that is easy to explain and implement
but not necessarily what everyone expects. So the docstring should be
clarified. Perhaps a Notes section:

When numpy.random.choice is called with replace=False and non-uniform
probabilities, the resulting distribution of samples is not obvious.
numpy.random.choice effectively follows the procedure: when choosing the
kth element in a set, the probability of element i occurring is p[i]
divided by the total probability of all not-yet-chosen (and therefore
eligible) elements. This approach is always possible as long as the sample
size is no larger than the population, but it means that the probability
that element i occurs in the sample is not exactly p[i].
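
In code, the effective procedure is roughly the following (a pure-Python
sketch for illustration, not the actual implementation):

import numpy as np

def choice_no_replace(items, k, p, rng):
    # Draw one element at a time; after each draw, renormalise the
    # probabilities of the remaining (not-yet-chosen) elements.
    items = list(items)
    p = np.asarray(p, dtype=float)
    out = []
    for _ in range(k):
        i = rng.choice(len(items), p=p / p.sum())
        out.append(items.pop(i))
        p = np.delete(p, i)
    return out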

Anne
Robert Kern
2017-01-23 14:33:57 UTC

I don't object to some Notes, but I would probably phrase it more like we
are providing the standard definition of the jargon term "sampling without
replacement" in the case of non-uniform probabilities. To my mind (or more
accurately, with my background), "replace=False" obviously picks out the
implemented procedure, and I would have been incredibly surprised if it did
anything else. If the option were named "unique=True", then I would have
needed some more documentation to let me know exactly how it was
implemented.

--
Robert Kern
a***@gmail.com
2017-01-23 14:52:56 UTC
FWIW, I totally agree with Robert.
Nadav Har'El
2017-01-23 15:41:57 UTC
Post by Robert Kern
I don't object to some Notes, but I would probably phrase it more like we
are providing the standard definition of the jargon term "sampling without
replacement" in the case of non-uniform probabilities. To my mind (or more
accurately, with my background), "replace=False" obviously picks out the
implemented procedure, and I would have been incredibly surprised if it did
anything else. If the option were named "unique=True", then I would have
needed some more documentation to let me know exactly how it was
implemented.
FWIW, I totally agree with Robert
With my own background (MSc. in Mathematics), I agree that this algorithm
is indeed the most natural one. And as I said, when I wanted to choose
random combinations (k out of n items) myself, I wrote exactly the same
algorithm. But when it didn't produce the desired probabilities (even in
cases where I knew that doing this was possible), I wrongly assumed numpy
would do things differently - only to realize it uses exactly the same
algorithm. So clearly, the documentation didn't quite explain what it does
or doesn't do.

Also, Robert, I'm curious: beyond explaining why the existing algorithm is
reasonable (with which I agree), could you give me an example of where it is
actually *useful* for sampling?

Let me give you an illustrative counter-example:

Let's imagine a country that has 3 races: 40% Lilliputians, 40%
Blefuscans, and 20% Yahoos (immigrants from a different section of the book
;-)).
Gulliver wants to take a poll, and needs to sample people from all these
races with appropriate proportions.

These races live in different parts of town, so to pick a random person he
needs to first pick one of the races and then a random person from that
part of town.

If he picks one respondent at a time, he uses numpy.random.choice(3,
size=1, p=[0.4,0.4,0.2]) to pick the part of town, and then a person from
that part - he gets the desired 40% / 40% / 20% division of races.

Now imagine that Gulliver can interview two respondents each day, so he
needs to pick two people each time. If he picks 2 choices of part-of-town
*with* replacement, numpy.random.choice(3, size=2,p=[0.4,0.4,0.2]), that's
also fine: he may need to take two people from the same part of town, or
two from two different parts of town, but in any case will still get the
desired 40% / 40% / 20% division between the races of the people he
interviews.

But consider that we are told that if two people from the same race meet in
Gulliver's interview room, the two start chatting between themselves, and
waste Gulliver's time. So he prefers to interview two people of *different*
races. That's sampling without replacement. So he uses
numpy.random.choice(3, size=2, p=[0.4,0.4,0.2], replace=False) to pick two
different parts of town, and one person from each.
But then he looks at his logs, and discovers he actually interviewed the
races at 38.3% / 38.3% / 23.3% proportions - not the 40% / 40% / 20% he wanted.
So the opinions of the Yahoos were over-counted in this poll!
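(You can verify these numbers by enumerating the ordered draws under the scheme numpy implements - a small sketch:)

    from itertools import permutations

    p = [0.4, 0.4, 0.2]
    freq = [0.0, 0.0, 0.0]
    # Under the sequential scheme, an ordered draw (i, j) has probability
    # p[i] * p[j] / (1 - p[i]).
    for i, j in permutations(range(3), 2):
        prob = p[i] * p[j] / (1 - p[i])
        freq[i] += prob / 2   # each pick is one of the day's two interviews
        freq[j] += prob / 2
    print(freq)   # [0.3833..., 0.3833..., 0.2333...] - not [0.4, 0.4, 0.2]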

I know that this is a silly example (made even sillier by the names of
races I used), but I wonder if you could give me an example where the
current behavior of replace=False is genuinely useful.

Not that I'm saying that fixing this problem is easy (I'm still struggling
with it myself in the general case of size < n-1).

Nadav.
Robert Kern
2017-01-23 16:08:29 UTC
Permalink
Post by Nadav Har'El
Post by Robert Kern
I don't object to some Notes, but I would probably phrase it more like
we are providing the standard definition of the jargon term "sampling
without replacement" in the case of non-uniform probabilities. To my mind
(or more accurately, with my background), "replace=False" obviously picks
out the implemented procedure, and I would have been incredibly surprised
if it did anything else. If the option were named "unique=True", then I
would have needed some more documentation to let me know exactly how it was
implemented.
Post by Nadav Har'El
FWIW, I totally agree with Robert
With my own background (MSc. in Mathematics), I agree that this algorithm
is indeed the most natural one. And as I said, when I wanted to choose
random combinations (k out of n items) myself, I wrote exactly the same
algorithm. But when it didn't produce the desired probabilities (even in
cases where I knew that doing this was possible), I wrongly assumed numpy
would do things differently - only to realize it uses exactly the same
algorithm. So clearly, the documentation didn't quite explain what it does
or doesn't do.

In my experience, I have seen "without replacement" mean only one thing. If
the docstring had said "returns unique items", I'd agree that it doesn't
explain what it does or doesn't do. The only issue is that "without
replacement" is jargon, and it is good to recapitulate the definitions of
such terms for those who aren't familiar with them.
Post by Nadav Har'El
Also, Robert, I'm curious: beyond explaining why the existing algorithm
is reasonable (with which I agree), could you give me an example of where
it is actually *useful* for sampling?

The references I previously quoted list a few. One is called "multistage
sampling proportional to size". The idea is that you draw (without
replacement) from the larger units (say, congressional districts) before
sampling within them. It is similar to the situation you outline, but it is
probably more useful at a different scale - lots of larger units (where
your algorithm is likely to provide no solution) rather than a handful.

It is probably less useful in terms of survey design, where you are trying
to *design* a process to get a result, than it is in queueing theory and
related fields, where you are trying to *describe* and simulate a process
that is pre-defined.

--
Robert Kern

Anne Archibald
2017-01-23 15:22:42 UTC
Permalink
Post by Robert Kern
I don't object to some Notes, but I would probably phrase it more like we
are providing the standard definition of the jargon term "sampling without
replacement" in the case of non-uniform probabilities. To my mind (or more
accurately, with my background), "replace=False" obviously picks out the
implemented procedure, and I would have been incredibly surprised if it did
anything else. If the option were named "unique=True", then I would have
needed some more documentation to let me know exactly how it was
implemented.
It is what I would have expected too, but we have a concrete example of a
user who expected otherwise; where one user speaks up, there are probably
more who didn't (some of whom probably have code that's not doing what they
think it does). So for the cost of adding a Note, why not help some of them?

As for the standardness of the definition: I don't know, have you a
reference where it is defined? More natural to me would be to have a list
of items with integer multiplicities (as in: "cat" 3 times, "dog" 1 time).
I'm hesitant to claim ours is a standard definition unless it's in a
textbook somewhere. But I don't insist on my phrasing.
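(To be clear about what I mean by multiplicities - a toy sketch, expanding the multiset and then sampling uniformly without replacement:)

    import numpy as np

    # "cat" 3 times, "dog" once: expand the multiset, then sample
    # uniformly without replacement from the expanded pool.
    pool = ["cat"] * 3 + ["dog"]
    print(np.random.choice(pool, size=2, replace=False))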

Anne
Robert Kern
2017-01-23 15:47:54 UTC
Permalink
Post by Anne Archibald
Post by Robert Kern
I don't object to some Notes, but I would probably phrase it more like
we are providing the standard definition of the jargon term "sampling
without replacement" in the case of non-uniform probabilities. To my mind
(or more accurately, with my background), "replace=False" obviously picks
out the implemented procedure, and I would have been incredibly surprised
if it did anything else. If the option were named "unique=True", then I
would have needed some more documentation to let me know exactly how it was
implemented.
Post by Anne Archibald
It is what I would have expected too, but we have a concrete example of a
user who expected otherwise; where one user speaks up, there are probably
more who didn't (some of whom probably have code that's not doing what they
think it does). So for the cost of adding a Note, why not help some of them?

That's why I said I'm fine with adding a Note. I'm just suggesting a
re-wording so that the cautious language doesn't lead anyone who is
familiar with the jargon to think we're doing something ad hoc while still
providing the details for those who aren't so familiar.
Post by Anne Archibald
As for the standardness of the definition: I don't know, have you a
reference where it is defined? More natural to me would be to have a list
of items with integer multiplicities (as in: "cat" 3 times, "dog" 1 time).
I'm hesitant to claim ours is a standard definition unless it's in a
textbook somewhere. But I don't insist on my phrasing.

Textbook, I'm not so sure, but it is the *only* definition I've ever
encountered in the literature:

http://epubs.siam.org/doi/abs/10.1137/0209009
http://www.sciencedirect.com/science/article/pii/S002001900500298X

--
Robert Kern
Nadav Har'El
2017-01-23 16:08:18 UTC
Permalink
Post by Robert Kern
Post by Anne Archibald
As for the standardness of the definition: I don't know, have you a
reference where it is defined? More natural to me would be to have a list
of items with integer multiplicities (as in: "cat" 3 times, "dog" 1 time).
I'm hesitant to claim ours is a standard definition unless it's in a
textbook somewhere. But I don't insist on my phrasing.
Textbook, I'm not so sure, but it is the *only* definition I've ever
encountered in the literature:
http://epubs.siam.org/doi/abs/10.1137/0209009
Very interesting. This paper (PDF available if you search for its name in
Google) explicitly mentions that one of the uses of this algorithm is
"multistage sampling", which appears to be exactly the same thing as in the
hypothetical Gulliver example I gave in my earlier mail.

And yet, I showed in my mail that this algorithm does NOT reproduce the
desired frequency of the different sampling units...

Moreover, this paper doesn't explain why you need the "without replacement"
for this use case (everything seems easier, and the desired probabilities
are reproduced, with replacement).
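(A quick simulation confirms the with-replacement case - a sketch:)

    import numpy as np

    # With replacement each draw is independent, so per-item frequencies
    # match p (up to sampling noise), even when drawing pairs.
    draws = np.random.choice(3, size=(100000, 2), p=[0.4, 0.4, 0.2])
    print(np.bincount(draws.ravel()) / draws.size)   # ~ [0.4, 0.4, 0.2]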
In my story I gave a funny excuse why "without replacement" might be
warranted, but if you're interested I can tell you a bit about my actual
use case, with a more serious reason why I want without replacement.
Post by Robert Kern
http://www.sciencedirect.com/science/article/pii/S002001900500298X
--
Robert Kern