Discussion:
[Numpy-discussion] numpy.power -> numpy.random.choice Probabilities don't sum to 1
Ryan R. Rosario
2015-12-18 21:25:10 UTC
Permalink
Hi,

I have a matrix whose entries I must raise to a certain power and then normalize by row. After I do that, when I pass some rows to numpy.random.choice, I get a ValueError: probabilities do not sum to 1.

I understand that floating point is not perfect, and my matrix is so large that I cannot use np.longdouble because I will run out of RAM.

As an example on a smaller matrix:

np.power(mymatrix, 10, out=mymatrix)
row_normalized = np.apply_along_axis(lambda x: x / np.sum(x), 1, mymatrix)
sums = row_normalized.sum(axis=1)
sums[np.where(sums != 1)]

array([ 0.99999994, 0.99999994, 1.00000012, ..., 0.99999994,
0.99999994, 0.99999994], dtype=float32)

np.random.choice(range(row_normalized.shape[0]), 1, p=row_normalized[0, :])

ValueError: probabilities do not sum to 1


I also tried the normalize function in sklearn.preprocessing and have the same problem.

Is there a way to avoid this problem without having to make manual adjustments to get the row sums to = 1?

— Ryan
Nathaniel Smith
2015-12-19 01:00:05 UTC
Permalink
Post by Ryan R. Rosario
Hi,
I have a matrix whose entries I must raise to a certain power and then normalize by row. After I do that, when I pass some rows to numpy.random.choice, I get a ValueError: probabilities do not sum to 1.
I understand that floating point is not perfect, and my matrix is so large that I cannot use np.longdouble because I will run out of RAM.
np.power(mymatrix, 10, out=mymatrix)
row_normalized = np.apply_along_axis(lambda x: x / np.sum(x), 1, mymatrix)
I'm sorry I don't have a solution to your actual problem off the top
of my head, but it's probably helpful in general to know that a better
way to write this would be just

row_normalized = mymatrix / np.sum(mymatrix, axis=1, keepdims=True)

apply_along_axis is slow and can almost always be replaced by a
broadcasting expression like this.
Post by Ryan R. Rosario
sums = row_normalized.sum(axis=1)
sums[np.where(sums != 1)]
And here you can just write

sums[sums != 1]

i.e. the call to where() isn't doing anything useful.

-n
--
Nathaniel J. Smith -- http://vorpus.org
Andy Ray Terrel
2015-12-19 12:55:25 UTC
Permalink
A simple fix would certainly by pass the check in random.choice, but I
don't know how to get that. So let's focus on the summation.

I believe you are hitting an instability in summing small numbers as a
power to 10th order would produce. Here is an example:

mymatrix = np.random.rand(1024,1024).astype('float16')*1e-7
row_normalized = mymatrix / np.sum(mymatrix, axis=1, keepdims=True)
sums = row_normalized.sum(axis=1)
len(sums[sums != 1]) # -> 108

One can use things like Kahan summation and you will need to collect the
size of the error and truncate all numbers in mymatrix under that error.
I'm not quite sure how to quickly implement such a thing in numpy without a
loop.
Post by Ryan R. Rosario
Post by Ryan R. Rosario
Hi,
I have a matrix whose entries I must raise to a certain power and then
normalize by row. After I do that, when I pass some rows to
numpy.random.choice, I get a ValueError: probabilities do not sum to 1.
Post by Ryan R. Rosario
I understand that floating point is not perfect, and my matrix is so
large that I cannot use np.longdouble because I will run out of RAM.
Post by Ryan R. Rosario
np.power(mymatrix, 10, out=mymatrix)
row_normalized = np.apply_along_axis(lambda x: x / np.sum(x), 1,
mymatrix)
I'm sorry I don't have a solution to your actual problem off the top
of my head, but it's probably helpful in general to know that a better
way to write this would be just
row_normalized = mymatrix / np.sum(mymatrix, axis=1, keepdims=True)
apply_along_axis is slow and can almost always be replaced by a
broadcasting expression like this.
Post by Ryan R. Rosario
sums = row_normalized.sum(axis=1)
sums[np.where(sums != 1)]
And here you can just write
sums[sums != 1]
i.e. the call to where() isn't doing anything useful.
-n
--
Nathaniel J. Smith -- http://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Sebastian Berg
2015-12-19 16:17:22 UTC
Permalink
Post by Andy Ray Terrel
A simple fix would certainly by pass the check in random.choice, but I
don't know how to get that. So let's focus on the summation.
I believe you are hitting an instability in summing small numbers as a
mymatrix = np.random.rand(1024,1024).astype('float16')*1e-7
row_normalized = mymatrix / np.sum(mymatrix, axis=1, keepdims=True)
sums = row_normalized.sum(axis=1)
len(sums[sums != 1]) # -> 108
One can use things like Kahan summation and you will need to collect
the size of the error and truncate all numbers in mymatrix under that
error. I'm not quite sure how to quickly implement such a thing in
numpy without a loop.
In fact, the code even seems to do kahan summation, however, I think it
always assumes double precision for the p keyword argument, so as a work
around at least, you have to sum to convert to and normalize it as
double.

- Sebastian
Post by Andy Ray Terrel
On Fri, Dec 18, 2015 at 1:25 PM, Ryan R. Rosario
Post by Ryan R. Rosario
Hi,
I have a matrix whose entries I must raise to a certain
power and then normalize by row. After I do that, when I pass
probabilities do not sum to 1.
Post by Ryan R. Rosario
I understand that floating point is not perfect, and my
matrix is so large that I cannot use np.longdouble because I
will run out of RAM.
Post by Ryan R. Rosario
np.power(mymatrix, 10, out=mymatrix)
row_normalized = np.apply_along_axis(lambda x: x /
np.sum(x), 1, mymatrix)
I'm sorry I don't have a solution to your actual problem off the top
of my head, but it's probably helpful in general to know that a better
way to write this would be just
row_normalized = mymatrix / np.sum(mymatrix, axis=1,
keepdims=True)
apply_along_axis is slow and can almost always be replaced by a
broadcasting expression like this.
Post by Ryan R. Rosario
sums = row_normalized.sum(axis=1)
sums[np.where(sums != 1)]
And here you can just write
sums[sums != 1]
i.e. the call to where() isn't doing anything useful.
-n
--
Nathaniel J. Smith -- http://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Loading...