Discussion:
[Numpy-discussion] Proposal to add `weights` to `np.percentile` and `np.median`
Joseph Fox-Rabinovitz
2016-02-16 05:49:42 UTC
I would like to add a `weights` keyword to `np.partition`,
`np.percentile` and `np.median`. My reason for doing so is to allow
`np.histogram` to perform automatic bin selection with weights.
Currently, weights are not supported for automatic bin selection,
and they would be difficult to support in `auto` mode unless
`np.percentile` accepted a `weights` keyword. I suspect that there are
many other uses for such a feature.

I have taken a preliminary look at the C implementation of the
partition functions that are the basis for `partition`, `median` and
`percentile`. I think that it would be possible to add versions (or
just extend the functionality of existing ones) that check the ratio
of the weights below the partition point to the total sum of the
weights instead of just counting elements.

One of the main advantages of such an implementation is that it would
allow any real weights to be handled correctly, not just integers.
Complex weights would not be supported.
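
To make the idea concrete, here is a minimal sketch in Python/NumPy, not the C partition code itself. The name `weighted_select` and its signature are hypothetical, and the convention it uses (return the smallest value whose cumulative weight reaches the requested fraction of the total, with no interpolation) is an assumption made for illustration. It follows a quickselect-style loop, but decides which side of the pivot to keep by comparing summed weights against the target instead of counting elements.

```python
import numpy as np


def weighted_select(values, weights, q):
    """Smallest value whose cumulative weight fraction reaches q (0 < q <= 1).

    Hypothetical helper for illustration only. Assumes 1-D input without
    NaNs and real, non-negative weights with a positive total.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if values.ndim != 1 or values.shape != weights.shape or values.size == 0:
        raise ValueError("values and weights must be non-empty 1-D arrays of equal length")
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")

    target = q * weights.sum()  # weight that must lie at or below the answer
    below = 0.0                 # weight already discarded on the low side

    while True:
        pivot = values[values.size // 2]
        lt = values < pivot
        eq = values == pivot
        w_lt = weights[lt].sum()
        w_eq = weights[eq].sum()

        if below + w_lt >= target:
            # Enough weight sits strictly below the pivot: recurse left.
            values, weights = values[lt], weights[lt]
        elif below + w_lt + w_eq >= target:
            # The pivot's own weight carries the total across the target.
            return pivot
        else:
            gt = values > pivot
            if not gt.any():
                # Only reachable through floating-point rounding near q == 1.
                return pivot
            below += w_lt + w_eq
            values, weights = values[gt], weights[gt]


# Integer weights behave like repeated elements; fractional weights also work.
print(weighted_select([1, 2, 3], [3, 1, 1], 0.5))
print(weighted_select([10, 20, 30], [0.25, 0.5, 0.25], 0.25))
```

Because only weight sums are compared, nothing in the loop depends on the weights being integers, which is the advantage described above.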

The purpose of this email is to see if anybody objects, has ideas or
cares at all about this proposal before I spend a significant amount
of time working on it. For example, did I miss any functions in my
list?

Regards,

-Joe
Antony Lee
2016-02-16 18:32:24 UTC
See earlier discussion here: https://github.com/numpy/numpy/issues/6326
Basically, naïvely sorting may be faster than a not-so-optimized version of
quickselect.
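
For comparison, the sort-based route can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `weighted_percentile_sorted` is hypothetical, and it returns the smallest value whose cumulative weight reaches `q` percent of the total, without the linear interpolation that `np.percentile` applies by default.

```python
import numpy as np


def weighted_percentile_sorted(values, weights, q):
    """Sort-based weighted percentile, q in [0, 100], no interpolation.

    Hypothetical helper for illustration; assumes 1-D input and
    non-negative weights with a positive sum.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values, kind="stable")  # full sort of the data
    cum = np.cumsum(weights[order])            # running weight in sorted order
    target = q / 100.0 * cum[-1]
    # First position at which the running weight reaches the target.
    idx = np.searchsorted(cum, target, side="left")
    idx = min(idx, values.size - 1)            # guard against rounding at q == 100
    return values[order][idx]


print(weighted_percentile_sorted([3, 1, 2], [1, 3, 1], 50))
```

The cost is dominated by the `argsort` call, so how this compares with a selection-based version depends on how well each is optimized.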

Antony
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Joseph Fox-Rabinovitz
2016-02-16 18:41:30 UTC
Thanks for pointing me to that. I had something a bit different in
mind but that definitely looks like a good start.
j***@gmail.com
2016-02-16 19:39:42 UTC
statsmodels just got weighted quantiles
https://github.com/statsmodels/statsmodels/pull/2707

I didn't try to figure out its computational efficiency, and we would
gladly delegate to whatever fast algorithm would be in numpy.

Josef
Joseph Fox-Rabinovitz
2016-02-16 19:48:26 UTC
Please correct me if I misunderstood, but the code in that commit is
doing a full sort, somewhat similar to what
`scipy.stats.scoreatpercentile` does. If that is correct, I will run some
benchmarks first, but I think there is value in going forward with a
numpy version that extends the current partitioning scheme.
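
A benchmark of this kind could start from something like the sketch below, which times the unweighted case only. The helper names (`kth_by_sort`, `kth_by_partition`, `compare`) and the default array size are assumptions chosen for illustration; actual timings depend on the machine, the NumPy version and the input data, so no numbers are claimed here.

```python
import timeit

import numpy as np


def kth_by_sort(a, k):
    """k-th smallest element via a full sort."""
    return np.sort(a)[k]


def kth_by_partition(a, k):
    """k-th smallest element via np.partition (selection)."""
    return np.partition(a, k)[k]


def compare(n=100_000, repeats=5, seed=0):
    """Return the best-of-`repeats` wall time in seconds for each approach."""
    rng = np.random.default_rng(seed)
    a = rng.random(n)
    k = n // 2
    t_sort = min(timeit.repeat(lambda: kth_by_sort(a, k), number=1, repeat=repeats))
    t_part = min(timeit.repeat(lambda: kth_by_partition(a, k), number=1, repeat=repeats))
    return t_sort, t_part


if __name__ == "__main__":
    t_sort, t_part = compare()
    print(f"sort: {t_sort:.6f} s, partition: {t_part:.6f} s")
```

A weighted comparison would swap in weighted versions of the two helpers while keeping the same timing harness.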

- Joe
j***@gmail.com
2016-02-16 20:22:35 UTC
I think so, but it's hiding inside pandas groupby, which also uses a hash,
IIUC.
AFAICS, the main reason it's implemented this way is to get correct tie
handling.

There could be large performance differences depending on whether there are
many ties (discretized data) or only unique floats.

(just guessing)
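
One way to picture the effect of ties is the sketch below. It is not the statsmodels or pandas implementation; the helper name `collapse_ties` is hypothetical. It merges tied values into one entry each and sums their weights, so any later sort or selection step works on the number of distinct values instead of the number of samples.

```python
import numpy as np


def collapse_ties(values, weights):
    """Merge duplicate values, summing the weights of each group of ties.

    Hypothetical helper for illustration; returns the sorted unique values
    and the total weight attached to each of them.
    """
    values = np.asarray(values, dtype=float).ravel()
    weights = np.asarray(weights, dtype=float).ravel()
    unique_values, inverse = np.unique(values, return_inverse=True)
    # inverse maps every sample to the index of its unique value.
    summed = np.bincount(inverse, weights=weights, minlength=unique_values.size)
    return unique_values, summed


u, w = collapse_ties([2, 1, 2, 1, 3], [1, 2, 3, 4, 5])
print(u, w)
```

With discretized data the collapsed arrays can be much shorter than the input; with only unique floats the step does extra work and shortens nothing.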

Josef