[Numpy-discussion] scipy.stats.qqplot and scipy.stats.probplot axis labeling

Discussion:

Mark Gawron

2016-06-10 07:06:22 UTC

The scipy.stats.qqplot and scipy.stats.probplot functions plot expected values versus actual data values to visualize the fit to a distribution. First a 1-D array of expected percentiles is generated for a sample of size N; that array is then passed to dist.ppf, the percent point function of the chosen distribution, which returns an array of expected values. The plotted data points are pairs of expected and actual values, and a linear regression on these pairs produces the line that data points from this distribution should lie on.

osr = np.sort(x)
osm_uniform = _calc_uniform_order_statistic_medians(len(x))
osm = dist.ppf(osm_uniform)
slope, intercept, r, prob, sterrest = stats.linregress(osm, osr)
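The four lines above depend on a private helper and on `x` and `dist` from the calling code. A minimal self-contained sketch of the same steps follows. It assumes that the helper implements Filliben's estimate of the uniform order statistic medians (this is an assumption about scipy's private code, not something stated in the thread); the function name `uniform_order_statistic_medians` and the sample data are made up for illustration.

```python
import numpy as np
from scipy import stats


def uniform_order_statistic_medians(n):
    """Filliben's estimate of the medians of uniform order statistics.

    Hypothetical stand-in for the private scipy helper quoted above.
    """
    v = np.empty(n, dtype=float)
    v[-1] = 0.5 ** (1.0 / n)
    v[0] = 1.0 - v[-1]
    i = np.arange(2, n)
    v[1:-1] = (i - 0.3175) / (n + 0.365)
    return v


# Illustrative data: a normal sample with mean 5 and standard deviation 2.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)
dist = stats.norm

osr = np.sort(x)                                      # ordered sample values
osm_uniform = uniform_order_statistic_medians(len(x)) # probabilities in (0, 1)
osm = dist.ppf(osm_uniform)                           # theoretical quantiles
slope, intercept, r, prob, sterrest = stats.linregress(osm, osr)

print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.4f}")
```

For a normal sample the fitted slope and intercept roughly estimate the standard deviation and the mean, because `osm` is expressed in units of the standard normal.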

My question concerns the plot display.

ax.plot(osm, osr, 'bo', osm, slope*osm + intercept, 'r-')

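import numpy as np
# Tick labels -3..2; the final label 3 is appended below.
xt = np.arange(-3, 3, dtype=int)
# Probabilities spaced about 1/6 apart (.001, .168, ..., .836): a lower end
# point plus the 5 quantiles that divide the data into sixths
percentiles = [k*.167 + .502 for k in xt]
# .999 stands in for the upper end point
percentiles = np.array(percentiles + [.999])
# dist and ax are the distribution and the Axes used in the plot call above
vals = dist.ppf(percentiles)
ax.set_xticks(vals)
xt = np.array(list(xt)+[3])
ax.set_xticklabels(xt)
ax.set_xlabel('Quantile')
plt.show()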
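The fragment above relies on `dist`, `ax` and `plt` from its surrounding context. A self-contained sketch of the same idea, assuming a standard normal reference distribution and made-up sample data, could look like this. It uses the public `stats.probplot` to obtain the plotted points and then moves the x ticks to positions that are equally spaced in probability.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Illustrative data and reference distribution (assumptions for this sketch).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
dist = stats.norm

# Points and fitted line as computed by probplot.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist=dist)

fig, ax = plt.subplots()
ax.plot(osm, osr, "bo", osm, slope * osm + intercept, "r-")

# Probabilities that split the distribution into sixths; .001 and .999
# replace 0 and 1, where the normal ppf is infinite.
probs = np.array([0.001, 1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6, 0.999])
tick_positions = dist.ppf(probs)

ax.set_xticks(tick_positions)
ax.set_xticklabels(["-3", "-2", "-1", "0", "1", "2", "3"])
ax.set_xlabel("Quantile")
fig.savefig("probplot_quantile_ticks.png")
```

The ticks end up unevenly spaced on the axis: they crowd together near the center of the distribution and spread out in the tails.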

I've attached two images to show the difference between the current visualization and the suggested one.

Mark Gawron

Ralf Gommers

2016-06-11 12:53:13 UTC

Hi Mark,

Note that the scipy-dev or scipy-user mailing list would have been more
appropriate for this question.

Post by Mark Gawron
The x-axis of the resulting plot is labeled quantiles, but the xticks and
xticklabels produced by qqplot and probplot do not seem correct for their
intended interpretations. First, the numbers on the x-axis do not
represent quantiles; the intervals between them do not in general
contain equal numbers of points. For a normal distribution with sigma=1,
they represent standard deviations. Changing the label on the x-axis does
not seem like a very good solution, because the interpretation of the
values on the x-axis will be different for different distributions. Rather,
the right solution seems to be to actually show quantiles on the x-axis.
The numbers on the x-axis can stay as they are, representing quantile
indexes, but they need to be spaced so as to show the actual division
points that carve the population up into groups of the same size. This
can be done in something like the following way.

The ticks are correct I think, but they're theoretical quantiles and not
sample quantiles. This was discussed in [1] and is consistent with R [2]
and statsmodels [3]. I see that we just forgot to add "theoretical" to the
x-axis label (mea culpa). Does adding that resolve your concern?

[1] https://github.com/scipy/scipy/issues/1821
[2] http://data.library.virginia.edu/understanding-q-q-plots/
[3] http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html?highlight=qqplot#statsmodels.graphics.gofplots.qqplot
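The distinction between theoretical and sample quantiles can be illustrated with a short sketch. The data and the chosen probabilities are assumptions made up for this example. Theoretical quantiles come from the reference distribution alone (`ppf`), while sample quantiles are computed from the data.

```python
import numpy as np
from scipy import stats

# Illustrative sample (an assumption for this sketch).
rng = np.random.default_rng(1)
x = rng.normal(size=500)

probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])

theoretical = stats.norm.ppf(probs)  # depends only on the distribution
sample = np.quantile(x, probs)       # depends only on the data

for p, t, s in zip(probs, theoretical, sample):
    print(f"p={p:.2f}  theoretical={t:+.3f}  sample={s:+.3f}")
```

Both columns are in units of the random variable. In a probability plot the x-axis carries values of the first kind, which is why "Theoretical quantiles" is the more precise axis label.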

Ralf

j***@gmail.com

2016-06-11 17:03:26 UTC

As a related link:
http://phobson.github.io/mpl-probscale/tutorial/closer_look_at_viz.html

Paul Hobson has done a lot of work on getting different probability
scales attached to pp-plots or generalized versions of probability plots. I
think qqplots are less ambiguous because they are on the original or
standardized scale.

I haven't worked my way through the various interpretations of the probability
axis yet because I find it "not obvious". It might be easier for fields
that have a tradition of using probability papers.

It's planned to be added to the statsmodels probability plots so that there
will be a large choice of axis labels and scales.

Josef

_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion

Mark Gawron

2016-06-11 18:49:20 UTC

Thanks, Josef. This is very helpful. And I will direct this
to one of the other mailing lists, once I read the previous posts.

Regarding your remark: Maybe I'm having a terminology problem. It seems to me once you do

osm = dist.ppf(osm_uniform)

j***@gmail.com

2016-06-11 19:24:03 UTC

Post by Mark Gawron
Thanks, Josef. This is very helpful. And I will direct this
to one of the other mailing lists, once I read the previous posts.
Regarding your remark: Maybe I'm having a terminology problem. It seems to me once you do
osm = dist.ppf(osm_uniform)
you're back in the value space for the particular distribution. So this
gives you known probability intervals, but not UNIFORM probability
intervals (the interval between 0 and 1 STD covers a bigger prob interval
than the interval between 1 and 2). And the idea of a quantile is
that it's a division point in a UNIFORM division of the probability axis.

Yes and no. Quantiles, i.e. what you get from ppf, are in units of the random
variable. So the axis is on the scale of the random variable, not on a
probability scale, and the axis labels are in units of the random variable.

pp-plots have probabilities on the axes and are uniformly scaled in
probabilities but non-uniform in the values of the random variable.

The part that is difficult to follow is when the plot is drawn uniform in one
scale, but the axes are labeled non-uniformly in the other scale. That's what
Paul's probscale does and what you have in mind, AFAIU.
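The two scales can be compared numerically. The following sketch uses only the Python standard library and assumes a standard normal distribution. It shows that equal steps in value cover unequal probability mass, and that equal steps in probability map to unequal steps in value.

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mu=0, sigma=1

# Equal steps on the value axis cover unequal probability mass.
mass_0_to_1 = nd.cdf(1.0) - nd.cdf(0.0)
mass_1_to_2 = nd.cdf(2.0) - nd.cdf(1.0)
print(f"P(0 < X < 1) = {mass_0_to_1:.4f}")
print(f"P(1 < X < 2) = {mass_1_to_2:.4f}")

# Equal steps on the probability axis map to unequal steps in value.
probs = [i / 6 for i in range(1, 6)]
values = [nd.inv_cdf(p) for p in probs]
gaps = [b - a for a, b in zip(values, values[1:])]
for p, v in zip(probs, values):
    print(f"p = {p:.3f} -> value = {v:+.3f}")
print("gaps between consecutive values:", [round(g, 3) for g in gaps])
```

A qq-plot is drawn and labeled on the value scale (the `values` column), while a pp-plot is drawn and labeled on the probability scale (the `probs` column).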

Josef

Mark Gawron

2016-06-11 19:31:11 UTC

Ok,

Our messages crossed. I understand now.

Thanks.

Mark

Ralf Gommers

2016-06-12 10:18:29 UTC

Sent a PR for this: https://github.com/scipy/scipy/pull/6249

Ralf
