Theoretical and empirical distributions of the p value

Butler, J. S.; Jones, Peter

doi:10.1007/s40300-017-0130-2

Published: 11 December 2017

METRON, Volume 76, pages 1–30 (2018)

Abstract

The use of p values in null hypothesis statistical tests (NHST) has long been controversial in applied statistics, owing to a number of problems: arbitrary levels of Type I error, failure to trade off Type I and Type II error, misunderstanding of p values, failure to report effect sizes, and neglect of better means of reporting estimates of policy impacts, such as effect sizes, interpreted confidence intervals, and conditional frequentist tests. This paper analyzes the theory of p values and summarizes the problems with NHST. Using a large data set of public school districts in the United States, we demonstrate empirically the unreliability of p values and hypothesis tests as predicted by the theory. We offer specific suggestions for reporting policy research.

Notes

We included only school districts categorized by the NCES as a “local school district” or a “local school district component of a supervisory union”. These districts accounted for over 90% of all public school enrollments and contained the majority of operating public schools. Of importance to this study, these districts received a substantial portion of the total local, state and federal revenues for public schools.

References

1. Abelson, R.P.: On the surprising longevity of flogged horses: why there is a case for the significance test. Psychol. Sci. 8, 12–20 (1997)
2. Barnett, V.: Comparative Statistical Inference, 2nd edn. Wiley, Hoboken (1982)
3. Bayarri, M.J., Berger, J.O.: The interplay of Bayesian and frequentist analysis. Stat. Sci. 19, 58–80 (2004)
4. Bayarri, M.J., Berger, J.O.: P values for composite null models. J. Am. Stat. Assoc. 95, 1127–1142 (2000)
5. Berger, J.O.: Could Fisher, Jeffreys and Neyman have agreed on testing? Stat. Sci. 18, 1–12 (2003)
6. Berger, J.O., Sellke, T.: Testing a point null hypothesis: the irreconcilability of p values and evidence. J. Am. Stat. Assoc. 82, 112–122 (1987)
7. Berkson, J.: Tests of significance considered as evidence. J. Am. Stat. Assoc. 37, 325–335 (1942)
8. Casella, G., Berger, R.L.: Reconciling Bayesian and frequentist evidence in the one-sided testing problem. J. Am. Stat. Assoc. 82, 106–111 (1987)
9. Cochran, W.G.: Sampling Techniques, 3rd edn. Wiley, New York (1977)
10. Cowles, M., Davis, C.: On the origins of the .05 level of statistical significance. Am. Psychol. 37, 553–558 (1982)
11. Cox, D.R.: The role of significance tests. Scand. J. Stat. 49–63 (1977)
12. Cumming, G.: The new statistics: why and how. Psychol. Sci. 25, 7–29 (2014)
13. Davidson, R., MacKinnon, J.G.: Estimation and Inference in Econometrics. Oxford University Press, Oxford (1993)
14. Durbin, J.: Estimation of parameters in time-series regression models. J. Roy. Stat. Soc. B 22, 139–153 (1960)
15. Gigerenzer, G.: 2014: What scientific idea is ready for retirement? http://edge.org/response-detail/25462. Accessed 31 Aug 2014
16. Gravetter, F.J., Wallnau, L.B.: Statistics for the Behavioral Sciences, Chapter 8, “Introduction to Hypothesis Testing”. Cengage Learning, Wadsworth (2009)
17. Hamilton, J.D.: Time Series Analysis. Princeton University Press, Princeton (1994)
18. Harris, R.J.: Significance tests have their place. Psychol. Sci. 8, 8–11 (1997)
19. Hodgson, R.T.: The problem of being a normal deviate. Am. J. Phys. 47, 1092–1093 (1979)
20. Hunter, J.E.: Needed: a ban on the significance test. Psychol. Sci. 8, 3–7 (1997)
21. Kelly, M.: Emily Dickinson and the monkeys on the stair Or: what is the significance of the 5% significance level. Significance 10, 21–22 (2013)
22. Kendall, M.G., Stuart, A., Ord, J.K.: The Advanced Theory of Statistics, vol. III. Griffin, London (1987)
23. Kiefer, J.: Lecture Notes on Statistical Inference, mimeographed, n.d., n.p., obtained from the Cornell University Department of Mathematics (1979)
24. Killeen, P.R.: An alternative to null-hypothesis significance tests. Psychol. Sci. 16, 345–353 (2005)
25. MacKinnon, D.P., Fairchild, A.J.: Current directions in mediation analysis. Curr. Dir. Psychol. Sci. 18, 16–20 (2009)
26. Morrison, D.E., Henkel, R.E. (eds.): The Significance Test Controversy: A Reader. Butterworth, London (1970)
27. Poole, C.: Beyond the confidence interval. Am. J. Public Health 77, 195–199 (1987)
28. Robins, J.M., van der Vaart, A., Ventura, V.R.: Asymptotic distribution of p values in composite null models. J. Am. Stat. Assoc. 95, 1143–1156 (2000)
29. Royall, R.: Statistical Evidence: A Likelihood Paradigm, vol. 71. CRC Press, Boca Raton (1997)
30. Royall, R.: On the probability of observing misleading statistical evidence. J. Am. Stat. Assoc. 95, 760–768 (2000)
31. Scarr, S.: Rules of evidence: a larger context for the statistical debate. Psychol. Sci. 8, 16–17 (1997)
32. Schervish, M.J.: P values: what they are and what they are not. Am. Stat. 50, 203–206 (1996)
33. Sellke, T., Bayarri, M.J., Berger, J.O.: Calibration of \(p\) values for testing precise null hypotheses. Am. Stat. 55, 62–71 (2001)
34. Spjøtvoll, E.: Discussion of D.R. Cox’s paper. Scand. J. Stat. 63–66 (1977)
35. U.S. Department of Commerce: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, Applied Mathematics Series Volume 55 (1966)
36. Wellek, S.: A critical evaluation of the current “p-value controversy”. Biometr. J. (2017)
37. van Zwet, W.R.: Discussion of D.R. Cox’s paper. Scand. J. Stat. 67 (1977)
38. Yates, F.: The influence of statistical methods for research workers on the development of the science of statistics. J. Am. Stat. Assoc. 46, 19–34 (1951)
39. Ziliak, S.T., McCloskey, D.N.: The cult of statistical significance. JSM conference paper, Section on Economic Education (2009)
40. Ziliak, S.T., McCloskey, D.N.: The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. The University of Michigan Press, Ann Arbor (2008)
41. Hempel, C.: Contributions to the Logical Analysis of the Concept of Probability, Ph.D. thesis, Friedrich Wilhelm University of Berlin, in German, translated by the author (1934)
42. von Mises, R.: Probability, Statistics, and Truth. Dover Publications, Inc, New York (1981). (reprint of earlier editions, the earliest being 1928)

Author information

Authors and Affiliations

Martin School of Public Policy and Administration, University of Kentucky, Lexington, USA
J. S. Butler
Department of Government, University of Alabama at Birmingham, Birmingham, USA
Peter Jones

Corresponding author

Correspondence to Peter Jones.

Additional information

The authors thank Amy Sibulkin for asking questions which led to this paper, participants in a seminar at the University of Kentucky, and an anonymous referee for additions which are significant using either frequentist or Bayesian methods.

Appendices

Appendix A: Asymmetrical confidence intervals

Confidence intervals for Normally distributed estimators are symmetric and are constructed in the familiar way taught in statistics courses. By the delta method, continuous and differentiable functions of Normally distributed estimators are also approximately Normal in large enough samples, but the sample size might need to be very large. For example, chi square distributions are distributions of sums of squares of standard Normals, and a chi square distribution approaches the Normal only as the number of degrees of freedom goes to infinity. Ratios of Normally distributed estimators are technically Cauchy-like because of the non-zero density at zero in the denominator. The precision must increase to the point that the denominator has no probability large enough to matter of being at zero, e.g. male adult American height is distributed Normal with a mean of 68 inches and a variance of 4 squared inches, so the mean is 34 standard deviations from 0. For estimators, that degree of precision might be difficult to achieve. The probability distribution of ratios of random variables can be Cauchy, with no finite mean or variance [19].

Products of estimators can arise in various ways, e.g. mediation models in psychology [25] or autoregressive disturbances in macroeconomics, i.e. AR(1) as estimated by Durbin [14], for which see Hamilton [17, p. 226]. Exponentiation of logit or hazard model coefficients can also result in non-Normal distributions unless the sample size is very large. The correct distribution is lognormal, which approaches the Normal as the underlying variance becomes small, i.e. as the sample size becomes large.
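The lognormal claim can be made concrete with a short worked statement. The symbols \(b\), \(\beta\), and \(s\) are notation assumed here for illustration (an estimated coefficient, its true value, and its standard error); they do not appear in the original text.

\[
b \sim N(\beta, s^2) \;\Longrightarrow\; e^{b} \sim \mathrm{Lognormal}(\beta, s^2), \qquad
E\left[e^{b}\right] = e^{\beta + s^2/2}, \qquad
\mathrm{Var}\left(e^{b}\right) = \left(e^{s^2} - 1\right) e^{2\beta + s^2},
\]
\[
\mathrm{skewness}\left(e^{b}\right) = \left(e^{s^2} + 2\right)\sqrt{e^{s^2} - 1} \;\to\; 0 \quad \text{as } s \to 0 .
\]

For small \(s\) the standard deviation of \(e^{b}\) is approximately \(e^{\beta} s\), which is the delta-method standard error, and the vanishing skewness is one way to see why the Normal approximation becomes adequate only as the sample size grows.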

If the confidence interval is asymmetric, it should be chosen to minimize the width, which equalizes the probability density function at the two ends, while fixing the total probability at 95% or whatever arbitrary value is intended by the researcher. That can be done numerically today, but special computer programs are required. Symmetric confidence intervals for non-linear functions of estimators are inefficiently wide. Table 4 shows asymmetric confidence intervals for the standard problem of exponentiating a Normally distributed estimated parameter. The asymmetric intervals are closer to zero, and they are only slightly shorter than the symmetric intervals if the standard error is small but much shorter if the standard error is large.
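The minimum-width construction can be sketched numerically. The following is a minimal illustration, not the authors' program: it assumes the estimator on the log scale is exactly Normal with mean `mu` and standard error `sigma` (both hypothetical values), searches over the lower tail probability for the shortest 95% interval of the exponentiated estimator, and compares it with an equal-tailed interval. The equal-tailed interval is used here only as a convenient benchmark and may differ from the symmetric intervals reported in Table 4.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()

def lognormal_pdf(y, mu, sigma):
    """Density of exp(X) at y, where X ~ Normal(mu, sigma^2)."""
    return STD_NORMAL.pdf((log(y) - mu) / sigma) / (sigma * y)

def equal_tailed_interval(mu, sigma, level=0.95):
    """Interval with the same probability cut from each tail."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return exp(mu - sigma * z), exp(mu + sigma * z)

def shortest_interval(mu, sigma, level=0.95, tol=1e-10):
    """Shortest interval with the given coverage, found by ternary search
    over the lower tail probability a, with 0 < a < 1 - level."""
    def endpoints(a):
        return (exp(mu + sigma * STD_NORMAL.inv_cdf(a)),
                exp(mu + sigma * STD_NORMAL.inv_cdf(a + level)))

    def width(a):
        lo, hi = endpoints(a)
        return hi - lo

    left, right = 1e-12, (1.0 - level) - 1e-12
    while right - left > tol:
        m1 = left + (right - left) / 3
        m2 = right - (right - left) / 3
        if width(m1) < width(m2):
            right = m2
        else:
            left = m1
    return endpoints((left + right) / 2)

# Hypothetical inputs: effect size 0 on the log scale, two standard errors.
for sigma in (0.1, 1.0):
    lo, hi = shortest_interval(0.0, sigma)
    eq_lo, eq_hi = equal_tailed_interval(0.0, sigma)
    print(f"sigma={sigma}: shortest=({lo:.3f}, {hi:.3f}) "
          f"equal-tailed=({eq_lo:.3f}, {eq_hi:.3f})")
```

At the solution the density is (numerically) equal at the two endpoints, which is the condition described above, and the lower endpoint lies closer to zero than the equal-tailed one because the lognormal density is right-skewed.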

When asymptotic distributions are asymmetric, confidence intervals do not correspond to p values in the usual calculation from Normal (bell) curves [34].

Using resampling variances, the confidence intervals can be computed from the empirical distribution of estimates. Using Bayesian variances, the confidence intervals can be computed from the posterior probability distribution of the parameter.
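As one possible reading of the resampling approach, the sketch below computes a percentile bootstrap interval for an exponentiated mean. It is an illustration under stated assumptions, not the procedure used in the paper: the data are simulated, the statistic `exp(mean)` stands in for an exponentiated coefficient, and the percentile method is only one of several bootstrap interval constructions.

```python
import random
from math import exp
from statistics import fmean

def percentile_bootstrap_ci(data, stat, level=0.95, reps=2000, seed=0):
    """Percentile interval taken from the empirical distribution of the
    statistic over bootstrap resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(reps))
    alpha = (1 - level) / 2
    lo_index = int(alpha * reps)
    hi_index = min(reps - 1, round((1 - alpha) * reps) - 1)
    return estimates[lo_index], estimates[hi_index]

def exp_mean(sample):
    """Exponentiated sample mean, a stand-in for an exponentiated coefficient."""
    return exp(fmean(sample))

# Simulated (hypothetical) data on the log scale.
_rng = random.Random(1)
data = [_rng.gauss(0.3, 1.0) for _ in range(40)]

point = exp_mean(data)
ci_low, ci_high = percentile_bootstrap_ci(data, exp_mean)
print(f"estimate={point:.3f}, 95% percentile interval=({ci_low:.3f}, {ci_high:.3f})")
```

Because the interval is read directly from the empirical distribution of the resampled estimates, it is not forced to be symmetric around the point estimate.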

Table 4 Symmetric and asymmetric confidence intervals for an exponentiated estimator. The effect size is irrelevant as both are scaled by its exponentiated value; 0 assumed here. The table is based on the s.e. of the exponentiated estimator.

Appendix B: Existent and hypothetical populations, or finite and infinite collectives

Policy analysis can proceed by collecting data on all charter schools or all environmental laws in a state or nation or all wars over a period of time. This can be construed as a population, but under that interpretation, there appears to be no uncertainty at all about the statistics or relationships in the data. Measurement error and omitted variables would continue to be explanations for standard errors, but neither of those is a satisfactory basis for evaluating policy analysis, as the first invalidates the data, and the second invalidates the model. There is a philosophical problem of justifying the application of inferential statistical formulae to an apparently complete census of a population.

If the sample really is a substantial subset of the population, the variances of sample means, and by implication of moments and maximum likelihood estimators, are subject to the finite population correction (Cochran [9, Sect. 2.6, pp. 24–25]). If the sample size is n as usual while the population size is N, the variance of the sample mean is reduced by the factor \((N - n)/N\) when the population variance is defined with divisor \(N - 1\), or equivalently by \((N - n)/(N - 1)\) when it is defined with divisor \(N\). In the latter convention, the standard deviation is reduced by \(\sqrt{(N - n)/(N - 1)}\), and the covariance is reduced by a factor of \((N - n)/(N - 1)\). This can be safely ignored if the sample is a small part of the population, i.e. n/N is small, but under the interpretation that the present population of people, places, or things is all that matters, the finite population correction factor should always be applied, which would eliminate the variance of fixed effects for states in most policy studies. Ignoring the finite population correction factor would then be a standard mistake.
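The correction can be checked by brute force on a small population. The sketch below is illustrative only: the five-unit population is hypothetical, and the function name is not from the paper. It enumerates every simple random sample without replacement, computes the exact variance of the sample mean, and compares it with the corrected formula using the divisor \(N - 1\) convention.

```python
from itertools import combinations
from statistics import fmean, pvariance, variance

def fpc_variance_of_mean(pop_variance, n, N):
    """Variance of the sample mean under simple random sampling without
    replacement; pop_variance is the population variance with divisor N - 1."""
    return (pop_variance / n) * (N - n) / N

# Hypothetical five-unit population, small enough to enumerate every sample.
population = [1, 2, 3, 4, 5]
N, n = len(population), 2

# Exact variance of the sample mean across all equally likely samples of size n.
sample_means = [fmean(s) for s in combinations(population, n)]
exact = pvariance(sample_means)

formula = fpc_variance_of_mean(variance(population), n, N)
uncorrected = variance(population) / n
census = fpc_variance_of_mean(variance(population), N, N)

print(exact, formula, uncorrected, census)
```

When n equals N the corrected variance is zero, which is the sense in which a complete census leaves no sampling uncertainty under the existent-population interpretation.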

The philosophical interpretation that eliminates the problem in general is based on a distinction between existent and hypothetical populations (Kendall et al. [22, Sect. 1.29, pp. 22–23]). The work by von Mises [42, pp. 98–99], in which the word “collective” refers to the more conventional “population”, makes a distinction between the idea that “the calculus of probability deals each time merely with one single collective, whose distribution is subjected to certain summations or integrations”, i.e. the “restriction to one single initial collective”, and the “admissible distribution functions, the nature of the sub-sets of the attribute space for which probabilities can be defined, etc”. That is, there is a large set of possible combinations of attributes of the states, people, places, or things studied, and that is the collective which is studied. There could be a million different versions of New York State, only one of which is in fact observed. This makes n/N go to zero and justifies standard empirical inferential statistics.

Returning to Kendall et al. [22], the sample can be less than the entire population because some people were not included, but could in principle have been, or because not every possible toss of a die has been recorded, which could not even in principle be done. Those other tosses have a hypothetical existence. The hypothetical population also applies to people (in Kendall et al. [22, p. 23], to tsetse flies), as their condition and attributes could be different in a vast number of ways.

It must be emphasized that the distinction between existent and hypothetical populations is not merely a matter of ontological speculation—if it were we could safely ignore it—but one of practical importance when inferences are drawn about a population from a sample generated by it (Kendall et al. ([22], p. 23)).

Kendall et al. [22, Sect. 9.4, p. 292] pursue the idea of a hypothetical population further. If empirical probability is “a limiting relative frequency”, then the limit concept requires that samples go to infinity in principle, and “we must be able to contemplate a series of replications under identical conditions and to specify the possible values that might be realized” (Kendall et al. [22, p. 292]). Then “we must be willing that our observed value was randomly selected from the set of possibilities. At first sight, this is a rather baffling conception. However, such a structure is fundamental in a frequency-based theory of inference, and the approach is justified as being empirically useful” (p. 293).

The explanations of empirical probability by von Mises [42] repeatedly refer to the “limiting value of relative frequencies” (pp. 12, 21, 82, 105, 110, 124, 226, 229). The alternative to the argument of infinite sequences of observations from a hypothetical population is a “finite collective” (pp. 82–83). “There is no doubt about the fact that the sequences of observations to which the theory of probability is applied in practice are all finite” (p. 82). That is, samples are finite, and all states might be observed once. Infinite sequences are not substituted for the finite sequences. Probability and the resulting statistics are calculated based on the finite sequences. Hempel [41] similarly argued that “the results of a theory based on the notion of an infinite collective can be applied to finite sequences of observations” (von Mises [42, p. 85]).

In econometrics, the assumption is that there is a disturbance term in addition to the explanatory variables, so that given any set of explanatory variables, an infinite number of results (an infinite population or collective) is possible for any state, person, place, or thing. The mean, variance, and possibly the probability distribution could be estimated with enough data.

About this article

Butler, J.S., Jones, P. Theoretical and empirical distributions of the p value. METRON 76, 1–30 (2018). https://doi.org/10.1007/s40300-017-0130-2

Received: 07 July 2016
Accepted: 14 October 2017
Published: 11 December 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s40300-017-0130-2
