Sizeless science and the cult of statistical significance
from Lars Syll
Last year yours truly had an interesting luncheon discussion with Deirdre McCloskey on her controversy with Kevin Hoover on significance testing. It got me thinking about where the fetish status of significance testing comes from and why we are still teaching and practising it without serious qualifications despite its obvious inadequacies.
A non-trivial part of teaching statistics is made up of learning students to perform significance testing. A problem I have noticed repeatedly over the years, however, is that no matter how careful you try to be in explicating what the probabilities generated by these statistical tests – p-values – really are, still most students misinterpret them.
Giving a statistics course for the Swedish National Research School in History, I asked the students at the exam to explain how one should correctly interpret p-values. Although the correct definition is p(data|null hypothesis), a majority of the students either misinterpreted the p-value as being the likelihood of a sampling error (which of course is wrong, since the very computation of the p value is based on the assumption that sampling errors are what causes the sample statistics not coinciding with the null hypothesis) or that the p-value is the probability of the null hypothesis being true, given the data (which of course also is wrong, since that is p(null hypothesis|data) rather than the correct p(data|null hypothesis)).
This is not to blame on students’ ignorance, but rather on significance testing not being particularly transparent (conditional probability inference is difficult even to those of us who teach and practice it). A lot of researchers fall pray to the same mistakes. So – given that it anyway is very unlikely than any population parameter is exactly zero, and that contrary to assumption most samples in social science and economics are not random or having the right distributional shape – why continue to press students and researchers to do null hypothesis significance testing, testing that relies on weird backward logic that students and researchers usually don’t understand?
In a recent review of Deirdre’s and Stephen Ziliak‘s The Cult of Statistical Significance (University of Michigan Press 2008), mathematical statistician Olle Häggström succinctly summarizes what the debate is all about:
Stephen Ziliak and Deirdre McCloskey, claim in their recent book The Cult of Statistical Significance [ZM] that the reliance on statistical methods has gone too far and turned into a ritual and an obstacle to scientific progress.
A typical situation is the following. A scientist formulates a null hypothesis. By means of a significance test, she tries to falsify it. The analysis leads to a p-value, which indicates how likely it would have been, if the null hypothesis were true, to obtain data at least as extreme as those she actually got. If the p-value is below a certain prespecified threshold (typically 0.01 or 0.05), the result is deemed statistically significant, which, although far from constituting a definite
disproof of the null hypothesis, counts as evidence against it.
Imagine now that a new drug for reducing blood pressure is being tested and that the fact of the matter is that the drug does have a positive effect (as compared with a placebo) but that the effect is so small that it is of no practical relevance to the patient’s health or well-being. If the study involves sufficiently many patients, the effect will nevertheless with high probability be detected, and the study will yield statistical significance. The lesson to learn from this is that in a medical study, statistical significance is not enough—the detected effect also needs to be large enough to be medically significant. Likewise, empirical studies in economics (or psychology, geology, etc.) need to consider not only statistical significance but also economic (psychological, geological, etc.) significance.
A major point in The Cult of Statistical Significance is the observation that many researchers are so obsessed with statistical significance that they neglect to ask themselves whether the detected discrepancies are large enough to be of any subject-matter significance. Ziliak and McCloskey call this neglect sizeless science …
The Cult of Statistical Significance is written in an entertaining and polemical style. Sometimes the authors push their position a bit far, such as when they ask themselves: “If nullhypothesis significance testing is as idiotic as we and its other critics have so long believed, how on earth has it survived?” (p. 240). Granted, the single-minded focus on statistical significance that they label sizeless science is bad practice. Still, to throw out the use of significance tests would be a mistake, considering how often it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise. For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance.
Statistical significance doesn’t say that something is important or true. Although Häggström has a point in his last remark, I still think – since there already are far better and more relevant testing that can be done (see e. g. my posts here and here) – it is high time to consider what should be the proper function of what has now really become a statistical fetish.