## ‘Sizeless science’ and the cult of significance testing

from **Lars Syll**

A couple of years ago yours truly had an interesting luncheon discussion with Deirdre McCloskey on her controversy with Kevin Hoover on significance testing. It got me thinking about where the fetish status of significance testing comes from and why we are still teaching and practising it without serious qualifications despite its obvious inadequacies.

A non-trivial part of teaching statistics consists of teaching students to perform significance testing. A problem I have noticed repeatedly over the years, however, is that no matter how careful you try to be in explicating what the probabilities generated by these statistical tests – *p-values* – really are, most students still misinterpret them.

Giving a statistics course for the *Swedish National Research School in History*, I asked the students at the exam to explain how one should correctly interpret *p-values*. Although the correct definition is p(data|null hypothesis), a majority of the students either misinterpreted the *p-value* as the *likelihood of a sampling error* (which is of course wrong, since the very computation of the p-value is based on the assumption that sampling error is what causes the sample statistic not to coincide with the null hypothesis), or took it to be the probability of the null hypothesis being true, given the data (a case of the fallacy of transposing the conditional, and of course also wrong, since that is p(null hypothesis|data) rather than the correct p(data|null hypothesis)).
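A toy Bayes-rule calculation makes the transposed conditional concrete (all numbers here are invented purely for illustration): even when p(data|null) = 0.05, the posterior p(null|data) can come out quite different, since it also depends on the prior and on how probable the data are under the alternative.

```python
# Hypothetical illustration: p(data|H0) is not p(H0|data).
# All numbers are made up for the sake of the example.
p_h0 = 0.5               # prior probability that the null is true
p_data_given_h0 = 0.05   # the p-value-like quantity
p_data_given_h1 = 0.50   # how probable the data are under the alternative

# Bayes' rule: p(H0|data) = p(H0) * p(data|H0) / p(data)
p_data = p_h0 * p_data_given_h0 + (1 - p_h0) * p_data_given_h1
p_h0_given_data = p_h0 * p_data_given_h0 / p_data

print(p_data_given_h0)            # 0.05
print(round(p_h0_given_data, 3))  # 0.091
```

With these particular made-up numbers the two conditionals happen to land close together; change the prior to, say, 0.9 and p(null|data) rises to about 0.47 — which is the whole point: the p-value alone fixes neither.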

This is not to be blamed on students’ ignorance, but rather on significance testing not being particularly transparent (conditional probability inference is difficult even for those of us who teach and practise it). A lot of researchers fall prey to the same mistakes. So, given that it is anyway very unlikely that any population parameter is exactly zero, and that, contrary to assumption, most samples in social science and economics are not random and do not have the right distributional shape – why continue to press students and researchers to do null hypothesis significance testing, testing that relies on a weird backward logic that students and researchers usually don’t understand?

Reviewing Deirdre’s and Stephen Ziliak’s *The Cult of Statistical Significance* (University of Michigan Press, 2008), mathematical statistician Olle Häggström succinctly summarizes what the debate is all about:

> Stephen Ziliak and Deirdre McCloskey claim in their recent book *The Cult of Statistical Significance* [ZM] that the reliance on statistical methods has gone too far and turned into a ritual and an obstacle to scientific progress. A typical situation is the following. A scientist formulates a null hypothesis. By means of a significance test, she tries to falsify it. The analysis leads to a p-value, which indicates how likely it would have been, if the null hypothesis were true, to obtain data at least as extreme as those she actually got. If the p-value is below a certain prespecified threshold (typically 0.01 or 0.05), the result is deemed statistically significant, which, although far from constituting a definite disproof of the null hypothesis, counts as evidence against it.
>
> Imagine now that a new drug for reducing blood pressure is being tested and that the fact of the matter is that the drug does have a positive effect (as compared with a placebo) but that the effect is so small that it is of no practical relevance to the patient’s health or well-being. If the study involves sufficiently many patients, the effect will nevertheless with high probability be detected, and the study will yield statistical significance. The lesson to learn from this is that in a medical study, statistical significance is not enough — the detected effect also needs to be large enough to be medically significant. Likewise, empirical studies in economics (or psychology, geology, etc.) need to consider not only statistical significance but also economic (psychological, geological, etc.) significance.
>
> A major point in *The Cult of Statistical Significance* is the observation that many researchers are so obsessed with statistical significance that they neglect to ask themselves whether the detected discrepancies are large enough to be of any subject-matter significance. Ziliak and McCloskey call this neglect *sizeless science* …
>
> *The Cult of Statistical Significance* is written in an entertaining and polemical style. Sometimes the authors push their position a bit far, such as when they ask themselves: “If null hypothesis significance testing is as idiotic as we and its other critics have so long believed, how on earth has it survived?” (p. 240). Granted, the single-minded focus on statistical significance that they label sizeless science is bad practice. Still, to throw out the use of significance tests would be a mistake, considering how often it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise. For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance.
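Häggström’s blood-pressure point can be sketched in a few lines of Python. The effect size (0.2 units), spread (10) and sample size are all invented for illustration: a true difference far too small to matter clinically still yields a tiny p-value once the sample is large enough.

```python
# Illustrative simulation (all numbers invented): a practically
# irrelevant true effect becomes "statistically significant" at large n.
import math
import random

random.seed(0)
n = 200_000
true_effect = 0.2   # tiny, clinically irrelevant difference
sd = 10.0

placebo = [random.gauss(0.0, sd) for _ in range(n)]
drug = [random.gauss(true_effect, sd) for _ in range(n)]

diff = sum(drug) / n - sum(placebo) / n
se = sd * math.sqrt(2.0 / n)          # known-variance z-test, for simplicity
z = diff / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

print(f"difference = {diff:.3f}, z = {z:.1f}, p = {p:.1e}")
```

Statistically “significant”, yet the estimated difference itself tells you it is of no practical consequence: the size matters, not the asterisk.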

Statistical significance doesn’t say that something is important or true. Although Häggström has a point in his last remark, I still think – since there are already far better and more relevant tests that can be done (see e.g. my posts here and here) – it is high time to reconsider the proper function of what has now really become a statistical fetish.

Yeah, time to do away with this system.

Forty years ago I was involved in a research project about the difference between parental and student community attitudes. It was being conducted by a student under the paid auspices of a professor at a nearby university and the managers of the agency where I was a Master of Social Work student. The student researcher crunched his data and concluded that there was no difference. He was unsure of that conclusion (my clinical experience said there was a difference), but when he spoke to his university-based supervisor he was assured he was correct. The results were going to impact on funding services for youth in the community, and this looked like it was going to put the funding at risk. I was asked to review the research with the student and discovered that he had ACCEPTED the null hypothesis (there is no difference in the attitudes of youth and adults) rather than rejecting it, despite having significant differences in his chi-square analysis of the data between adults and youth. I still have a copy of the study and delight in his dedication to me wherein he gave me the middle name “Stats.” The press conference explaining the results later that week allowed the community to justify funding a special programme for youth.

Many publications in psychology require authors to report effect sizes as well as p-values. I don’t know what the requirements are in other disciplines, and this may just be a requirement in a sub-discipline of psychology (i.e. cognition and cognitive neuroscience).

The problem, as I see it, is not that there is anything in particular wrong with the p-value as a tool but that there are far too many grotesquely incompetent people infesting our laboratories: they should be weeded out forthwith and sent off to be greengrocers or whatever God intended them to be.

This is no joke. If you cannot get your head around the prosecutor’s fallacy of equating P(A|B) with P(B|A), or around inverse problems as a class, then you have no business being responsible for any important piece of research. Washing bottles, maybe, or counting beans or sexing rats, but you are not to be trusted to deliver any sort of professional-level analysis: as the literature shows all too clearly, such people fall into the simplest logical traps and are subsequently befuddled when trying to diagnose where it all went wrong.

So teach your students something along the lines of “the p value is the first step in the long and arduous process of establishing that some claim is both justified and important. Its purpose is to check, right off the bat, whether you are trying to extract more insight from your numbers than the data actually warrant.” From that starting point you can move on to study the other important methodological issues like multiple comparisons and confounding variables that are also abused on a regular basis by the uncomprehending.
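A quick back-of-the-envelope on the multiple-comparisons point: even when every null hypothesis is true, running many tests at the 5% level makes at least one spurious “significant” finding more likely than not.

```python
# Probability of at least one false positive across m independent
# tests of true nulls, each conducted at significance level alpha.
alpha = 0.05
m = 20
p_at_least_one = 1 - (1 - alpha) ** m
print(round(p_at_least_one, 3))  # 0.642
```

With twenty independent tests of true nulls, the chance of at least one “discovery” is already about 64% — which is why uncorrected fishing expeditions through a data set are so treacherous.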

“since that is p(null hypothesis|data) rather than the correct p(data|null hypothesis)).”

No, the p-value is not the probability of the data given the null hypothesis (except in some special cases). It is the probability of the data plus some unobserved results that would also be considered evidence against the null hypothesis. Which non-observations to include in calculating the p-value is not always clear. For instance, suppose that the null hypothesis is that a coin is fair, with a 50-50 chance of a toss coming up heads or tails, and suppose that we toss it 8 times and it comes up heads 7 times and tails once. That exact sequence of tosses occurs with probability 1/256, but that is not the p-value. We also consider the other ways in which heads could come up 7 times and tails once; we consider the exact sequence to be irrelevant. That probability is 8/256. But that is not all. We also consider the case in which the coin came up heads 8 times. That result is worse for the null hypothesis, but is definitely not part of the data — it is unobserved. Anyway, it gives us a p-value of 9/256. But wait, there’s more! A result of 7 tails and 1 head would also be as bad for the null hypothesis as a result of 7 heads and 1 tail. Why not include those unobserved results in the p-value calculation? Well, we do just that in a 2-tailed test, which yields a p-value of 18/256. It may seem strange, but why not a 2-tailed test? After all, one coin flip gives no information at all about whether the null hypothesis is true or not. Shouldn’t the p-value after one flip be 1.00? It is with a 2-tailed test.
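The arithmetic in the comment above can be checked directly; here is a short Python sketch of the 8-toss example:

```python
# Exact probabilities for 8 tosses of a fair coin (the commenter's example).
from math import comb

n = 8
p_exact_sequence = 1 / 2**n                      # one specific ordering: 1/256
p_seven_heads = comb(n, 7) / 2**n                # any ordering of 7H, 1T: 8/256
p_one_tailed = (comb(n, 7) + comb(n, 8)) / 2**n  # add the unobserved 8H: 9/256
p_two_tailed = 2 * p_one_tailed                  # mirror the 7T/1T side: 18/256

print(p_exact_sequence, p_seven_heads, p_one_tailed, p_two_tailed)
```

Each step adds more *unobserved* outcomes to the tail sum, which is exactly the commenter’s point: the p-value is a statement about a reference set of results that never happened, not about the data alone.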