Statistical significance tests do not validate models
from Lars Syll
The word ‘significant’ has a special place in the world of statistics, thanks to a test that researchers use to avoid jumping to conclusions from too little data. Suppose a researcher has what looks like an exciting result: She gave 30 kids a new kind of lunch, and they all got better grades than a control group that didn’t get the lunch. Before concluding that the lunch helped, she must ask the question: If it actually had no effect, how likely would I be to get this result? If that probability, or p-value, is below a certain threshold — typically set at 5 percent — the result is deemed ‘statistically significant.’
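To make the question "If it actually had no effect, how likely would I be to get this result?" concrete, here is a minimal sketch of a permutation test in Python. The grades are made up for illustration, and the function name `permutation_p_value` and its parameters are hypothetical; nothing here is taken from an actual study. The idea is that if the lunch has no effect, the group labels are arbitrary, so we can shuffle them and see how often chance alone produces a difference at least as large as the observed one.

```python
import random

def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """One-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = (sum(pooled[:n_treated]) / n_treated
                - sum(pooled[n_treated:]) / (len(pooled) - n_treated))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Made-up grades, for illustration only (assumption, not real data).
data_rng = random.Random(42)
lunch = [data_rng.gauss(75, 10) for _ in range(30)]
no_lunch = [data_rng.gauss(70, 10) for _ in range(30)]

p = permutation_p_value(lunch, no_lunch)
print(f"p-value: {p:.4f}")
print("significant at 5%" if p < 0.05 else "not significant at 5%")
```

Note that the verdict printed on the last line depends entirely on the 5 percent cutoff, which is a convention and not a property of the data.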
Clearly, this statistical significance is not the same as real-world significance — all it offers is an indication of whether you’re seeing an effect where there is none. Even this narrow technical meaning, though, depends on where you set the threshold at which you are willing to discard the ‘null hypothesis’ — that is, in the above case, the possibility that there is no effect. I would argue that there’s no good reason to always set it at 5 percent. Rather, it should depend on what is being studied, and on the risks involved in acting — or failing to act — on the conclusions …
This example illustrates three lessons. First, researchers shouldn’t blindly follow convention in picking an appropriate p-value cutoff. Second, in order to choose the right p-value threshold, they need to know how the threshold affects the probability of a Type II error. Finally, they should consider, as best they can, the costs associated with the two kinds of errors.
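The second lesson can be sketched numerically. The snippet below is only an illustration under simplifying assumptions that are not in the text: a one-sided two-sample z-test, a known common standard deviation, and a hypothetical true effect of half a standard deviation with 30 subjects per group. The function name `type_ii_error` is likewise made up for this example.

```python
from statistics import NormalDist

def type_ii_error(alpha, effect_size, n_per_group):
    """Type II error rate of a one-sided two-sample z-test.

    effect_size is the true difference in means divided by the
    (assumed known, common) standard deviation.
    """
    z = NormalDist()
    # Cutoff the test statistic must exceed to reject the null.
    critical = z.inv_cdf(1 - alpha)
    # Mean of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing below the cutoff, i.e. missing a real effect.
    return z.cdf(critical - shift)

for alpha in (0.10, 0.05, 0.01, 0.001):
    beta = type_ii_error(alpha, effect_size=0.5, n_per_group=30)
    print(f"alpha = {alpha:<6} Type II error = {beta:.3f}")
```

Tightening the threshold lowers the risk of a false alarm but raises the risk of missing a real effect, which is why the choice has to be weighed against the costs of each kind of error.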
Statistics is a powerful tool. But, like any powerful tool, it can’t be used the same way in all situations.
Good lessons indeed, underlining how important it is not to equate science with statistical calculation. All science entails human judgement, and using statistical models doesn’t relieve us of that necessity. When we work with misspecified models, the scientific value of significance testing is actually zero, even though the statistical inferences themselves are valid! Statistical models and their accompanying significance tests are no substitutes for doing science.
In its standard form, a significance test is not the kind of ‘severe test’ we are looking for when we try to confirm or disconfirm empirical scientific hypotheses. This is problematic for many reasons, one being the strong tendency to accept the null hypothesis simply because it can’t be rejected at the standard 5% significance level. In their standard form, significance tests are biased against new hypotheses, because they make it hard to disconfirm the null hypothesis.
And as shown over and over again when significance tests are applied, people have a tendency to read ‘not disconfirmed’ as ‘probably confirmed.’ Standard scientific methodology tells us that when there is only, say, a 10% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more ‘reasonable’ to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same 10% result as our reported one, I guess most researchers would count the hypothesis as even more disconfirmed.
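One standard way to put a number on the ‘many independent tests’ point is Fisher's method for combining p-values. That technique is not named in the text above; it is brought in here purely as an illustration, and it assumes the tests really are independent and that each p-value is uniformly distributed when the null hypothesis holds. The function name `fisher_combined_p` is hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom (k = number of tests).
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form upper tail of a chi-square with an even number (2k) of
    # degrees of freedom: exp(-X/2) * sum_{i<k} (X/2)^i / i!
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )

for k in (1, 2, 5, 10):
    combined = fisher_combined_p([0.10] * k)
    print(f"{k:>2} independent tests at p = 0.10 -> combined p = {combined:.4f}")
```

With five independent tests each at 10%, the combined p-value comes out at roughly 0.01, well below the conventional 5% cutoff that none of the individual tests reached.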
We should never forget that the underlying parameters we use when performing significance tests are model constructions. Our p-values mean next to nothing if the model is wrong. And most importantly — statistical significance tests DO NOT validate models!
In journal articles a typical regression equation will have an intercept and several explanatory variables. The regression output will usually include an F-test, with p – 1 degrees of freedom in the numerator and n – p in the denominator. The null hypothesis will not be stated. The missing null hypothesis is that all the coefficients vanish, except the intercept.
If F is significant, that is often thought to validate the model. Mistake. The F-test takes the model as given. Significance only means this: if the model is right and the coefficients are 0, it is very unlikely to get such a big F-statistic. Logically, there are three possibilities on the table:
i) An unlikely event occurred.
ii) Or the model is right and some of the coefficients differ from 0.
iii) Or the model is wrong.
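The third possibility can be illustrated with a small sketch in Python. The data are synthetic and chosen by assumption: the response is quadratic in x (plus mild noise), but a straight line with an intercept is fitted, so the model is wrong by construction. The helper name `simple_ols_f` is made up for this example, and the critical value quoted in the comments (about 4.41 for F with 1 and 18 degrees of freedom at the 5% level) is taken from standard tables.

```python
import random

def simple_ols_f(x, y):
    """Fit y = a + b*x by least squares; return the F-statistic and residuals."""
    n, p = len(x), 2  # p counts the intercept and the slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)                 # unexplained variation
    ssr = sum((yi - mean_y) ** 2 for yi in y) - sse      # explained variation
    # F with p - 1 numerator and n - p denominator degrees of freedom.
    f_stat = (ssr / (p - 1)) / (sse / (n - p))
    return f_stat, residuals

# Synthetic data (assumption): the true relation is quadratic, not linear.
rng = random.Random(1)
x = list(range(1, 21))
y = [xi ** 2 + rng.gauss(0, 3) for xi in x]

f_stat, residuals = simple_ols_f(x, y)
print(f"F = {f_stat:.1f} (5% critical value for F(1, 18) is about 4.41)")
# The sign pattern of the residuals exposes the misspecification.
print("".join("+" if r > 0 else "-" for r in residuals))
```

The F-statistic is far beyond the critical value, yet the residuals are positive at both ends and negative in the middle. The test rejects ‘all coefficients vanish’ while saying nothing about whether the linear model itself is right.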