
## The deadly sin of statistical reification

from Lars Syll

People sometimes speak as if random variables “behave” in a certain way, as if they have a life of their own. Thus “X is normally distributed”, “W follows a gamma”, “The underlying distribution behind y is binomial”, and so on. To behave is to act, to be caused, to react. Somehow, it is thought, these distributions are causes. This is the Deadly Sin of Reification, perhaps caused by the beauty of the mathematics where, due to some mental abstraction, the equations undergo biogenesis. The behavior of these “random” creatures is expressed in language about “distributions.” We hear, “Many things are normally (gamma, Weibull, etc. etc.) distributed”, “Height is normally distributed”, “Y is binomial”, “Independent, identically distributed random variables”.

There is no such thing as a “true” distribution in any ontological sense. Examples abound. The temptation here is magical thinking. Strictly and without qualification, to say a thing is “distributed as” is to assume murky causes are at work, pushing variables this way and that knowing they are “part of” some mathematician’s probability distribution.

To say a thing “has” a distribution is false. The only thing we are privileged to say is things like this: “Given this-and-such set of premises, the probability X takes this value equals that”, where “that” is calculated via a probability implied by the premises … Probability is a matter of ascribable or quantifiable uncertainty, a logical relation between accepted premises and some specified proposition, and nothing more.

William Briggs
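
Briggs’s point that a probability is a relation between premises and a proposition, not a property of the object, can be illustrated with a toy calculation. What follows is only a minimal sketch, assuming the simplest kind of premises: a finite list of possibilities about which nothing else is known, so that each counts equally. The function name `probability` and the three premise sets are invented for illustration and are not taken from Briggs.

```python
from fractions import Fraction


def probability(possibilities, proposition):
    """Probability of `proposition` given premises that only list `possibilities`.

    Assumption (ours): the premises say nothing beyond the list itself,
    so every listed possibility is counted equally.
    """
    possibilities = list(possibilities)
    favourable = [x for x in possibilities if proposition(x)]
    return Fraction(len(favourable), len(possibilities))


def shows_six(face):
    return face == 6


# Three different sets of premises about "the same" throw.
premises_a = range(1, 7)                             # six faces, numbered 1 to 6
premises_b = [f for f in range(1, 7) if f % 2 == 0]  # as above, plus: the face is even
premises_c = range(1, 5)                             # four faces, numbered 1 to 4

p_a = probability(premises_a, shows_six)
p_b = probability(premises_b, shows_six)
p_c = probability(premises_c, shows_six)
```

The proposition “a 6 shows” gets three different probabilities (1/6, 1/3 and 0), although nothing about any physical die has changed; only the premises have.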

In econometrics one often gets the feeling that many of its practitioners think of it as a kind of automatic inferential machine: input data and out comes causal knowledge. Like pulling a rabbit from a hat. Great, but first you have to put the rabbit in the hat. And this is where assumptions about distributions and probabilities come into the picture.

The assumption of imaginary “super populations” is one of the many dubious assumptions used in modern econometrics.

As social scientists — and economists — we have to confront the all-important question of how to handle uncertainty and randomness. Should we define randomness with probability? If we do, we have to accept that to speak of randomness we also have to presuppose the existence of nomological probability machines, since probabilities cannot be spoken of – and actually, to be strict, do not at all exist – without specifying such system-contexts. Accepting a domain of probability theory and sample space of infinite populations also implies that judgments are made on the basis of observations that are actually never made!

Infinitely repeated trials or samplings never take place in the real world. So that cannot be a sound inductive basis for a science with aspirations of explaining real-world socio-economic processes, structures or events. It’s not tenable.

And as if this weren’t enough, one could, as we’ve seen, also seriously wonder what kind of “populations” these statistical and econometric models are ultimately based on. Why should we as social scientists (and not as pure mathematicians working with formal-axiomatic systems without the urge to confront our models with real target systems) unquestioningly accept models based on concepts like the “infinite super populations” used in, e.g., the potential outcome framework that has become so popular lately in the social sciences?

Of course, one could treat observational or experimental data as random samples from real populations. I have no problem with that. But probabilistic econometrics does not content itself with that kind of population. Instead, it creates imaginary populations of “parallel universes” and assumes that our data are random samples from that kind of “infinite super populations.”

But this is actually nothing else but hand-waving! And it is inadequate for real science. As David Freedman writes:

With this approach, the investigator does not explicitly define a population that could in principle be studied, with unlimited resources of time and money. The investigator merely assumes that such a population exists in some ill-defined sense. And there is a further assumption, that the data set being analyzed can be treated as if it were based on a random sample from the assumed population. These are convenient fictions … Nevertheless, reliance on imaginary populations is widespread. Indeed regression models are commonly used to analyze convenience samples … The rhetoric of imaginary populations is seductive because it seems to free the investigator from the necessity of understanding how data were generated.
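
Freedman’s worry about convenience samples can be made concrete with a small simulation. The sketch below assumes a real, finite, fully enumerable population of 10,000 hypothetical incomes, and a convenience sample that only ever reaches the upper half of it; all numbers, names and the selection mechanism are made up for illustration. Treating the convenience sample as if it were a random sample gives an estimate of the mean that is far off, while an actual random sample from the same population does not.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# A real, finite population that could in principle be studied in full.
# The lognormal shape is an arbitrary assumption, not a claim about incomes.
population = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
population_mean = statistics.fmean(population)

# A genuine random sample: every unit has the same chance of selection.
random_sample = rng.sample(population, 500)

# A convenience sample: we only reach units in the upper half
# (hypothetical selection mechanism).
reachable = sorted(population)[len(population) // 2:]
convenience_sample = rng.sample(reachable, 500)

random_error = abs(statistics.fmean(random_sample) - population_mean)
convenience_error = abs(statistics.fmean(convenience_sample) - population_mean)

print(f"population mean:          {population_mean:.3f}")
print(f"random-sample error:      {random_error:.3f}")
print(f"convenience-sample error: {convenience_error:.3f}")
```

No appeal to an imaginary super population is needed to see the problem: the population here is explicitly defined, and the error comes entirely from how the data were generated.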

In social sciences — including economics — it’s always wise to ponder C. S. Peirce’s remark that universes are not as common as peanuts …