## Two must-read statistics books

from **Lars Syll**

Mathematical statistician **David Freedman**’s *Statistical Models and Causal Inference* (Cambridge University Press, 2010) and *Statistical Models: Theory and Practice* (Cambridge University Press, 2009) are marvellous books. They ought to be mandatory reading for every serious social scientist — including economists and econometricians — who doesn’t want to succumb to *ad hoc* assumptions and unsupported statistical conclusions!

How do we calibrate the uncertainty introduced by data collection? Nowadays, this question has become quite salient, and it is routinely answered using well-known methods of statistical inference, with standard errors, t-tests, and P-values … These conventional answers, however, turn out to depend critically on certain rather restrictive assumptions, for instance, random sampling …

Thus, investigators who use conventional statistical techniques turn out to be making, explicitly or implicitly, quite restrictive behavioral assumptions about their data collection process … More typically, perhaps, the data in hand are simply the data most readily available …

The moment that conventional statistical inferences are made from convenience samples, substantive assumptions are made about how the social world operates … When applied to convenience samples, the random sampling assumption is not a mere technicality or a minor revision on the periphery; the assumption becomes an integral part of the theory …

In particular, regression and its elaborations … are now standard tools of the trade. Although rarely discussed, statistical assumptions have major impacts on analytic results obtained by such methods.

Consider the usual textbook exposition of least squares regression. We have n observational units, indexed by i = 1, …, n. There is a response variable yi, conceptualized as μi + εi, where μi is the theoretical mean of yi while the disturbances or errors εi represent the impact of random variation (sometimes of omitted variables). The errors are assumed to be drawn independently from a common (Gaussian) distribution with mean 0 and finite variance. Generally, the error distribution is not empirically identifiable outside the model; so it cannot be studied directly—even in principle—without the model. The error distribution is an imaginary population and the errors εi are treated as if they were a random sample from this imaginary population—a research strategy whose frailty was discussed earlier.

Usually, explanatory variables are introduced and μi is hypothesized to be a linear combination of such variables. The assumptions about the μi and εi are seldom justified or even made explicit—although minor correlations in the εi can create major bias in estimated standard errors for coefficients …

Why do μi and εi behave as assumed? To answer this question, investigators would have to consider, much more closely than is commonly done, the connection between social processes and statistical assumptions …

We have tried to demonstrate that statistical inference with convenience samples is a risky business. While there are better and worse ways to proceed with the data at hand, real progress depends on deeper understanding of the data-generation mechanism. In practice, statistical issues and substantive issues overlap. No amount of statistical maneuvering will get very far without some understanding of how the data were produced.

More generally, we are highly suspicious of efforts to develop empirical generalizations from any single dataset. Rather than ask what would happen in principle if the study were repeated, it makes sense to actually repeat the study. Indeed, it is probably impossible to predict the changes attendant on replication without doing replications. Similarly, it may be impossible to predict changes resulting from interventions without actually intervening.
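Freedman’s warning about correlated errors is easy to see in a small simulation. The sketch below is my own illustration, not from the book, and every parameter value in it is arbitrary: it fits ordinary least squares to data generated as yi = μi + εi, but with AR(1) errors rather than independent draws, so the conventional standard-error formula, which assumes independence, understates the true sampling variability of the slope.

```python
import numpy as np

# Minimal sketch (assumed setup, not Freedman's code): OLS on
# y_i = a + b*x_i + e_i, where the errors follow an AR(1) process
# e_t = rho*e_{t-1} + u_t instead of being independent draws.
rng = np.random.default_rng(0)
n, reps, rho = 200, 2000, 0.8
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])

slopes, reported_se = [], []
for _ in range(reps):
    u = rng.normal(size=n)
    e = np.empty(n)
    e[0] = u[0]
    for t in range(1, n):
        e[t] = rho * e[t - 1] + u[t]      # correlated, not independent, errors
    y = 1.0 + 2.0 * x + e

    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - 2)          # usual i.i.d.-error variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)     # conventional OLS covariance formula
    slopes.append(beta[1])
    reported_se.append(np.sqrt(cov[1, 1]))

print("actual sampling SD of slope estimates:", round(float(np.std(slopes)), 3))
print("average conventional standard error:  ", round(float(np.mean(reported_se)), 3))
# With rho = 0.8 the reported SE comes out well below the true spread of
# the estimates: "minor" correlation in the errors, major bias in the SE.
```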

How would you compare these to Judea Pearl’s causal inference publications?

Judea is overall more optimistic about how far statistics can help/take us in our search for causality. David and yours truly were/are more pessimistic (to some degree depending on doubts about how far the intervention/manipulation paradigm takes us in open real-world systems).

I found the passage interesting. Thank you, Lars, for picking it out and sharing it.

If the search for causality is what is important, then why are Buckingham’s 𝛱 theorem and dimensional analysis ignored? They provide a method for establishing which factors are significant even when the underlying mechanisms are unknown.

A good example of why groups of dimension one are important in the sciences and engineering is to be found at

http://www-mdp.eng.cam.ac.uk/web/library/enginfo/aerothermal_dvd_only/aero/fprops/dimension/node6.html

It explains this with the example of finding the drag on a cylinder in a fluid flow. It shows how the simplistic method of measuring the drag while varying only one of the possible parameters, keeping the others constant, and repeating this for each parameter in turn may take a lifetime of experiments. How often does one read “ceteris paribus” in economic analysis? The solution is to vary the value of a single group of dimension one — a manageable proposition.
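To make the bookkeeping concrete, here is a minimal sketch of my own, loosely following the linked Cambridge page; all the numbers in it are invented for illustration. For the cylinder, the 𝛱 theorem collapses the four raw inputs (density ρ, velocity v, diameter d, viscosity μ) into a single dimensionless group, the Reynolds number Re = ρvd/μ, so the drag coefficient need only be measured against Re:

```python
# Sketch of the reduction in the cylinder-drag example (illustrative
# numbers only). Raw inputs: fluid density rho, velocity v, cylinder
# diameter d, viscosity mu. Buckingham's Pi theorem gives
#   D / (rho * v**2 * d**2) = f(Re),   with   Re = rho * v * d / mu,
# so one dimensionless group stands in for a four-parameter sweep.

def reynolds(rho: float, v: float, d: float, mu: float) -> float:
    """The single dimensionless group governing the flow."""
    return rho * v * d / mu

# One-at-a-time sweep: 10 values for each of 4 parameters.
print("naive sweep:     ", 10 ** 4, "experiments")
# Sweep over Re alone: 10 experiments cover the same ground.
print("Pi-theorem sweep:", 10, "experiments")

# Physically different setups with equal Re share the same drag
# coefficient, so either one can be run in the lab:
print(reynolds(rho=1.2, v=15.0, d=0.1, mu=1.8e-5))      # air, large cylinder  -> 1e5
print(reynolds(rho=1000.0, v=10.0, d=0.01, mu=1.0e-3))  # water, small cylinder -> 1e5
```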

I would suggest that conventional economic analysis is attempting to follow this simplistic, futile method. Proper scientific methods need to be implemented. Dimensional analysis, and the first-principles analysis that subsumes it, are the appropriate tools. Why are they not in the tool bags of academic analysts?

Social scientific theories always answer questions raised within certain historical contexts, which involve the common presuppositions of an era. Thorough insight into a social science problem therefore requires a historical perspective. Consequently, to better understand contemporary approaches to the complex issue of causation, and the problems they raise, it is necessary to have a clear insight into the historical evolution of the concept of cause and its use. Even a cursory examination shows that the history of the concept of cause reveals a remarkable discrepancy between the constancy of the terminology and the gradual shift in the meaning of the terms used. Most generally, in social science today causation is a relationship that holds between events, objects, variables, or states of affairs. In the lives of social scientists, as in the lives of laypersons, causality is the centerpiece of the universe and so the main subject of human knowledge. It is needed for knowing the beginnings and endings of things, and for making sense of the world. Any question “What to do?” implies causation.

Current notions of causation in social science are based on (caused by?) the writings of David Hume. According to Hume, it is impossible to demonstrate empirically that a cause produces an effect. Just because the sun has risen every day since the beginning of the Earth does not mean that it will rise again tomorrow. However, it is intolerable to go about one’s life without assuming such connections, and the best we can do is to maintain an open mind and never presume that we know any laws of causality for certain. Hume also notes that causality is an interpretation (by humans in society) of observables; causal statements are always inferential. For example, does the sunrise cause the rooster to crow, or is it the reverse?

In answering such questions social scientists look to three criteria. First, temporal order: it is usually assumed that the cause chronologically precedes the effect; in a strict reading, if A causes B, then A must always be followed by B. Second, association/correlation: changes in X cause changes in Y. For example, football weekends cause heavier traffic, more food sales, and so on. Social scientists must be careful in interpreting correlation coefficients: just because two variables are highly correlated does not mean that one causes the other, and there are many good examples of correlations that are nonsensical when interpreted causally. Increases in crime cause increases in ice cream sales. Really? The number of cavities in elementary school children and vocabulary size have a strong positive correlation. Causation? Third, non-spuriousness: no other factor causes both. Might A and B be “caused” by a third factor of which we have no knowledge?

Finally, current social scientific concepts of causality rest on some specific assumptions. These are:

1) Reality is real: it exists “out there” and waits to be discovered. Kant argued that reality exists independently of people’s perceptions of it.

2) Reality is ordered (not chaotic).

3) Human behavior is patterned. Without this assumption, the logic and predictions of causation would be impossible.

4) Reality is stable, but knowledge about it is additive.

At least two of these assumptions have no empirical basis:

1) Reality can be changed.

2) People can change their history (reality).

3) Humans are qualitatively different from the objects of study in the natural sciences (rocks, stars, chemical compounds, etc.).

4) Humans think and learn, have an awareness of themselves and their past. They create themselves through culture and live in collective relationships we call societies.

5) These unique human characteristics create debate about causation in the social sciences.

As for statistics, I’m not certain it has a role in investigating causes, apart from describing the situations we believe do or do not involve causation. Statistics is the wrong tool for anything else. It is seldom used by laypersons for causal inference. And, as noted above, its use by social scientists to “infer” causation involves some hazy assumptions.
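The ice-cream/crime example above is easy to reproduce in a toy simulation; the sketch below is my own and every number in it is invented. A hidden common cause, summer temperature, drives both series and produces a strong correlation with no causal link between them, and conditioning on the confounder makes the association largely vanish:

```python
import numpy as np

# Toy sketch of spurious correlation via a confounder (all numbers
# made up). Temperature drives both ice cream sales and crime reports;
# neither series causes the other.
rng = np.random.default_rng(1)
n = 365
temperature = 15 + 10 * np.sin(np.linspace(0, 2 * np.pi, n)) + rng.normal(0, 2, n)

ice_cream_sales = 50 + 8 * temperature + rng.normal(0, 20, n)  # caused by heat
crime_reports   = 20 + 3 * temperature + rng.normal(0, 10, n)  # also caused by heat

r_raw = np.corrcoef(ice_cream_sales, crime_reports)[0, 1]
print(f"raw correlation: {r_raw:.2f}")  # strong, but spurious

def residualize(y, x):
    """Remove the linear effect of x from y (partial out the confounder)."""
    X = np.column_stack([np.ones(len(x)), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

r_partial = np.corrcoef(residualize(ice_cream_sales, temperature),
                        residualize(crime_reports, temperature))[0, 1]
print(f"correlation given temperature: {r_partial:.2f}")  # near zero
```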