Why data is not enough to answer scientific questions

from Lars Syll

Ironically, the need for a theory of causation began to surface at the same time that statistics came into being. In fact modern statistics hatched out of the causal questions that Galton and Pearson asked about heredity and out of their ingenious attempts to answer them from cross-generation data. Unfortunately, they failed in this endeavor and, rather than pause to ask “Why?”, they declared those questions off limits, and turned to develop a thriving, causality-free enterprise called statistics.

This was a critical moment in the history of science. The opportunity to equip causal questions with a language of their own came very close to being realized, but was squandered. In the following years, these questions were declared unscientific and went underground. Despite heroic efforts by the geneticist Sewall Wright (1889-1988), causal vocabulary was virtually prohibited for more than half a century. And when you prohibit speech, you prohibit thought, and you stifle principles, methods, and tools.

Readers do not have to be scientists to witness this prohibition. In Statistics 101, every student learns to chant: “Correlation is not causation.” With good reason! The rooster crow is highly correlated with the sunrise, yet it does not cause the sunrise.

Unfortunately, statistics took this common-sense observation and turned it into a fetish. It tells us that correlation is not causation, but it does not tell us what causation is. In vain will you search the index of a statistics textbook for an entry on “cause.” Students are never allowed to say that X is the cause of Y — only that X and Y are related or associated.

A popular idea in quantitative social sciences is to think of a cause (C) as something that increases the probability of its effect or outcome (O). That is:  

P(O|C) > P(O|-C)

However, as is also well known, a correlation between two variables, say A and B, does not necessarily imply that one is a cause of the other, or the other way around, since they may both be effects of a common cause, C.

In statistics and econometrics, we usually solve this confounder problem by controlling for C, i.e. by holding C fixed. This means that we actually look at different populations: those in which C occurs in every case, and those in which C doesn’t occur at all. This means that knowing the value of A does not influence the probability of C [P(C|A) = P(C)]. So if there still exists a correlation between A and B in either of these populations, there has to be some other cause operating. But if all other possible causes have been controlled for too, and there is still a correlation between A and B, we may safely conclude that A is a cause of B, since by controlling for all other possible causes, the correlation between the putative cause A and all the other possible causes (D, E, F …) is broken.
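
The common-cause case can be illustrated with a small simulation. This is only a sketch under assumed numbers: the probabilities 0.8 and 0.2, the sample size, and the helper names `pearson` and `simulate_common_cause` are hypothetical choices made for illustration. C is a binary common cause that raises the probability of both A and B, while A has no effect on B at all. In the pooled data A "raises the probability" of B in the sense of P(B|A) > P(B|-A), but once C is held fixed the association between A and B should be close to zero.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_common_cause(n=20000, seed=0):
    """Draw (C, A, B) triples where C causes both A and B, and A never affects B.

    The probabilities below are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random() < 0.5                    # common cause present?
        a = rng.random() < (0.8 if c else 0.2)    # A depends only on C
        b = rng.random() < (0.8 if c else 0.2)    # B depends only on C
        rows.append((int(c), int(a), int(b)))
    return rows


rows = simulate_common_cause()

# Pooled data: A looks like a probability-raising "cause" of B.
p_b_given_a = sum(b for _, a, b in rows if a == 1) / sum(1 for _, a, _ in rows if a == 1)
p_b_given_not_a = sum(b for _, a, b in rows if a == 0) / sum(1 for _, a, _ in rows if a == 0)
overall = pearson([a for _, a, _ in rows], [b for _, _, b in rows])

# Holding C fixed: correlation between A and B within each population.
within = {
    c: pearson(
        [a for cc, a, _ in rows if cc == c],
        [b for cc, _, b in rows if cc == c],
    )
    for c in (0, 1)
}

print(f"P(B|A) = {p_b_given_a:.2f}, P(B|-A) = {p_b_given_not_a:.2f}")
print(f"correlation(A, B), pooled: {overall:.2f}")
print(f"correlation(A, B), C fixed: {within[0]:.2f} (C=0), {within[1]:.2f} (C=1)")
```

The simulation works only because the data-generating structure is known in advance. With real data we would have to know, from somewhere other than the data, that C is the variable to hold fixed.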

This is, of course, a very demanding prerequisite, since we may never actually be sure to have identified all putative causes. Even in scientific experiments, the number of uncontrolled causes may be innumerable. Since nothing less will do, we all understand how hard it is to actually get from correlation to causality. This also means that relying only on statistics or econometrics is not enough to deduce causes from correlations.

Some people think that randomization may solve the empirical problem. By randomizing we get different populations that are homogeneous with regard to all variables except the one we think is a genuine cause. In that way, we are supposed to be able to do without actually knowing what all these other factors are.

If you succeed in performing an ideal randomization with different treatment groups and control groups, that is attainable. But it presupposes that you really have been able to establish, and not just assumed, that all other causes but the putative one (A) have the same probability distribution in the treatment and control groups, and that the probability of assignment to treatment or control groups is independent of all other possible causal variables.

Unfortunately, real experiments and real randomizations seldom or never achieve this. So, yes, we may do without knowing all causes, but it takes ideal experiments and ideal randomizations to do that, not real ones.
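
One narrow way to see part of the gap between ideal and real randomization is finite-sample imbalance. The sketch below is an illustration under assumed numbers, not a full account of the argument above: `U` stands for some unobserved cause present in half of all units, and the function names and sample sizes are hypothetical. Units are randomly split into equal-sized treatment and control groups, and we measure how far the share of U differs between the two groups.

```python
import random


def imbalance(n, rng):
    """Difference in the share of an unobserved cause U between the two groups."""
    # Each unit carries the unobserved cause U with probability 0.5 (assumed).
    u = [rng.random() < 0.5 for _ in range(n)]
    # Random assignment: shuffle the units, first half treated, rest control.
    rng.shuffle(u)
    half = n // 2
    treated, control = u[:half], u[half:]
    return sum(treated) / len(treated) - sum(control) / len(control)


def mean_abs_imbalance(n, trials=2000, seed=1):
    """Average absolute imbalance in U over many repeated randomizations."""
    rng = random.Random(seed)
    return sum(abs(imbalance(n, rng)) for _ in range(trials)) / trials


small = mean_abs_imbalance(20)      # a small experiment
large = mean_abs_imbalance(2000)    # a much larger one

print(f"mean |imbalance| in U, n=20:   {small:.3f}")
print(f"mean |imbalance| in U, n=2000: {large:.3f}")
```

Randomization balances U only on average over hypothetical repetitions. Any single experiment, especially a small one, can leave the groups noticeably different with respect to a cause nobody has measured.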

That means that in practice we do have to have sufficient background knowledge to deduce causal knowledge. Without old knowledge, we can’t get new knowledge, and — no causes in, no causes out.

Econometrics is basically a deductive method. Given the assumptions (such as manipulability, transitivity, Reichenbach probability principles, separability, additivity, linearity, etc., etc.) it delivers deductive inferences. The problem, of course, is that we will never completely know when the assumptions are right. Real target systems are seldom epistemically isomorphic to axiomatic-deductive models/systems, and even if they were, we still have to argue for the external validity of the conclusions reached from within these epistemically convenient models/systems. Causal evidence generated by statistical/econometric procedures may be valid in closed models, but what we usually are interested in is causal evidence in the real target system we happen to live in.

Advocates of econometrics want to have deductively automated answers to fundamental causal questions. But to apply ‘thin’ methods we have to have ‘thick’ background knowledge of what’s going on in the real world, and not in idealized models. Conclusions can only be as certain as their premises — and that also applies to the quest for causality in econometrics.

The central problem with the present ‘machine learning’ and ‘big data’ hype is that so many — falsely — think that they can get away with analysing real-world phenomena without any (commitment to) theory. But — data never speaks for itself. Without a prior statistical set-up, there actually are no data at all to process. And — using a machine learning algorithm will only produce what you are looking for.

Clever data-mining tricks are never enough to answer important scientific questions. Theory matters.

  1. August 10, 2018 at 8:21 pm

    Only when many lives are at stake combined with a large and seriously angry cohort at the palace gates do the institutional gate keepers consider these issues. For example, FDA approval for combination therapies for AIDS victims was essentially zero until the above criteria were met. To satisfy the FDA requirements before then was the biochemical equivalent of safe cracking. For therapy A plus B plus C, you had to show data that A does no harm and B does no harm and C does no harm and ABC in combination does no harm and ABC in combination does some good. The number of combinations approved was of course zero given the cost of data collection and the probability of success. So the result was research into combination therapies was devolved to the grass roots. And it was this grass roots effort that led to successful cocktail therapies that worked.

  2. Helen Sakho
    August 11, 2018 at 3:04 am

    I believe we did several times here talk about the origins of qualitative research being grounded in the sociology of the dying many decades ago, and why this was so. The problem now is far beyond consumer fetishism that turned into other fetishisms through marketing ploys a long time ago too. Cocktailed grassless (rootless and/or ruthless) platters of delicious foods are still on offer next to the starving. It is local and global, as is “glocal” food and fusion therapies exported within and without geographies and histories.
    One must repeat that theories that have no predictive powers whatsoever are not worth the paper they are written on.

    These are extraordinary times. So, let us add to the curriculum and mock exam questions that I suggested only a few days ago this one please:
    “Create a new formula that explains how Economics moved from previous kinds of fetishisms to death fetishism. Take any example of any part of the world. Your examples may be drawn from any part of the world. To reassure you further that examiners will not be biased in any way, you may draw examples from anywhere around the globe, including the US, Canada, Australia, China, Iran, Iraq, Turkey, Israel (any part), Pakistan, India, Bangladesh, Korea, Scotland, Ireland, England, Wales, Yemen, Africa (any part), Europe (any part), Afghanistan, Uzbekistan to name but a few. If you break down in the middle of the exam, rest assured that a competent nurse is at hand to assist your survival.”

  3. August 11, 2018 at 5:58 am

    Bravo, Lars Syll! Yes, theory matters. This is one of the propositions that I wanted you to pronounce. As you have explained, many econometricians deny or ignore this. Or perhaps they are simply ignorant. But many readers of this blog (including loyal readers of your blog posts) do not want to admit that theory matters. They not only distrust theory but also hold it in contempt. This is one of the feelings that you have been encouraging among your readers.

    Of course, those who distrust theories have a good reason to do so. The theory which seems so established in economics (i.e. neoclassical economics) is so irrelevant in all respects. That is the healthy side of theory distrust. But if this becomes contempt for all theories in economics, it denies all scientific and deep thinking. In the end, it nurtures those idle economists and amateurs who do not sincerely study economics. This prevents the necessary understanding of any kind of heterodox economics. They are only satisfied because they can criticize what they cannot understand. This is a very bad intellectual atmosphere that I observe among many readers and comment posters on the Real-World Economics Review Blog. (Of course, there are many others who are not drowned in this atmosphere.) Lars Syll is responsible (at least in part) for this state of affairs, as he is the leading blog writer on this site. I hope he will change his orientation to a more reasonable stance.

    • Risk Analyst
      August 11, 2018 at 5:08 pm

      I agree with what I think you are saying. Personally I think of the massive range of ideas in heterodox economics as kind of like a salad bar where you can pick and choose what you want, and you can choose different ideas on different days for different issues. I do not care about consistency, which is the lever that Lucas et al. used to help create this mess. I absolutely do not wish to discourage development of more approaches, but at the same time some seem to have the impression they need to wait until the perfect theory is created and then present it to the profession like some kind of a birthday cake. There is already more than enough to be far superior to the current economics, but my frustration is that there is no traction to displace it. Sometimes I think that may be because those with heterodox ideas are too interested in additionally advocating for their own political or other ideas instead of keeping their eye on the ball.

  4. August 12, 2018 at 1:31 am

    You are right to say that heterodox economics is a salad bar. It is full of tasty ideas, but they are not united in a coherent system.

    I also agree with you that many heterodox economists are more interested in political ideas and policies than in being sincere observers of what is happening in the economy and how it works.

    Although we are still in a state where we “need to wait until the perfect theory is created and then present it to the profession like some kind of a birthday cake,” in my opinion, the ingredients are already collected and the birthday cake is now being prepared.

    The birthday cake is a modern version (i.e. of the 21st century) of classical political economy. It stands basically on Ricardo’s cost-of-production theory of value with two major innovations from the 20th century: Sraffa’s reformulation of the classical theory of value and the Oxford Economists’ Research Group’s findings (the discovery of markup pricing). This theory of value was restricted to the domestic theory (i.e. the theory of prices in a closed country). But it was extended to include a theory of international values in the 21st century. Now the classical tradition is as powerful as the neoclassical general equilibrium theory and far more relevant in view of the human capabilities we assume in rationality, information gathering and economic actions.

    Hitherto heterodox economics lacked a theoretical core on which to base its analyses. Now the situation has changed.

    See two of my papers:
    (1) The revival of classical theory of values
    (2) The new theory of international values: an overview

  5. August 19, 2018 at 2:19 pm

    In all work activities the goal is to deal with one or more human areas of concern. This is not a search for truth. Or for elegant or mathematically correct theories. Or for theories that predict our future. Take, for example, dealing with the threats of pandemics. We can estimate the likelihood of several forms of pandemics based on known conditions from historical or current pandemics. We can also estimate the similarity of future pandemics to historical pandemics. We also have knowledge of which actions worked to stop historical pandemics, as well as the extent of the success of each such action. This is so far an inductive process. We draw on lots of experiences, direct and indirect, from history relating to pandemics and their remedies. Using this experience, those involved in a current pandemic prepare several narratives of the causes and possible remedies for the current pandemic. In discussion those involved decide on next steps, one of which is full application of one or more of the narratives. This may involve experimentation but is primarily based on tapping the hands-on experience of those involved in the current pandemic. A narrative, or in some instances several of the narratives, is put into practice. As they are applied, the success of each narrative is assessed (quantitatively or qualitatively). Narratives judged to be working continue. Narratives judged to be failing are ended. This outline varies a bit by area of concern. It is representative, however, in demonstrating the general form of problem solving in situations that are uncertain, involve many different types of actors, and have the potential for substantial and chaotic consequences.
