## What are the key assumptions of linear regression models?

from: **Lars Syll**

In Andrew Gelman’s and Jennifer Hill’s statistics book *Data Analysis Using Regression and Multilevel/Hierarchical Models*, the authors list the assumptions of the linear regression model. The assumptions — *in decreasing order of importance* — are:

1. Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .

2. Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .

3. Independence of errors. . . .

4. Equal variance of errors. . . .

5. Normality of errors. . . .

Further assumptions are necessary if a regression coefficient is to be given a causal interpretation . . .

Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points.

Yours truly can’t but concur (having touched upon this before here), especially on the “decreasing order of importance” of the assumptions. But then, of course, one really has to wonder why econometrics textbooks — almost invariably — turn this order of importance upside down and don’t discuss more thoroughly the overriding importance of Gelman and Hill’s first two points …

Of course these linear regression assumptions are rarely, if ever, satisfied in applied econometrics, and they are never mentioned, because model error analysis would inconveniently show statistical insignificance and destroy the worth of many papers: valid conclusions cannot be drawn from the datasets.

So without even mentioning the possibility of model errors, the vast majority of papers simply assert that t-values > 3 and an R-squared greater than about 50 percent indicate statistical significance, etc. As those assumptions are almost never met, R-squared has to exceed 90 percent (i.e. model and actual values have to be very highly correlated) to indicate statistical significance or a sufficiently high signal-to-noise ratio. That level of correlation is almost never found in the economic datasets used in published papers.
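For concreteness, here is a minimal sketch (Python with NumPy, on invented simulated data, not from any actual paper) of how the two statistics in question are computed, and of why a t-value above 3 by itself says little about signal-to-noise: with enough observations, even a weak signal yields a large t-value while R-squared stays low.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: a genuine but weak linear signal in noise
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)

# Ordinary least squares via the design matrix [1, x]
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# t-value of the slope and R-squared, computed the standard way
dof = n - 2
sigma2 = resid @ resid / dof
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_slope = beta[1] / se_slope
r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(f"t-value of slope: {t_slope:.1f}")   # comfortably above 3
print(f"R-squared: {r2:.2f}")               # yet far below 50 percent
```

The point of the sketch is only that “t > 3” and “model tracks the data closely” are different claims, which is exactly the distinction the paragraph above draws.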

Bernanke’s expert status on “The Great Depression” rests mainly on his book of that title, which contains many linear regression analyses using sparse datasets from the depression years in several countries. His expertise has been put to the test over the past several years, and he has been claiming victory on the basis of data which the US government manipulates.

This is why economics is a pretend science. The econometric models used by governments do not justify the excessive confidence placed in their extreme policies.

By the way, I should add that the nature and distribution of model errors often reflect the failure of the first two assumptions (the point of this post) to be satisfied. For example, nonlinear dependence on a variable can show up as a changing asymmetry or lopsidedness in the spread of model errors as the actual or fitted values change. Such nonlinear dependence would be evident if error analysis were undertaken, which appears to be rare.
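A minimal sketch of such an error analysis, on invented data where the truth contains a quadratic term that the linear model omits: under a correctly specified linear model the residuals should be patternless, whereas here they visibly track the omitted nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: the truth has a quadratic term the linear model omits
n = 300
x = rng.uniform(-2, 2, size=n)
y = x + 0.5 * x**2 + rng.normal(scale=0.3, size=n)

# Fit the (misspecified) linear model
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Error analysis: the residuals correlate strongly with the omitted
# quadratic term, revealing the misspecification
curv = np.corrcoef(resid, x**2)[0, 1]
print(f"correlation of residuals with x^2: {curv:.2f}")
```

A residual-versus-fitted plot would show the same thing graphically; the correlation is just the cheapest numerical summary of it.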

Also, linear regression econometric models provide no guide to what happens when variables are changed by large amounts. For example, suppose a move in an official interest rate from 6 percent to 5.8 percent is associated with a 0.2 percent increase in production. It does not follow that at a 3 percent official rate, production would increase by 3 percent. At a 3 percent rate, production might even contract, or do anything else unpredictable, because linear models are valid only for limited perturbations of the variables.
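The extrapolation failure is easy to exhibit. In the sketch below the “true” response of production to the policy rate is a purely invented nonlinear curve; a line fitted only to observations near 6 percent predicts well there and fails badly at 3 percent.

```python
import numpy as np

rng = np.random.default_rng(2)

# A purely invented response of production growth to the policy rate:
# roughly linear near 6 percent, turning over at lower rates.
def true_response(rate):
    return 1.0 - 0.125 * (rate - 4.0) ** 2

# Fit a line using only observations in a narrow band around 6 percent
rates = np.linspace(5.8, 6.2, 30)
growth = true_response(rates) + rng.normal(scale=0.01, size=rates.size)
slope, intercept = np.polyfit(rates, growth, 1)

# Interpolation near the data is fine; extrapolating to 3 percent is not
err_6 = abs(slope * 6.0 + intercept - true_response(6.0))
err_3 = abs(slope * 3.0 + intercept - true_response(3.0))
print(f"error at 6%: {err_6:.3f}")   # small
print(f"error at 3%: {err_3:.3f}")   # large: the linear fit breaks down
```

Nothing hangs on the particular curve chosen; any curvature outside the fitted band produces the same qualitative failure.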

This was exactly what happened in the GFC, where credit-agency and bank internal risk models were all based on “reduced form” (linear regression) models, which predicted that credit defaults would increase linearly in proportion to the amount of mortgage loans. Of course the reality is nonlinear: there were tipping points beyond which credit defaults increased nonlinearly with the amount of credit. We suffer seriously from pseudo-science.

A question from somebody not at all involved in economics: is there any use of non-parametric methods in econometric models at all, or do they by default rely on regressions, even in the presence of sparse data and inappropriate linearity constraints?

If you make a lot of assumptions, you can draw a lot of conclusions, but they are likely to be wrong: the more assumptions you make, the more likely you are to make false ones. The power gained from strong assumptions (e.g. a normal distribution) is offset by an increased likelihood of false conclusions. This is the path of most econometrics.

Non-parametric methods make few assumptions (e.g. about distributions) and can therefore draw only few conclusions. But the conclusions drawn from non-parametric methods are much more likely to be correct. From a scientific point of view, non-parametric methods should be preferred for presenting economic data, in view of the limitations of the data.

Non-parametric methods are less popular with economists because they do not allow economists to make grandiose claims about their ability to forecast the future. An honest economic science should start off with non-parametric methods.
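To illustrate the trade-off on invented data: Spearman’s rank correlation, a standard non-parametric statistic, assumes only monotonicity, while Pearson’s correlation implicitly assumes linearity. On a monotone but strongly nonlinear relationship, the weaker assumption gives the more faithful answer.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented data: a monotone but strongly nonlinear relationship
n = 200
x = rng.uniform(0, 1, size=n)
y = np.exp(8 * x) + rng.normal(scale=0.5, size=n)

def rank(a):
    # ranks 0..n-1 (ties ignored; fine for continuous draws)
    out = np.empty(a.size)
    out[np.argsort(a)] = np.arange(a.size)
    return out

# Pearson assumes linearity; Spearman only assumes monotonicity
pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(rank(x), rank(y))[0, 1]
print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

The rank statistic says only “y increases with x”, a modest claim, but one the data actually support; the linear statistic understates the association because the relationship is not a line.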

I must disagree with the list of assumptions in Gelman and Hill’s book. Linear regression models need not make all the assumptions listed above.

The most obvious needless assumption is Assumption 3, “Independence of errors”. This is needed ONLY if a regression coefficient is to be given a causal interpretation. Otherwise, error independence is achieved automatically in linear systems, since the regression coefficient is fitted so as to satisfy this independence. (The regression coefficient is the slope of E[Y|X=x].)

The less obvious needlessness lies in Assumption 2, “Additivity and linearity”. Linearity is needed only if we insist on equating the regression coefficient with the slope of E[Y|X=x]. However, if we merely wish to find the best (in MSE) linear predictor of Y given the observation X=x, then regression analysis will give us what we ask for, even if E[Y|X=x] is not a linear function of x.

In conclusion: the assumptions of the linear regression model vary with what we expect to do with the result. No assumptions at all are needed for optimal predictions.
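The “achieved automatically” point can be checked numerically: the OLS normal equations force the residuals to be uncorrelated with the regressor (strictly, orthogonal rather than independent), even when the true conditional mean is deliberately nonlinear. A minimal sketch on invented data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented data whose conditional mean E[Y|X=x] is deliberately nonlinear
n = 500
x = rng.normal(size=n)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=n)

# OLS fit of the (misspecified) linear model
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# The fit itself forces residual-regressor orthogonality; this is a
# property of least squares, not an assumption about nature
orth = abs(np.mean(resid * x))
print(f"|mean(resid * x)| = {orth:.2e}")
```

Whatever one makes of the argument, the orthogonality is a mechanical consequence of minimising squared error, which is why it cannot by itself license any causal reading of the coefficient.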

Linear regression can be applied to almost any data. But the statistics are meaningless unless certain assumptions are met.

Causality cannot be attributed to linear regression models, which show only correlations. Independence of errors is not guaranteed, and it matters: if the regressors are highly correlated (multicollinearity), the estimation becomes mathematically unstable.
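The instability attributed to multicollinearity is real and easy to demonstrate on invented data: with nearly collinear regressors the individual coefficients are ill-determined (the condition number of the normal equations explodes), even though the well-identified combination of them is estimated fine.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented data with two nearly collinear predictors
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1
y = x1 + x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
cond = np.linalg.cond(X.T @ X)   # explodes as the predictors become collinear

# Individual coefficients wobble across resamples, but their sum
# (the well-identified combination) stays close to the true value of 2
idx = rng.integers(0, n, n)
beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
print(f"condition number: {cond:.0f}")
print(f"b1 + b2 = {beta[1] + beta[2]:.2f}")
```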

If the relationship is inherently nonlinear (say, quadratic), then a linear regression is rather meaningless. Try fitting a straight line to y = x^2: what does a “best linear predictor” prove?
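That rhetorical question has a neat numerical answer: on an interval symmetric about zero, the best linear predictor of y = x² is a flat line at the mean of y. The fit is “optimal” in the MSE sense and yet says nothing about the shape of the relationship.

```python
import numpy as np

# y = x^2 on a symmetric interval: the best linear predictor is flat
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

slope, intercept = np.polyfit(x, y, 1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
# slope is (numerically) zero; intercept is just the mean of y
```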

Economists have been drawing invalid conclusions as if no assumptions were needed. A typical example is using t-values greater than three as an indication of statistical significance without checking the relevant assumptions.

Judea, what do you mean by “No assumptions at all are needed for optimal predictions?”

Suppose that we have lots of economic data for a small island banana republic. Linear regression allows us to extrapolate optimally (in some mathematical sense). But how do we turn an extrapolation into a prediction? Or by ‘prediction’ do we implicitly make certain standard assumptions, such as no tsunami, no crop failure, no riots, no dictator, no invasion and … no financial crisis?

In 2006/7, were the banks making ‘optimal predictions’? Did politicians understand this? It seemed to me at the time that key decision-makers were getting confused by the language, and I am not convinced that it has been clarified since.