omg! a post! but really.... a rant about independence.
Lately I've been frustrated by attempting to rewrite the regression coefficients of a linear model in terms of the correlation coefficients between my explanatory variables. I was hoping, rather, to find some resource somewhere with the formulae I require. Everything I've found, au contraire, in textbooks, lecture notes, the intarwebs, etc., has fallen terribly short. For now I'll concentrate on one little problem that confused me greatly for a few days of otherwise-blissful vacation.
In the statement of multiple linear regression, there are three different uses of the word independence:
1. Linear regression literature will speak of the independent variables, usually denoted X_ij for all j, which are also called the explanatory variables, the regressors, or the predictors. Regression likes to call them "independent" variables simply to contrast them with the "dependent" variable, y_i, also called the response variable, the regressand, etc.
2. A fundamental assumption of most linear regression analysis is that each sampled data point is statistically independent of the others. That is, the random error terms (epsilon_i) are statistically independent of each other. This says nothing of the explanatory variables (but does help to describe the response variable y_i).
3. In multiple linear regression (where we have multiple explanatory variables, that is, X_ij for a range of j), the "independent" explanatory variables must be linearly independent of each other. If this is not true then there is the problem of (perfect) "multicollinearity", and the regression coefficients are not uniquely specified.
In particular, one can quite well perform a linear regression when some explanatory variables are statistically correlated, just not when they are perfectly correlated (a correlation coefficient of +/- 1). Perfect correlation means one variable is an exact linear function of the other, so (together with the intercept term) they are linearly dependent, and we violate rule 3.
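For the skeptical, here is a minimal sketch of that distinction in Python with NumPy. Everything in it is made up for illustration: the simulated data, the names `x2_corr` and `x2_perfect`, and the helper `design` are my own assumptions, not anything from a textbook. It fits a model with correlated-but-not-perfectly-correlated regressors, then shows that a perfectly correlated pair makes the design matrix rank-deficient, so two different coefficient vectors give identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two explanatory variables that are correlated, but not perfectly.
x1 = rng.normal(size=n)
x2_corr = 0.8 * x1 + 0.6 * rng.normal(size=n)

# A second variable that is an exact linear function of x1 (correlation +1).
x2_perfect = 2.0 * x1 + 1.0


def design(a, b):
    """Hypothetical helper: design matrix with an intercept column."""
    return np.column_stack([np.ones_like(a), a, b])


X_ok = design(x1, x2_corr)
X_bad = design(x1, x2_perfect)

# Full column rank (3) means the coefficients are uniquely determined.
rank_ok = np.linalg.matrix_rank(X_ok)
# Rank 2 < 3 columns: the columns are linearly dependent.
rank_bad = np.linalg.matrix_rank(X_bad)

# Regression with correlated regressors works fine.
y = 1.0 + 2.0 * x1 - 1.0 * x2_corr + rng.normal(scale=0.1, size=n)
beta_ok, *_ = np.linalg.lstsq(X_ok, y, rcond=None)

# With perfect correlation, x2_perfect = 1*intercept + 2*x1, so adding any
# multiple of (1, 2, -1) to the coefficients leaves the fitted values unchanged.
beta_a = np.array([1.0, 2.0, 0.0])
beta_b = beta_a + 3.0 * np.array([1.0, 2.0, -1.0])
same_fit = np.allclose(X_bad @ beta_a, X_bad @ beta_b)

print("rank with correlated regressors:", rank_ok)
print("rank with perfectly correlated regressors:", rank_bad)
print("estimated coefficients:", beta_ok)
print("two different coefficient vectors, same fit:", same_fit)
```

Note that `np.linalg.lstsq` will still return an answer for the rank-deficient case (the minimum-norm solution); it is just one of infinitely many equally good ones, which is exactly the "not uniquely specified" problem.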
What boggles my mind is that after digging through dozens of references, I have not found any resource that mentions these multiple inconsistent uses of the same word in the same setting! At least Wikipedia actually states that it means linear independence for rule 3, on its regression analysis page. But some texts may as well have written: "The independent variables may be non-independent so long as they are independent", although that might have answered my question if I knew how to interpret it.
And in case you were worried, there are of course ways to continue an analysis if the assumptions of 2 and/or 3 are not met, although these methods may be undesirable. Mwaha.