My question is about unnecessary predictors, i.e. variables that provide no new linear information because they are linear combinations of the other predictors. As you can see, the swiss dataset has six variables.
data(swiss)
names(swiss)
# "Fertility"   "Agriculture"  "Examination"  "Education"
# "Catholic"    "Infant.Mortality"
Now I introduce a new variable ec, defined as the sum of Examination and Catholic, so it is an exact linear combination of those two predictors.
ec <- swiss$Examination + swiss$Catholic
When we run a linear regression that includes such redundant variables, R drops terms that are linear combinations of other terms and returns NA as their coefficients. The command below illustrates the point:
lm(Fertility ~ . + ec, swiss)

Coefficients:
     (Intercept)      Agriculture      Examination        Education
         66.9152          -0.1721          -0.2580          -0.8709
        Catholic Infant.Mortality               ec
          0.1041           1.0770               NA
However, when we put ec first in the formula, followed by all of the other regressors,
lm(Fertility ~ ec + ., swiss)

Coefficients:
     (Intercept)               ec      Agriculture      Examination
         66.9152           0.1041          -0.1721          -0.3621
       Education         Catholic Infant.Mortality
         -0.8709               NA           1.0770
I would expect the coefficients of both Catholic and Examination to be NA, since ec is a linear combination of both of them. Yet in the end the coefficient of Examination is not NA, whereas that of Catholic is.
Could anyone explain the reason for this?