(edited to include two categorical variables instead of one)
I have a hypothetical dataset with two categorical variables (4 mutually exclusive treatments and 5 mutually exclusive groups), a continuous predictor, and a response variable. Each treatment and each group is coded as a separate dummy variable that is either 0 or 1.
I am trying to fit a linear regression model with the intercept fixed at zero (y ~ 0 + treat1 + treat2 + treat3 + treat4 + group1 + group2 + group3 + group4 + group5 + contvar) using the lm() function in R. I want coefficient estimates for all four treatments, all five groups, and the continuous variable.
Here's a reproducible example of my issue:
set.seed(123) # added

# Sample size
n <- 100

# Generate predictors: treatments 1-4 and groups 1-5, each set mutually exclusive, plus a continuous variable
data <- data.frame(
  treat1 = c(rep(1, 25), rep(0, 75)),
  treat2 = c(rep(0, 25), rep(1, 25), rep(0, 50)),
  treat3 = c(rep(0, 50), rep(1, 25), rep(0, 25)),
  treat4 = c(rep(0, 75), rep(1, 25)),
  # edit
  group1 = rep(c(rep(1, 5), rep(0, 20)), 4),
  group2 = rep(c(rep(0, 5), rep(1, 5), rep(0, 15)), 4),
  group3 = rep(c(rep(0, 10), rep(1, 5), rep(0, 10)), 4),
  group4 = rep(c(rep(0, 15), rep(1, 5), rep(0, 5)), 4),
  group5 = rep(c(rep(0, 20), rep(1, 5)), 4),
  contvar = sample(0:100, n) / 100
)

# Define means for each treatment
mean1 <- rnorm(100, -0.5, 0.1)
mean2 <- rnorm(100, -0.2, 0.1)
mean3 <- rnorm(100, 0.2, 0.1)
mean4 <- rnorm(100, 0.5, 0.1)

# Define means for each group
meangr1 <- rnorm(100, -1, 0.2)
meangr2 <- rnorm(100, -0.1, 0.1)
meangr3 <- rnorm(100, 0, 0.1)
meangr4 <- rnorm(100, 0.1, 0.1)
meangr5 <- rnorm(100, 1, 0.2)

# Generate the response variable y from the treatment means, the group means and the continuous variable
data$y <- mean1 * data$treat1 + mean2 * data$treat2 + mean3 * data$treat3 + mean4 * data$treat4
data$y <- data$y * data$contvar
data$y <- data$y + meangr1 * data$group1 + meangr2 * data$group2 + meangr3 * data$group3 +
  meangr4 * data$group4 + meangr5 * data$group5

# Fit a no-intercept model
model0 <- lm(y ~ 0 + treat1 + treat2 + treat3 + treat4 +
               group1 + group2 + group3 + group4 + group5 + contvar,
             data = data)

# Summarize the no-intercept model
summary(model0)
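As a quick sanity check on the coding (base R only, run on the data frame above): every row has exactly one treatment dummy and exactly one group dummy equal to 1.

# Confirm the dummies are mutually exclusive within each categorical variable
all(rowSums(data[, paste0("treat", 1:4)]) == 1) # TRUE
all(rowSums(data[, paste0("group", 1:5)]) == 1) # TRUE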
The summary of the no-intercept model reads:
Call:
lm(formula = y ~ 0 + treat1 + treat2 + treat3 + treat4 + group1 +
    group2 + group3 + group4 + group5 + contvar, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.54069 -0.11705 -0.00558  0.11709  0.56169

Coefficients: (1 not defined because of singularities)
        Estimate Std. Error t value Pr(>|t|)
treat1   0.67980    0.07089   9.590 1.84e-15 ***
treat2   0.96646    0.06949  13.908  < 2e-16 ***
treat3   1.14318    0.07113  16.071  < 2e-16 ***
treat4   1.22735    0.06751  18.180  < 2e-16 ***
group1  -2.01289    0.06154 -32.710  < 2e-16 ***
group2  -1.11811    0.06144 -18.200  < 2e-16 ***
group3  -1.07177    0.06311 -16.983  < 2e-16 ***
group4  -0.89593    0.06313 -14.193  < 2e-16 ***
group5        NA         NA      NA       NA
contvar -0.01132    0.06987  -0.162    0.872
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1938 on 91 degrees of freedom
Multiple R-squared:  0.9301,    Adjusted R-squared:  0.9232
F-statistic: 134.6 on 9 and 91 DF,  p-value: < 2.2e-16
There is a coefficient for every treatment but not for every group. Why is this the case? I understand that getting coefficient estimates for all treatments and all groups would not work in a model with an intercept, where one category of each variable is absorbed into the reference level, but with a zero-intercept model I expected it to work.
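One way I tried to look into this (base R only, though I am not sure how to interpret the result) is to inspect the design matrix behind the fit:

# Inspect the design matrix that lm() actually uses for model0
X <- model.matrix(model0)
ncol(X)       # 10 columns (4 treatments + 5 groups + contvar)
qr(X)$rank    # 9: one column is a linear combination of the others
alias(model0) # reports which term is aliased (here, group5)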
Just to illustrate, this is what happens if I fit the model with an intercept (the default):
Call:
lm(formula = y ~ treat1 + treat2 + treat3 + treat4 + group1 +
    group2 + group3 + group4 + group5 + contvar, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.54069 -0.11705 -0.00558  0.11709  0.56169

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.22735    0.06751  18.180  < 2e-16 ***
treat1      -0.54755    0.05509  -9.940 3.41e-16 ***
treat2      -0.26089    0.05491  -4.751 7.51e-06 ***
treat3      -0.08417    0.05513  -1.527    0.130
treat4            NA         NA      NA       NA
group1      -2.01289    0.06154 -32.710  < 2e-16 ***
group2      -1.11811    0.06144 -18.200  < 2e-16 ***
group3      -1.07177    0.06311 -16.983  < 2e-16 ***
group4      -0.89593    0.06313 -14.193  < 2e-16 ***
group5            NA         NA      NA       NA
contvar     -0.01132    0.06987  -0.162    0.872
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1938 on 91 degrees of freedom
Multiple R-squared:  0.9301,    Adjusted R-squared:  0.9239
F-statistic: 151.3 on 8 and 91 DF,  p-value: < 2.2e-16
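Note that the residual summaries of the two fits are identical, which suggests they produce the same predictions. Calling the with-intercept fit model1 (my naming; it is not named in the output above), this can be checked with:

model1 <- lm(y ~ treat1 + treat2 + treat3 + treat4 +
               group1 + group2 + group3 + group4 + group5 + contvar,
             data = data)
all.equal(fitted(model0), fitted(model1)) # TRUE: identical fitted values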
However, if I fit the linear model "manually", by writing a no-intercept model function and then minimising the residual sum of squares with optim, I do get coefficients for all treatments, all groups, and the continuous variable:
lm_manual <- function(b){
  b1 <- b[1]; b2 <- b[2]; b3 <- b[3]; b4 <- b[4]
  bgr1 <- b[5]; bgr2 <- b[6]; bgr3 <- b[7]; bgr4 <- b[8]; bgr5 <- b[9]
  bcont <- b[10]
  # Predicted values from the no-intercept linear model
  y <- b1*data$treat1 + b2*data$treat2 + b3*data$treat3 + b4*data$treat4 +
    bgr1*data$group1 + bgr2*data$group2 + bgr3*data$group3 +
    bgr4*data$group4 + bgr5*data$group5 +
    bcont*data$contvar
  # Residual sum of squares to be minimised
  sum((y - data$y)^2)
}

out <- optim(par = c(-0.5, -0.2, 0.2, 0.5, -1, -0.1, 0, 0.1, 1, 0.01), fn = lm_manual)
out$par # coefficients for treatments 1-4, groups 1-5 and the continuous variable
The output reads:

-0.42869647 -0.13663451  0.03372066  0.13213677 -1.64310116  0.61429850  0.51688837 -0.10915319  1.14837076  0.15682632

These estimates are what I expected to get when fitting the lm zero-intercept model. In the lm zero-intercept model, both the values of the coefficients and the NA for group5 were unexpected.
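For completeness, the two approaches can be compared on the criterion they both minimise, the residual sum of squares (using model0 and out from above):

# Compare the minimised residual sum of squares of the two fits
sum(residuals(model0)^2) # SSR from lm
out$value                # SSR reached by optim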
How can I get such coefficient estimates using lm? Or, if that is not possible: why does this only work when fitting the model "manually" with optim and not when fitting with lm?