I want to run `lm()` on a large dataset with 50M+ observations and 2 predictors. The analysis runs on a remote server with only 10 GB available for storing the data. I have tested `lm()` on 10K observations sampled from the data, and the resulting object was over 2 GB in size.
I need the object of class `"lm"` returned from `lm()` ONLY to produce the summary statistics of the model (`summary(lm_object)`) and to make predictions (`predict(lm_object)`).
I have experimented with the `model`, `x`, `y`, and `qr` arguments of `lm()`. Setting them all to `FALSE` reduces the object size by about 38%:
```r
library(MASS)

fit1 <- lm(medv ~ lstat, data = Boston)
size1 <- object.size(fit1)
print(size1, units = "Kb")
# 127.4 Kb

fit2 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = FALSE)
size2 <- object.size(fit2)
print(size2, units = "Kb")
# 78.5 Kb

- ((as.integer(size1) - as.integer(size2)) / as.integer(size1)) * 100
# -38.37994
```
but then both `summary()` and `predict()` fail:

```r
summary(fit2)
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
#  Rank zero or should not have used lm(.., qr=FALSE).

predict(fit2, data = Boston)
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
#  Rank zero or should not have used lm(.., qr=FALSE).
```
Apparently I need to keep `qr = TRUE`, which reduces the object size by only 9% compared with the default object:
```r
fit3 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = TRUE)
size3 <- object.size(fit3)
print(size3, units = "Kb")
# 115.8 Kb

- ((as.integer(size1) - as.integer(size3)) / as.integer(size1)) * 100
# -9.142752
```
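For reference, the remaining bulk can be located by measuring each component of the stripped fit individually with `object.size()` (a quick diagnostic sketch using only base R and the `Boston` data from above; component names are those of a standard `"lm"` object):

```r
library(MASS)

fit3 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = TRUE)

# Size of every component of the "lm" object, in bytes
component_sizes <- sapply(fit3, function(component) as.integer(object.size(component)))

# Largest components first: shows what is left to trim once model/x/y are dropped
sort(component_sizes, decreasing = TRUE)
```

With `model`, `x`, and `y` already suppressed, the length-n pieces that remain (such as `qr`, `residuals`, `fitted.values`, and `effects`) are the ones that will scale with the 50M observations.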
How do I bring the size of the `"lm"` object down to a minimum without keeping a lot of unneeded information in memory and storage?