Why does summary overestimate the R-squared with a "no-intercept" model formula

Question 1

Why does summary overestimate the R-squared with a "no-intercept" model formula

r summary intercept lm

SESman · Dec 2, 2013 · Viewed 7.3k times · Source

Answer

Oh yeah, I fell into this trap too! Very good question!! It is because

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

and

$$SS_{tot} = \sum_i (y_i - \bar{y})^2$$

For a model with an intercept (your mylm1), y̅ is mean(y_i). This is what you expect, and it gives the SS_tot you want for a proper R².
For a model without an intercept (your mylm2), y̅ is taken to be 0, so SS_tot becomes very large and R² ends up very close to 1. SS_res also changes, because the fit is worse (it is a little higher without the intercept), but not by much.
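To see how much larger the uncentered total gets, the two versions of SS_tot can be related by a standard identity. This is only a sketch of the arithmetic, with n denoting the number of observations (a symbol not used above):

```latex
\sum_{i=1}^{n} (y_i - 0)^2
  = \sum_{i=1}^{n} (y_i - \bar{y})^2 + n\,\bar{y}^2
```

The extra term n·y̅² grows with the squared mean of y. In the question's simulation y has a mean of about 4 and a variance of about 2, so the centered sum is roughly 2n while the extra term is roughly 16n. The denominator of the ratio is therefore about nine times larger without the intercept, which pushes R² towards 1.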

Code:

y <- mydf$y  # observed response from the question's data frame (attach(mylm1) would not expose y)

y_fit <- mylm1$fitted.values
SSE <- sum((y_fit - y)^2)
SST <- sum((y - mean(y))^2)
1-SSE/SST  # R^2 with intercept

y_fit2 <- mylm2$fitted.values
SSE2 <- sum((y_fit2 - y)^2) # SSE2 only slightly higher than SSE..
SST2 <- sum((y - 0)^2)  # !!! the key difference is here !!!
1-SSE2/SST2 # R^2 without intercept

Note: It is not clear to me why y̅ is taken as 0 rather than mean(y_i) in the model without an intercept, but that is how it is. I found this out the hard way, by investigating and experimenting with the code above.
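The following is a minimal, self-contained sketch that checks the explanation against summary() on freshly simulated data. The names fit_int, fit_noint and r2_manual are hypothetical helpers introduced only for this illustration, and the seed is arbitrary. The check for an intercept uses attr(terms(fit), "intercept"), which is 1 when the formula includes an intercept and 0 otherwise.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x <- rnorm(1000, 3, 1)
y <- 1 + x + rnorm(1000, 0, 1)

fit_int   <- lm(y ~ x)            # model with intercept
fit_noint <- lm(y ~ x - 1)        # model without intercept

# Hypothetical helper: R^2 = 1 - SS_res / SS_tot, where SS_tot is centered
# at mean(y) only if the model formula contains an intercept.
r2_manual <- function(fit, y) {
  ss_res  <- sum(residuals(fit)^2)
  has_int <- attr(terms(fit), "intercept") == 1
  ss_tot  <- if (has_int) sum((y - mean(y))^2) else sum(y^2)
  1 - ss_res / ss_tot
}

# Both comparisons should agree with summary() up to floating-point error
print(all.equal(r2_manual(fit_int, y),   summary(fit_int)$r.squared))
print(all.equal(r2_manual(fit_noint, y), summary(fit_noint)$r.squared))

# R^2 of the no-intercept fit measured against the centered SS_tot,
# which puts it on the same scale as the intercept model's R^2
r2_centered <- 1 - sum(residuals(fit_noint)^2) / sum((y - mean(y))^2)
print(r2_centered)
```

The last value is the one to use when comparing the two fits directly, since both are then measured against the same baseline, the mean of y.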

Question 2

I wanted to fit a simple linear model (lm()) without an intercept coefficient, so I put -1 in my model formula, as in the following example. The problem is that the R-squared returned by summary(myModel) seems to be overestimated. lm(), summary() and -1 are among the most classic functions and features in R, so I am a bit surprised, and I wonder whether this is a bug or whether there is a reason for this behaviour.

Here is an example:

x <- rnorm(1000, 3, 1)
mydf <- data.frame(x=x, y=1+x+rnorm(1000, 0, 1))
plot(y ~ x, mydf, xlim=c(-2, 10), ylim=c(-2, 10))

mylm1 <- lm(y ~ x, mydf)
mylm2 <- lm(y ~ x - 1, mydf)

abline(mylm1, col="blue") ; abline(mylm2, col="red")
abline(h=0, lty=2) ; abline(v=0, lty=2)

r2.1 <- 1 - var(residuals(mylm1))/var(mydf$y)
r2.2 <- 1 - var(residuals(mylm2))/var(mydf$y)
r2 <- c(paste0("Intercept - r2: ", format(summary(mylm1)$r.squared, digits=4)),
        paste0("Intercept - manual r2: ", format(r2.1, digits=4)),
        paste0("No intercept - r2: ", format(summary(mylm2)$r.squared, digits=4)),
        paste0("No intercept - manual r2: ", format(r2.2, digits=4)))
legend('bottomright', legend=r2, col=c(4,4,2,2), lty=1, cex=0.6)

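[Figure: scatter plot of y against x with the two fitted lines (blue: with intercept, red: without intercept) and a legend listing the four R-squared values.]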
