I have been trying to figure out how the subset argument in R's lm() function works. In particular, the following code seems dubious to me:
data(mtcars)
summary(lm(mpg ~ wt, data=mtcars))
summary(lm(mpg ~ wt, cyl, data=mtcars))
In both cases the regression has 32 observations:
dim(lm(mpg ~ wt, cyl ,data=mtcars)$model)
[1] 32 2
dim(lm(mpg ~ wt ,data=mtcars)$model)
[1] 32 2
yet the coefficients change (along with the R²). The help doesn't provide too much information on this matter:
subset: an optional vector specifying a subset of observations to be used in the fitting process
As a general principle, vectors used in subsetting can be either logical (a TRUE or FALSE for every element) or numeric (the positions of the elements to keep). With a numeric vector, R includes an element once for every time its index appears, which is handy for sampling with replacement.
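To make the difference concrete, here is a minimal sketch on a toy vector (the names x, by_logical and by_index are made up for illustration):

```r
x <- c(10, 20, 30, 40)

# Logical subsetting: keep the positions marked TRUE
by_logical <- x[c(TRUE, FALSE, TRUE, FALSE)]
by_logical  # 10 30

# Numeric subsetting: indices are positions, and repeats are allowed
by_index <- x[c(2, 2, 4)]
by_index    # 20 20 40
```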
Let's take a look at cyl:
> mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Because cyl is numeric, its values are treated as row indices. So you're getting a data.frame with the same number of rows, but it's made up of row 6, row 6, row 4, row 6, etc.
You can see this if you do the subsetting yourself:
> head(mtcars[mtcars$cyl,])
mpg cyl disp hp drat wt qsec vs am gear carb
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Valiant.1 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant.2 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Valiant.3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
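As a quick check, the following sketch compares the fit from the question with a fit on the manually subsetted data (fit_subset and fit_manual are just illustrative names). If the explanation above is right, the coefficients should match, and only three distinct rows (4, 6 and 8) should ever enter the model:

```r
data(mtcars)

# Fit with cyl passed as the subset argument, as in the question
fit_subset <- lm(mpg ~ wt, data = mtcars, subset = cyl)

# Fit on the data frame indexed by the values of cyl
fit_manual <- lm(mpg ~ wt, data = mtcars[mtcars$cyl, ])

# The two fits should agree
all.equal(coef(fit_subset), coef(fit_manual))

# Still 32 rows in the model frame, but only rows 4, 6 and 8 are used
nrow(fit_subset$model)
table(mtcars$cyl)  # how many times each of those rows is repeated
```

That also explains why the coefficients and R² differ from the full-data fit: the regression is effectively run on three cars, each repeated many times.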
Did you mean to do something like this?
summary(lm(mpg ~ wt, cyl==6, data=mtcars))
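With a logical condition the model frame really does shrink. A small sketch, assuming the intent was to fit only the six-cylinder cars (fit6 is an illustrative name), with the argument named explicitly so it is harder to misread:

```r
data(mtcars)

# Logical subset: keep only the rows where cyl equals 6
fit6 <- lm(mpg ~ wt, data = mtcars, subset = cyl == 6)

# Number of observations actually used in the fit
nrow(fit6$model)       # 7
sum(mtcars$cyl == 6)   # 7, the same count taken directly from the data
```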