I'm running a regression on census data where my dependent variable is life expectancy and I have eight independent variables. The data is aggregated by city, so I have many thousands of observations.
My model is somewhat heteroscedastic, though. I want to run a weighted least-squares regression where each observation is weighted by the city’s population. In this case, it would mean that I want to weight the observations by the inverse of the square root of the population. It’s unclear to me, however, what the best syntax would be. Currently, I have:
Model=lm(…,weights=(1/population))
Is that correct? Or should it be:
Model=lm(…,weights=(1/sqrt(population)))
(I found this question here: Weighted Least Squares - R but it does not clarify how R interprets the weights argument.)
From ?lm: "weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used." R doesn't do any further interpretation of the weights argument.
So, if what you want to minimize is the sum of (the squared distance from each point to the fit line * 1/sqrt(population)), then you want ...weights=(1/sqrt(population)). If you want to minimize the sum of (the squared distance from each point to the fit line * 1/population), then you want ...weights=1/population.
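If you want to convince yourself that lm really minimizes sum(w*e^2), here is a minimal sketch on simulated data. The variable names (population, x, y) and the data itself are made up for illustration and are not your census data. It fits the model once with the weights argument and once as ordinary least squares on data rescaled by sqrt(w), which minimizes the same quantity, so the two sets of coefficients should match.

```r
# Toy data, purely illustrative: one predictor instead of eight
set.seed(1)
n <- 200
population <- round(runif(n, 1e3, 1e6))
x <- rnorm(n)
y <- 70 + 2 * x + rnorm(n)

# The weights you pass are used as-is; change this line to try another choice
w <- 1 / population

# Weighted fit using the weights argument
fit_w <- lm(y ~ x, weights = w)

# The same fit by hand: OLS after multiplying every column by sqrt(w).
# The intercept column (all ones) becomes sqrt(w), so the default intercept is dropped with 0.
fit_manual <- lm(I(sqrt(w) * y) ~ 0 + sqrt(w) + I(sqrt(w) * x))

# Compare the coefficient estimates (intercept, slope)
same_fit <- all.equal(unname(coef(fit_w)), unname(coef(fit_manual)))
print(same_fit)

# The quantity being minimized: sum of w * e^2
print(sum(w * residuals(fit_w)^2))
```

Changing the definition of w to 1/sqrt(population) changes what is minimized in exactly the same way, which is why the choice between the two is a statistical question rather than a syntax one.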
As to which of those is most appropriate... that's a question for CrossValidated!