R Language: How do I print / see summary statistics for sample subset?

baha-kev · Jan 29, 2011

These are some newbie questions about statistical programming in R that I haven't been able to find answers to online. My data frame is named "eitc" in the code below.

1) Once I've loaded in a data frame, I would like to look at summary statistics. I've used the functions:

library(foreign) # provides read.dta for Stata files
eitc <- read.dta(file="/Users/Documents/eitc.dta")
summary(eitc)
sapply(eitc,mean,na.rm=TRUE) #for sample mean, min, max, etc.

How do I find summary statistics on my data frame when certain conditions are met? For example, I would like to see the summary statistics on all variables when the variable "children" is greater than or equal to 1. The equivalent Stata code is:

summarize if children >= 1

2) Similarly, how do I find specific parameters when certain conditions are met? For example, I want to find the mean of the variable "work" when the "post93" variable is equal to 0 and the "anykids" variable is equal to 1. The equivalent Stata code is:

mean work if post93==0 & anykids==1

3) Ideally, when I run the summary statistics above, I would like to find out how many observations were included in the calculation / fit the criteria.

4) When I read in my data frame, it would also be nice to see how many observations are included in the data set (and perhaps how many rows have missing values or "NA" in them).

5) Also, I have been creating dummy variables using the following code. Is this the correct way to do it or is there a more efficient route?

post93.dummy <- as.numeric(eitc$year>1993)
eitc=cbind(eitc,post93.dummy)

Answer

Michael Dunn · Jan 29, 2011

A lot of your requirements are answered by subset, e.g.

summary(subset(eitc, post93 == 0 & anykids == 1, select=work))
nrow(subset(eitc, post93 == 0 & anykids == 1, select=work)) # for number of obs.

The ?subset documentation has good examples.
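As a rough sketch of how subset could cover questions 1 to 4, the following uses a small made-up data frame in place of eitc.dta. Only the column names come from the question; the values and the helper names (with_kids, sel, and so on) are invented for illustration.

```r
# Toy stand-in for the real eitc data; the values are invented for illustration.
eitc <- data.frame(
  children = c(0, 1, 2, 0, 3, 1),
  anykids  = c(0, 1, 1, 0, 1, 1),
  post93   = c(0, 0, 1, 1, 0, 1),
  work     = c(1, 0, 1, NA, 1, 0)
)

# 1) Summary statistics for all variables where children >= 1
with_kids <- subset(eitc, children >= 1)
summary(with_kids)

# 2) Mean of work where post93 == 0 and anykids == 1
sel <- subset(eitc, post93 == 0 & anykids == 1)
mean_work <- mean(sel$work, na.rm = TRUE)

# 3) Number of rows that met each condition
n_with_kids <- nrow(with_kids)
n_sel <- nrow(sel)

# 4) Total rows, and rows containing at least one NA
n_total <- nrow(eitc)
n_incomplete <- sum(!complete.cases(eitc))
```

Note that nrow counts every row matching the condition, including rows where work is NA. If you want the number of non-missing values that actually went into the mean, sum(!is.na(sel$work)) is one option.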

The cbind method of attaching dummy variables is unnecessary. Just do:

eitc$post93.dummy <- as.numeric(eitc$year>1993)
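
To illustrate, here is a minimal sketch on a made-up year column, checking that the direct assignment yields the same dummy as the cbind route from the question. The year values and the names via_cbind and same are assumptions for the example only.

```r
# Toy data; the year values are invented for illustration.
eitc <- data.frame(year = c(1991, 1993, 1994, 1996))

# Direct assignment adds the column in one step.
eitc$post93.dummy <- as.numeric(eitc$year > 1993)

# The cbind route from the question, for comparison.
post93.dummy <- as.numeric(eitc$year > 1993)
via_cbind <- cbind(eitc["year"], post93.dummy)

# TRUE if both approaches produce the same column
same <- identical(eitc$post93.dummy, via_cbind$post93.dummy)
```

Both routes give the same values; the direct assignment simply avoids creating a separate vector in the workspace and copying the data frame through cbind.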