We're teaching a stats class for biology students and trying to use R as the computing and data visualization platform. As much as possible, we'd like to avoid using extra packages and doing anything terribly "fancy" in R; the focus of the course is on the statistics, not the programming. Nevertheless, we haven't found a very good way of generating an errorbar plot in R for a two factor ANOVA design. We're using the ggplot2 package to make the plot, and while it does have a built-in stat_summary method of generating 95% CI errorbars, the way these are calculated may not always be the right way . Below, I go through the code for the ANOVA by hand and calculate the 95% CIs by hand also (with standard error estimated from the total residual variance, not just the within-group variance ggplot's summary method would use). At the end, there's actually a plot.
So the question is... is there an easier/faster/simpler way to do all of this?
# LIZARD LENGTH DATA
island.1 <- c(0.2, 5.9, 6.1, 6.5)
island.2 <- c(5.6, 14.8, 15.5, 16.4)
island.3 <- c(0.8, 3.9, 4.3, 4.9)
sex.codes <- c("Male", "Female", "Male", "Female")
# PUTTING DATA TOGETHER IN A DATA FRAME
df.1 <- data.frame(island.1, island.2, island.3, sex.codes)
# MELTING THE DATA FRAME INTO LONG FORM
library(reshape)
df.2 <- melt(df.1)
# MEAN BY CELL
mean.island1.male <- with(df.2, mean(value[variable == "island.1" & sex.codes == "Male"]))
mean.island1.female <- with(df.2, mean(value[variable == "island.1" & sex.codes == "Female"]))
mean.island2.male <- with(df.2, mean(value[variable == "island.2" & sex.codes == "Male"]))
mean.island2.female <- with(df.2, mean(value[variable == "island.2" & sex.codes == "Female"]))
mean.island3.male <- with(df.2, mean(value[variable == "island.3" & sex.codes == "Male"]))
mean.island3.female <- with(df.2, mean(value[variable == "island.3" & sex.codes == "Female"]))
# ADDING CELL MEANS TO DATA FRAME
df.2$means[df.2$variable == "island.1" & df.2$sex.codes == "Male"] <- mean.island1.male
df.2$means[df.2$variable == "island.1" & df.2$sex.codes == "Female"] <- mean.island1.female
df.2$means[df.2$variable == "island.2" & df.2$sex.codes == "Male"] <- mean.island2.male
df.2$means[df.2$variable == "island.2" & df.2$sex.codes == "Female"] <- mean.island2.female
df.2$means[df.2$variable == "island.3" & df.2$sex.codes == "Male"] <- mean.island3.male
df.2$means[df.2$variable == "island.3" & df.2$sex.codes == "Female"] <- mean.island3.female
# LINEAR MODEL
lizard.model <- lm(value ~ variable*sex.codes, data=df.2)
# CALCULATING RESIDUALS BY HAND:
df.2$residuals.1 <- df.2$value - df.2$means
# CONFIRMING RESIDUALS FROM LINEAR MODEL:
df.2$residuals.2 <- residuals(lizard.model)
# TWO FACTOR MAIN EFFECT ANOVA
lizard.anova <- anova(lizard.model)
# INTERACTION PLOT
interaction.plot(df.2$variable, df.2$sex.codes, df.2$value)
# SAMPLE SIZE IN EACH CELL
n <- length(df.2$value[df.2$variable == "island.1" & df.2$sex.codes == "Male"])
# > n
# [1] 2
# NOTE: JUST FOR CLARITY, PRETEND n=10
n <- 10
# CALCULATING STANDARD ERROR
island.se <- sqrt(lizard.anova$M[4]/n)
# HALF CONFIDENCE INTERVAL
island.ci.half <- qt(0.95, lizard.anova$D[4]) * island.se
# MAKING SUMMARY DATA FRAME
summary.df <- data.frame(
Means = c(mean.island1.male,
mean.island1.female,
mean.island2.male,
mean.island2.female,
mean.island3.male,
mean.island3.female),
Location = c("island1",
"island1",
"island2",
"island2",
"island3",
"island3"),
Sex = c("male",
"female",
"male",
"female",
"male",
"female"),
CI.half = rep(island.ci.half, 6)
)
# > summary.df
# Means Location Sex CI.half
# 1 3.15 island1 male 2.165215
# 2 6.20 island1 female 2.165215
# 3 10.55 island2 male 2.165215
# 4 15.60 island2 female 2.165215
# 5 2.55 island3 male 2.165215
# 6 4.40 island3 female 2.165215
# GENERATING THE ERRORBAR PLOT
library(ggplot2)
qplot(data=summary.df,
y=Means,
x=Location,
group=Sex,
ymin=Means-CI.half,
ymax=Means+CI.half,
geom=c("point", "errorbar", "line"),
color=Sex,
shape=Sex,
width=0.25) + theme_bw()
I have to admit I'm quite baffled by your code. Don't take this as personal criticism, but I strongly advise you to learn your students to use the power of R as much as possible. They can only benefit from it, and my experience is that they understand faster what's going on if I don't throw lines and lines of code clutter to their heads.
First of all, you don't have to calculate the means by hand. Just do:
df.2$mean <- with(df.2,ave(value,sex.codes,variable,FUN=mean))
See also ?ave
. That is more clear than the clutter of code in your example. If you have the lizard.model, you can just use
fitted(lizard.model)
and compare these values to the means.
Then I strongly disagree with you. What you calculate, is not the standard error on your prediction. To do that correctly, use the predict()
function
outcome <- predict(lizard.model,se.fit=TRUE)
df.2$CI.half <- outcome$se / 2
To get the confidence interval on the predicted means, you have to use the correct formulae if you ever want your students to understand this correctly. Take a look at section 3.5 of this incredibly great Practical Regression and Anova using R from Faraway. It contains tons of code examples where everything is calculated by hand in a convenient and concise way. It will serve both you and your students. I learnt a lot from it and use it often as a guideline when explaining these things to students.
Now to get the summary dataframe you have a few options, but this one works and is quite understandible.
summary.df <- unique(df.2[,-c(3,5,6)])
names(summary.df) <- c('Sex','Location','Means','CI.half')
And now you can just run your plot code as it stands there.
Alternatively, if you want the prediction error on your values, you can use the following:
lizard.predict <- predict(lizard.model,interval='prediction')
df.2$lower <- lizard.predict[,2]
df.2$upper <- lizard.predict[,3]
summary.df <- unique(df.2[,-3])
names(summary.df)[1:3] <- c('Sex','Location','Means')
qplot(data=summary.df,
y=Means,
x=Location,
group=Sex,
ymin=lower,
ymax=upper,
geom=c("point", "errorbar", "line"),
color=Sex,
shape=Sex,
width=0.25) + theme_bw()
PS: If I sound harsh here and there, that's not intended. English is not my mothertongue, and I'm still not acquainted with the subtleties of the language.