I built a simple linear regression model with 'Score' as the dependent variable, and 'Activity' as the independent one. 'Activity' has 5 levels: 'listen' (reference level), 'read1', 'read2', 'watch1', 'watch2'.
Call:
lm(formula = Score ~ Activity)
Residuals:
Min 1Q Median 3Q Max
-22.6154 -8.6154 -0.6154 7.1346 31.3846
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.615 2.553 16.302 <2e-16 ***
Activityread1 6.385 7.937 0.804 0.4254
Activityread2 20.885 9.552 2.186 0.0340 *
Activitywatch1 3.885 4.315 0.900 0.3728
Activitywatch2 -11.415 6.357 -1.796 0.0792 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.02 on 45 degrees of freedom
Multiple R-squared: 0.1901, Adjusted R-squared: 0.1181
F-statistic: 2.64 on 4 and 45 DF, p-value: 0.04594
In order to obtain all pairwise comparisons, I performed a TukeyHSD test, whose output I'm having difficulty interpreting. While the output of the model shows that the only significant effect we have is due to the contrast between 'listen' and 'read2', the TukeyHSD results yield that the only significant contrast exists between 'watch2' and 'read2'. What does this mean?
In your initial model summary, Estimate shows the estimated difference in mean Score for each group relative to the reference group "listen" (whose mean is the intercept, 41.615). The "read2" group has the largest shift away from "listen" (+20.885) and is called significant with p = .0340 when only these 4 comparisons are made.
Since TukeyHSD performs all pairwise comparisons between the group means (not just comparisons to the reference level "listen"), it also adjusts the p-values to account for all of these extra tests. The reason is that if you performed 20 comparisons on purely random data, you would expect about one of them (1/20 = .05) to come out significant at p < .05 simply because you ran that many tests. With the p-value adjustment factored in, your originally significant "listen - read2" comparison no longer qualifies as significant.
But the larger difference between "watch2 - read2" (-32.3), which the original model summary never tested, is big enough to remain significant (p = .03688) even after all of the extra-comparison adjustment.
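To make this concrete, here is a minimal sketch of the workflow. Your actual data aren't shown, so the group means, standard deviation, and equal group sizes below are made-up values chosen to mimic your situation; the numbers it produces will differ from yours.

```r
# Hypothetical data mimicking the question: 5 activity groups, n = 10 each.
set.seed(42)
Activity <- factor(rep(c("listen", "read1", "read2", "watch1", "watch2"),
                       each = 10))
group_means <- c(listen = 41.6, read1 = 48.0, read2 = 62.5,
                 watch1 = 45.5, watch2 = 30.2)   # assumed values
Score <- rnorm(50, mean = group_means[as.character(Activity)], sd = 13)

fit <- aov(Score ~ Activity)   # TukeyHSD() needs an aov fit, not a plain lm
summary.lm(fit)                # same coefficients as lm(Score ~ Activity):
                               # 4 contrasts against the reference "listen"
TukeyHSD(fit)                  # all choose(5, 2) = 10 pairwise comparisons,
                               # with Tukey-adjusted p-values
```

The key point is that `summary.lm(fit)` tests only 4 contrasts (each level vs. "listen"), while `TukeyHSD(fit)` tests all 10 pairs and adjusts each p-value for that larger family of tests.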
Hope that helps. You can read more about the multiple comparison problem here, and see ?p.adjust for R's implementation of the most popular adjustment methods.
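To see the adjustment mechanics in isolation, you can feed raw p-values to p.adjust directly. The ten p-values below are hypothetical, with one just under .05, echoing how a single borderline result stops being significant once the family of tests is accounted for:

```r
# Ten hypothetical raw p-values from ten pairwise tests; only the first
# is below .05 before adjustment.
p_raw <- c(0.034, 0.079, 0.43, 0.37, 0.62, 0.15, 0.88, 0.51, 0.09, 0.24)

p.adjust(p_raw, method = "bonferroni")  # multiplies each p by 10, capped at 1
p.adjust(p_raw, method = "holm")        # step-down version, slightly less harsh

# Under Bonferroni, 0.034 becomes 0.034 * 10 = 0.34 -- no longer < .05.
```

Tukey's HSD uses the studentized range distribution rather than a simple multiplier, but the effect is the same in spirit: each comparison must clear a higher bar when it is one of many.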