How do I interpret the TukeyHSD output in R? (in relation to the underlying regression model)

fannilegoza · Jan 8, 2017

I built a simple linear regression model with 'Score' as the dependent variable, and 'Activity' as the independent one. 'Activity' has 5 levels: 'listen' (reference level), 'read1', 'read2', 'watch1', 'watch2'.

Call:
lm(formula = Score ~ Activity)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6154  -8.6154  -0.6154   7.1346  31.3846 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      41.615      2.553  16.302   <2e-16 ***
Activityread1     6.385      7.937   0.804   0.4254    
Activityread2    20.885      9.552   2.186   0.0340 *  
Activitywatch1    3.885      4.315   0.900   0.3728    
Activitywatch2  -11.415      6.357  -1.796   0.0792 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.02 on 45 degrees of freedom
Multiple R-squared:  0.1901,    Adjusted R-squared:  0.1181 
F-statistic:  2.64 on 4 and 45 DF,  p-value: 0.04594

In order to obtain all pairwise comparisons, I performed a TukeyHSD test, whose output I'm having difficulty interpreting. While the output of the model shows that the only significant effect we have is due to the contrast between 'listen' and 'read2', the TukeyHSD results yield that the only significant contrast exists between 'watch2' and 'read2'. What does this mean?
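For reference, `TukeyHSD()` in base R operates on an `aov` fit rather than an `lm` fit, so a run like the one described would look roughly like this (a sketch assuming the data live in a hypothetical data frame `dat` with columns `Score` and `Activity`):

```r
# Hypothetical data frame 'dat' with columns Score and Activity (5 levels)
fit <- aov(Score ~ Activity, data = dat)  # TukeyHSD() requires an aov object
summary(fit)         # same overall F-test as the lm() summary above
TukeyHSD(fit)        # all 10 pairwise comparisons with Tukey-adjusted p-values
# plot(TukeyHSD(fit))  # intervals that exclude 0 are the significant contrasts
```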

Answer

Nate · Jan 8, 2017

In your initial model summary, Estimate shows the estimated difference in mean for each group relative to the reference group "listen" (whose mean is the intercept, 41.615). The "read2" group has the largest shift (+20.885) away from the "listen" group and is called significant with p = .0340 when only these 4 comparisons are calculated.

Since TukeyHSD() performs all pairwise comparisons of the group means (not just comparisons against the reference level "listen"), it also adjusts the p-values to account for all of these extra tests. The reason: if you performed 20 comparisons on random data, you would expect about one of them (1/20, or .05) to be called significant at p < .05 simply because you did that many tests. With the p-value adjustment factored in, your originally significant "listen - read2" comparison no longer qualifies as significant.
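You can see how quickly the false-positive risk grows with a quick back-of-the-envelope calculation (assuming, for illustration, that the tests were independent):

```r
# Chance of at least one false positive across k independent tests at alpha = .05
alpha <- 0.05
k <- choose(5, 2)       # 10 pairwise comparisons among 5 groups
1 - (1 - alpha)^k       # roughly 0.40: a ~40% family-wise error rate
```

This is why Tukey's procedure raises the bar each individual comparison has to clear.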

But the larger difference for "watch2 - read2" (-32.3), which wasn't tested in the original model summary, is big enough to remain significant (p = .03688) even after all of the extra comparison adjusting.

Hope that helps. You can read more about the multiple comparison problem here, and see ?p.adjust for R's implementation of the most popular adjustment methods.
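To make the adjustment concrete with `p.adjust()` (note this is just an illustration: Tukey's HSD uses the studentized range distribution, not these generic corrections), you can feed it the four unadjusted p-values from the lm() summary above:

```r
# Unadjusted p-values from the four lm() contrasts against "listen"
p <- c(read1 = 0.4254, read2 = 0.0340, watch1 = 0.3728, watch2 = 0.0792)
p.adjust(p, method = "bonferroni")  # read2 becomes 4 * .0340 = .136, no longer < .05
p.adjust(p, method = "holm")        # a less conservative step-down alternative
```

Even with a simple Bonferroni correction over just these four tests, the "read2" contrast stops being significant, mirroring what TukeyHSD reported.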