Perform multiple paired t-tests based on groups/categories

User100009 · Mar 5, 2017 · Viewed 13.1k times

I am stuck performing t-tests for multiple categories in RStudio. I want the result of a paired t-test for each product type, comparing the online and offline prices. I have over 800 product types, so I don't want to do it manually for each product group.

I have a dataframe (more than 2 million rows) named data that looks like:

> Product_type   Price_Online   Price_Offline   
1   A            48             37
2   B            29             22
3   B            32             40
4   A            38             36
5   C            32             27
6   C            31             35
7   C            28             24
8   A            47             42
9   C            40             36

Ideally I want R to write the result of the t.test to another data frame called product_types:

    > Product_type   
    1   A           
    2   B            
    3   C          
    4   D          
    5   E         
    6   F            
    7   G            
    8   H            
    9   I            
   800 ...

becomes:

> Product_type   t         df       p-value   interval    mean of difference            
    1   A           
    2   B            
    3   C          
    4   D          
    5   E         
    6   F            
    7   G            
    8   H            
    9   I            
   800 ...

This is the call I would use if each product type were in its own data frame:

t.test(Product_A$Price_Online, Product_A$Price_Offline, mu=0, alt="two.sided", paired = TRUE, conf.level = 0.99)

There must be an easier way to do this. Otherwise I need to make 800+ data frames and then perform the t test 800 times.

I tried using lists and lapply, but so far without success. I also tried the approach from "t-Test on multiple columns": https://sebastiansauer.github.io/multiple-t-tests-with-dplyr/

However, in the end the author still enters the categories (male and female) manually, and I have over 800 categories.

Answer

yeedle · Mar 5, 2017

The tidy way of doing it is using dplyr and broom:

library(dplyr)
library(broom)

# One paired t-test per product type; tidy() turns each result into a one-row data frame
df <- data %>% 
  group_by(Product_type) %>% 
  do(tidy(t.test(.$Price_Online, 
                 .$Price_Offline, 
                 mu = 0, 
                 alternative = "two.sided", 
                 paired = TRUE, 
                 conf.level = 0.99)))

Much more readable than my base R solution, and it handles the column names for you!

EDIT: A more idiomatic approach than do (see r4ds) is to use nest to create a nested data frame for each product type, then run a t-test on each nested data frame using map from purrr.

library(broom)
library(dplyr)
library(purrr)
library(tidyr)

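# Paired t-test of online vs. offline prices for the rows of one product type
t_test <- function(df, mu = 0, alternative = "two.sided", paired = TRUE, conf.level = 0.99) {
  tidy(t.test(df$Price_Online, 
              df$Price_Offline,
              mu = mu, 
              alternative = alternative,
              paired = paired,
              conf.level = conf.level))
}

# Start from the raw data (not the df result above), nest by product type,
# test each nested data frame, then expand the tidy results into columns
d <- data %>%
  group_by(Product_type) %>%
  nest() %>%
  mutate(ttest = map(data, t_test)) %>%
  select(-data) %>%
  unnest(ttest)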
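
For comparison, here is a minimal base R sketch of the same idea using split and lapply, which the question mentions trying. It assumes the column names from the question. The helper name paired_t_row and the output column names are made up for illustration, and the sample data is just the nine rows shown in the question.

```r
# Example data copied from the question (the real data has 2M+ rows).
data <- data.frame(
  Product_type  = c("A", "B", "B", "A", "C", "C", "C", "A", "C"),
  Price_Online  = c(48, 29, 32, 38, 32, 31, 28, 47, 40),
  Price_Offline = c(37, 22, 40, 36, 27, 35, 24, 42, 36)
)

# Hypothetical helper: run one paired t-test on the rows of a single
# product type and return the pieces of the result as a one-row data frame.
paired_t_row <- function(d) {
  tt <- t.test(d$Price_Online, d$Price_Offline,
               mu = 0, alternative = "two.sided",
               paired = TRUE, conf.level = 0.99)
  data.frame(
    Product_type = d$Product_type[1],
    t            = unname(tt$statistic),
    df           = unname(tt$parameter),
    p_value      = tt$p.value,
    conf_low     = tt$conf.int[1],
    conf_high    = tt$conf.int[2],
    mean_diff    = unname(tt$estimate)
  )
}

# Split into one data frame per product type, test each, and stack the rows.
by_type <- split(data, data$Product_type)
product_types <- do.call(rbind, lapply(by_type, paired_t_row))
rownames(product_types) <- NULL
```

Note that t.test stops with an error when a group has fewer than two complete pairs, so with 800+ product types it may be worth filtering out such groups before running either version.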