Plot the equivalent of correlation matrix for factors (categorical data)? And mixed types?

J.D picture J.D · Sep 28, 2018 · Viewed 7.3k times · Source

Actually there are 2 questions, one is more advanced than the other.

Q1: I am looking for a method that similar to corrplot() but can deal with factors.

I originally tried to use chisq.test() then calculate the p-value and Cramer's V as correlation, but there too many columns to figure out. So could anyone tell me if there is a quick way to create a "corrplot" that each cell contains the value of Cramer's V, while the colour is rendered by p-value. Or any other kind of similar plot.

Regarding Cramer's V, let's say tbl is a 2-dimensional factor data frame.

chi2 <- chisq.test(tbl, correct=F)
Cramer_V <- sqrt(chi2$/nrow(tbl)) 

I prepared a test data frame with factors:

df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 

Q2: Then I would like to compute a correlation/association matrix on a mixed-types dataframe e.g.:

df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 
df$group <- factor(df$group, levels = c('A', 'B', 'C'), ordered = T)
df$student <- as.integer(df$student)

Answer

AntoniosK picture AntoniosK · Sep 28, 2018

Here's a tidyverse solution:

# example dataframe
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 

library(tidyverse)
library(lsr)

# function to get chi square p value and Cramers V
f = function(x,y) {
    tbl = df %>% select(x,y) %>% table()
    chisq_pval = round(chisq.test(tbl)$p.value, 4)
    cramV = round(cramersV(tbl), 4) 
    data.frame(x, y, chisq_pval, cramV) }

# create unique combinations of column names
# sorting will help getting a better plot (upper triangular)
df_comb = data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = F)

# apply function to each variable combination
df_res = map2_df(df_comb$X1, df_comb$X2, f)

# plot results
df_res %>%
  ggplot(aes(x,y,fill=chisq_pval))+
  geom_tile()+
  geom_text(aes(x,y,label=cramV))+
  scale_fill_gradient(low="red", high="yellow")+
  theme_classic()

enter image description here

Note that I'm using lsr package to calculate Cramers V using the cramersV function.