Using the iris dataset, I'm trying to calculate a z-score for each of the variables. I have put the data in tidy format by running the following:
library(reshape2)
library(dplyr)
test <- iris
test <- melt(iris,id.vars = 'Species')
That gives me the following:
Species variable value
1 setosa Sepal.Length 5.1
2 setosa Sepal.Length 4.9
3 setosa Sepal.Length 4.7
4 setosa Sepal.Length 4.6
5 setosa Sepal.Length 5.0
6 setosa Sepal.Length 5.4
But when I try to create a z-score column for each group (e.g. the z-score for Sepal.Length will not be comparable to that of Sepal.Width) using the following:
test <- test %>%
group_by(Species, variable) %>%
mutate(z_score = (value - mean(value)) / sd(value))
The resulting z-scores have not been grouped, and are based on all of the data.
What's the best way to return the z-scores by group using dplyr?
Many thanks!
Your code is giving you z-scores by group. It seems to me these z-scores should be comparable exactly because you've individually scaled each group to mean=0 and sd=1, rather than scaling each value based on the mean and sd of the full data frame. For example:
library(tidyverse)
First, set up the melted data frame:
dat = iris %>%
gather(variable, value, -Species) %>%
group_by(Species, variable) %>%
mutate(z_score_group = (value - mean(value)) / sd(value)) %>% # You can also use scale(value) as pointed out by @RuiBarradas
ungroup %>%
mutate(z_score_ungrouped = (value - mean(value)) / sd(value))
Now look at the first three rows and compare with direct calculation:
head(dat, 3)
# Species variable value z_score_group z_score_ungrouped
# 1 setosa Sepal.Length 5.1 0.2666745 0.8278959
# 2 setosa Sepal.Length 4.9 -0.3007180 0.7266552
# 3 setosa Sepal.Length 4.7 -0.8681105 0.6254145
# z-scores by group
with(dat, (value[1:3] - mean(value[Species=="setosa" & variable=="Sepal.Length"])) / sd(value[Species=="setosa" & variable=="Sepal.Length"]))
# [1] 0.2666745 -0.3007180 -0.8681105
# ungrouped z-scores
with(dat, (value[1:3] - mean(value)) / sd(value))
# [1] 0.8278959 0.7266552 0.6254145
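The scale() alternative mentioned in the code comment above can be checked in the same spirit. This is a minimal sketch, assuming dplyr and tidyr are installed; the names dat_scale, z_score_manual and z_score_scale are made up for illustration. Since scale() returns a one-column matrix, it is wrapped in as.numeric() here to get a plain vector.

```r
library(dplyr)
library(tidyr)

dat_scale <- iris %>%
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>%
  mutate(
    # manual z-score within each Species/variable group
    z_score_manual = (value - mean(value)) / sd(value),
    # scale() centers and divides by the sd by default; drop the matrix attributes
    z_score_scale = as.numeric(scale(value))
  ) %>%
  ungroup()

# Compare the two columns (up to floating-point tolerance)
all.equal(dat_scale$z_score_manual, dat_scale$z_score_scale)
```

If the two columns agree, either form can be used inside the grouped mutate().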
Now visualize the z-scores. The first graph below is the raw data. The second is the ungrouped z-scores, where we've just rescaled the data to an overall mean=0 and SD=1. The third graph is what your code produces: each group has been individually scaled to mean=0 and SD=1.
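# One density plot per column (raw value, ungrouped z-score, grouped z-score), stacked vertically.
# aes_string() is deprecated in recent ggplot2, so the .data pronoun is used instead.
plot_dat = dat %>% mutate(group = paste(Species, variable, sep = "_"))
gridExtra::grid.arrange(
  grobs = c("value", "z_score_ungrouped", "z_score_group") %>%
    map(~ ggplot(plot_dat, aes(.data[[.x]], colour = group)) +
          geom_density() + xlab(.x)),
  ncol = 1)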
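If the plots aren't convincing, a numerical check is to summarise the mean and SD of each z-score column within every Species/variable group. The following is only a sketch, assuming dplyr (1.0.0 or later, for the .groups argument) and tidyr; check and the summary column names are illustrative.

```r
library(dplyr)
library(tidyr)

check <- iris %>%
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%
  ungroup() %>%
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value)) %>%
  # summarise both z-score columns per group
  group_by(Species, variable) %>%
  summarise(
    mean_group = mean(z_score_group),
    sd_group = sd(z_score_group),
    mean_ungrouped = mean(z_score_ungrouped),
    sd_ungrouped = sd(z_score_ungrouped),
    .groups = "drop"
  )

check
```

For the grouped z-scores, every row should show a mean of zero (up to floating-point error) and an SD of one. The ungrouped z-scores will generally have a different mean and SD in each group, because they were scaled against the whole value column.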