I have a data frame, like so:
data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa",
"Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
"Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu",
"Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
"Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik",
"Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
"Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))
As you can see, some entries in the director
column are multiple names separated by commas. I would like to split these entries up into separate rows while maintaining the values of the other column. As an example, the first row in the data frame above should be split into two rows, with a single name each in the director
column and 'A' in the AB
column.
Several alternatives:
1) two ways with data.table:
library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
][,.(director = V1, AB)]
2) a dplyr / tidyr combination:
library(dplyr)
library(tidyr)
v %>%
mutate(director = strsplit(as.character(director), ",")) %>%
unnest(director)
3) with tidyr only: With tidyr 0.5.0
(and later), you can also just use separate_rows
:
separate_rows(v, director, sep = ",")
You can use the convert = TRUE
parameter to automatically convert numbers into numeric columns.
4) with base R:
# if 'director' is a character-column:
stack(setNames(strsplit(df$director,','), df$AB))
# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(df$director),','), df$AB))