Split delimited strings in a column and insert as new rows

Boxuan picture Boxuan · Mar 11, 2013 · Viewed 79.5k times · Source

I have a data frame as follow:

+-----+-------+
|  V1 |  V2   |
+-----+-------+
|  1  | a,b,c |
|  2  | a,c   |
|  3  | b,d   |
|  4  | e,f   |
|  .  | .     |
+-----+-------+

Each of the alphabet is a character separated by comma. I would like to split V2 on each comma and insert the split strings as new rows. For instance, the desired output will be:

+----+----+
| V1 | V2 |
+----+----+
|  1 |  a |
|  1 |  b |
|  1 |  c |
|  2 |  a |
|  2 |  c |
|  3 |  b |
|  3 |  d |
|  4 |  e |
|  4 |  f |
+----+----+

I am trying to use strsplit() to spit V2 first, then cast the list into a data frame. It didn't work. Any help will be appreciated.

Answer

dalloliogm picture dalloliogm · Dec 8, 2014

As of Dec 2014, this can be done using the unnest function from Hadley Wickham's tidyr package (see release notes http://blog.rstudio.org/2014/12/08/tidyr-0-2-0/)

> library(tidyr)
> library(dplyr)
> mydf

  V1    V2
2  1 a,b,c
3  2   a,c
4  3   b,d
5  4   e,f
6  .     .


> mydf %>% 
    mutate(V2 = strsplit(as.character(V2), ",")) %>% 
    unnest(V2)

   V1 V2
1   1  a
2   1  b
3   1  c
4   2  a
5   2  c
6   3  b
7   3  d
8   4  e
9   4  f
10  .  .

Update 2017: note the separate_rows function as described by @Tif below.

It works so much better, and it allows to "unnest" multiple columns in a single statement:

> head(mydf)
geneid              chrom    start  end strand  length  gene_count
ENSG00000223972.5   chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1    11869;12010;12179;12613;12613;12975;13221;13221;13453   12227;12057;12227;12721;12697;13052;13374;14409;13670   +;+;+;+;+;+;+;+;+   1735    11
ENSG00000227232.5   chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1  14404;15005;15796;16607;16858;17233;17606;17915;18268;24738;29534   14501;15038;15947;16765;17055;17368;17742;18061;18366;24891;29570   -;-;-;-;-;-;-;-;-;-;-   1351    380
ENSG00000278267.1   chr1    17369   17436   -   68  14
ENSG00000243485.4   chr1;chr1;chr1;chr1;chr1    29554;30267;30564;30976;30976   30039;30667;30667;31097;31109   +;+;+;+;+   1021    22
ENSG00000237613.2   chr1;chr1;chr1  34554;35277;35721   35174;35481;36081   -;-;-   1187    24
ENSG00000268020.3   chr1    52473   53312   +   840 14


> mydf %>% separate_rows(strand, chrom, gene_start, gene_end)
geneid  length  gene_count  strand  chrom   start   end
ENSG00000223972.5   1735    11  +   chr1    11869   12227
ENSG00000223972.5   1735    11  +   chr1    12010   12057
ENSG00000223972.5   1735    11  +   chr1    12179   12227
ENSG00000223972.5   1735    11  +   chr1    12613   12721
ENSG00000223972.5   1735    11  +   chr1    12613   12697
ENSG00000223972.5   1735    11  +   chr1    12975   13052
ENSG00000223972.5   1735    11  +   chr1    13221   13374
ENSG00000223972.5   1735    11  +   chr1    13221   14409
ENSG00000223972.5   1735    11  +   chr1    13453   13670
ENSG00000227232.5   1351    380 -   chr1    14404   14501
ENSG00000227232.5   1351    380 -   chr1    15005   15038
ENSG00000227232.5   1351    380 -   chr1    15796   15947
ENSG00000227232.5   1351    380 -   chr1    16607   16765
ENSG00000227232.5   1351    380 -   chr1    16858   17055
ENSG00000227232.5   1351    380 -   chr1    17233   17368
ENSG00000227232.5   1351    380 -   chr1    17606   17742
ENSG00000227232.5   1351    380 -   chr1    17915   18061
ENSG00000227232.5   1351    380 -   chr1    18268   18366
ENSG00000227232.5   1351    380 -   chr1    24738   24891
ENSG00000227232.5   1351    380 -   chr1    29534   29570
ENSG00000278267.1   68  5   -   chr1    17369   17436
ENSG00000243485.4   1021    8   +   chr1    29554   30039
ENSG00000243485.4   1021    8   +   chr1    30267   30667
ENSG00000243485.4   1021    8   +   chr1    30564   30667
ENSG00000243485.4   1021    8   +   chr1    30976   31097
ENSG00000243485.4   1021    8   +   chr1    30976   31109
ENSG00000237613.2   1187    24  -   chr1    34554   35174
ENSG00000237613.2   1187    24  -   chr1    35277   35481
ENSG00000237613.2   1187    24  -   chr1    35721   36081
ENSG00000268020.3   840 0   +   chr1    52473   53312