split characters into two variables in data frame

Alby picture Alby · Apr 24, 2013 · Viewed 32.8k times · Source

Let's say I have a vector of variables like this:

>variable
[1] "A1" "A1" "A1" "A1" "A2" "A2" "A2" "A2" "B1" "B1" "B1" "B1"

and I want to covert this into into a data frame like this:

  treatment time
1         A    1
2         A    1
3         A    1
4         A    1
5         A    2
6         A    2
7         A    2
8         A    2
9         B    1
10        B    1
11        B    1
12        B    1

To that end, I used reshape2's colsplit function. It rquires a pattern to split the string, but I quickly realize there is no obvious pattern to split the two characters without any space. I tried "" and got the following results:

> colsplit(trialm$variable,"",names=c("treatment","time"))
   treatment time
1         NA   A1
2         NA   A1
3         NA   A1
4         NA   A1
5         NA   A2
6         NA   A2
7         NA   A2
8         NA   A2
9         NA   B1
10        NA   B1
11        NA   B1
12        NA   B1

I also tried a lookbehind or lookahead regular expression :

>colsplit(trialm$variable,"(?<=\\w)",names=c("treatment","time"))
Error in gregexpr("(?<=\\w)", c("A1", "A1", "A1", "A1", "A2", "A2", "A2",  : 
  invalid regular expression '(?<=\w)', reason 'Invalid regexp'

but it gave me the above error. How can I solve this problem?

Answer

A5C1D2H2I1M1N2O1R2T1 picture A5C1D2H2I1M1N2O1R2T1 · Apr 24, 2013

Update: 24 December 2017

Somewhere along the line, the "stringr" package (which is imported with "reshape2" and which is responsible for the splitting that takes place with colsplit) started to use "stringi" for several of its functions. Some behavior seems to have changed because of that.

Using the current "reshape2" (and current "stringr" package), colsplit works the way you would have expected it to with your code:

packageVersion("reshape2")
## [1] ‘1.4.3’
packageVersion("stringr")
## [1] ‘1.2.0’

colsplit(variable, "", names = c("treatment", "time"))
##    treatment time
## 1          A    1
## 2          A    1
## 3          A    1
## 4          A    1
## 5          A    2
## 6          A    2
## 7          A    2
## 8          A    2
## 9          B    1
## 10         B    1
## 11         B    1
## 12         B    1

Original Answer: 24 April 2013

If a pattern can be detected in your "variable" but there is no clean split character that can be used, then add one :)

library(reshape2)
variable <- c("A1", "A1", "A1", "A1", "A2", "A2", 
              "A2", "A2", "B1", "B1", "B1", "B1")
## Here, we add a "." between upper case letters and numbers
colsplit(gsub("([A-Z])([0-9])", "\\1\\.\\2", variable), 
         "\\.", c("Treatment", "Time"))
#    Treatment Time
# 1          A    1
# 2          A    1
# 3          A    1
# 4          A    1
# 5          A    2
# ::::: snip :::: #
# 11         B    1
# 12         B    1

Additional Options: 23 December 2017

My "splitstackshape" package has a single-purpose non-exported helper function called NoSep that can be used for this:

splitstackshape:::NoSep(variable)
##    .var .time_1
## 1     A       1
## 2     A       1
## 3     A       1
## 4     A       1
## 5     A       2
## ::: snip :::: #
## 11    B       1
## 12    B       1

The "tidyverse" (specifically the "tidyr" package) has a couple of convenient functions for splitting values into different columns: separate and extract. separate has already been demonstrated by jazzuro, but the solution is very specific to this particular problem. Also, it generally works better with a delimiter. extract expects you to specify a regular expression with the groups you want to capture:

library(tidyverse)
data.frame(variable) %>% 
  extract(variable, into = c("Treatment", "Time"), regex = "([A-Z]+)([0-9]+)")
#    Treatment Time
# 1          A    1
# 2          A    1
# 3          A    1
# 4          A    1
# 5          A    2
# ::::: snip :::: #
# 11         B    1
# 12         B    1