Let's say I have a vector of variables like this:
>variable
[1] "A1" "A1" "A1" "A1" "A2" "A2" "A2" "A2" "B1" "B1" "B1" "B1"
and I want to covert this into into a data frame like this:
treatment time
1 A 1
2 A 1
3 A 1
4 A 1
5 A 2
6 A 2
7 A 2
8 A 2
9 B 1
10 B 1
11 B 1
12 B 1
To that end, I used reshape2's colsplit function. It rquires a pattern to split the string, but I quickly realize there is no obvious pattern to split the two characters without any space. I tried "" and got the following results:
> colsplit(trialm$variable,"",names=c("treatment","time"))
treatment time
1 NA A1
2 NA A1
3 NA A1
4 NA A1
5 NA A2
6 NA A2
7 NA A2
8 NA A2
9 NA B1
10 NA B1
11 NA B1
12 NA B1
I also tried a lookbehind or lookahead regular expression :
>colsplit(trialm$variable,"(?<=\\w)",names=c("treatment","time"))
Error in gregexpr("(?<=\\w)", c("A1", "A1", "A1", "A1", "A2", "A2", "A2", :
invalid regular expression '(?<=\w)', reason 'Invalid regexp'
but it gave me the above error. How can I solve this problem?
Somewhere along the line, the "stringr" package (which is imported with "reshape2" and which is responsible for the splitting that takes place with colsplit
) started to use "stringi" for several of its functions. Some behavior seems to have changed because of that.
Using the current "reshape2" (and current "stringr" package), colsplit
works the way you would have expected it to with your code:
packageVersion("reshape2")
## [1] ‘1.4.3’
packageVersion("stringr")
## [1] ‘1.2.0’
colsplit(variable, "", names = c("treatment", "time"))
## treatment time
## 1 A 1
## 2 A 1
## 3 A 1
## 4 A 1
## 5 A 2
## 6 A 2
## 7 A 2
## 8 A 2
## 9 B 1
## 10 B 1
## 11 B 1
## 12 B 1
If a pattern can be detected in your "variable" but there is no clean split character that can be used, then add one :)
library(reshape2)
variable <- c("A1", "A1", "A1", "A1", "A2", "A2",
"A2", "A2", "B1", "B1", "B1", "B1")
## Here, we add a "." between upper case letters and numbers
colsplit(gsub("([A-Z])([0-9])", "\\1\\.\\2", variable),
"\\.", c("Treatment", "Time"))
# Treatment Time
# 1 A 1
# 2 A 1
# 3 A 1
# 4 A 1
# 5 A 2
# ::::: snip :::: #
# 11 B 1
# 12 B 1
My "splitstackshape" package has a single-purpose non-exported helper function called NoSep
that can be used for this:
splitstackshape:::NoSep(variable)
## .var .time_1
## 1 A 1
## 2 A 1
## 3 A 1
## 4 A 1
## 5 A 2
## ::: snip :::: #
## 11 B 1
## 12 B 1
The "tidyverse" (specifically the "tidyr" package) has a couple of convenient functions for splitting values into different columns: separate
and extract
. separate
has already been demonstrated by jazzuro, but the solution is very specific to this particular problem. Also, it generally works better with a delimiter. extract
expects you to specify a regular expression with the groups you want to capture:
library(tidyverse)
data.frame(variable) %>%
extract(variable, into = c("Treatment", "Time"), regex = "([A-Z]+)([0-9]+)")
# Treatment Time
# 1 A 1
# 2 A 1
# 3 A 1
# 4 A 1
# 5 A 2
# ::::: snip :::: #
# 11 B 1
# 12 B 1