dplyr: inner_join with a partial string match

Stephen Turner · Oct 2, 2015

I'd like to join two data frames if the seed column in data frame y is a partial match on the string column in x. This example should illustrate:

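# What I have
library(dplyr)  # provides data_frame() and inner_join()
x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
# data_frame() is deprecated in current dplyr/tibble; tibble() is its replacement
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))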


x

  idX         string
1   1     Motorcycle
2   2 TractorTrailer
3   3       Sailboat

y

Source: local data frame [3 x 2]

    idY   seed
  (chr)  (chr)
1     a ractor
2     b otorcy
3     c irplan


# What I want
want <- data.frame(idX=c(1,2), idY=c("b", "a"), string=c("Motorcycle", "TractorTrailer"), seed=c("otorcy", "ractor"))

want

  idX idY         string   seed
1   1   b     Motorcycle otorcy
2   2   a TractorTrailer ractor

That is, something like the following pseudocode (it doesn't run as written, since by doesn't accept a logical vector):

inner_join(x, y, by=stringr::str_detect(x$string, y$seed))

Answer

Feng Mai · Oct 18, 2016

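The fuzzyjoin package has two functions, regex_inner_join and fuzzy_inner_join, that let you join on partial string matches. regex_inner_join treats the seed column as a regular expression, while fuzzy_inner_join takes any matching function you supply: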

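library(dplyr)      # provides %>%
library(fuzzyjoin)

x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data.frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))
# Make sure both join columns are character rather than factor (the default before R 4.0)
x$string <- as.character(x$string)
y$seed <- as.character(y$seed)

# Each seed is used as a regular expression and matched against string
x %>% regex_inner_join(y, by = c(string = "seed"))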

  idX         string idY   seed
1   1     Motorcycle   b otorcy
2   2 TractorTrailer   a ractor


library(stringr)
x %>% fuzzy_inner_join(y, by = c("string" = "seed"), match_fun = str_detect)


  idX         string idY   seed
1   1     Motorcycle   b otorcy
2   2 TractorTrailer   a ractor
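
For intuition, here is a rough base R sketch of what a join like this boils down to: pair every row of x with every row of y, then keep the pairs where seed occurs inside string. It is only an illustration and not how fuzzyjoin is implemented. The names pairs, is_match and result are my own, and it uses grepl(fixed = TRUE), so seed is matched literally rather than as a regular expression.

```r
# Same example data as above; stringsAsFactors = FALSE keeps the columns as character
x <- data.frame(idX = 1:3,
                string = c("Motorcycle", "TractorTrailer", "Sailboat"),
                stringsAsFactors = FALSE)
y <- data.frame(idY = letters[1:3],
                seed = c("ractor", "otorcy", "irplan"),
                stringsAsFactors = FALSE)

# Cross join: by = NULL makes merge() return every (row of x, row of y) pair
pairs <- merge(x, y, by = NULL)

# For each pair, test whether seed occurs literally inside string
is_match <- mapply(grepl, pairs$seed, pairs$string,
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# Keep the matching pairs, ordered by idX
result <- pairs[is_match, c("idX", "string", "idY", "seed")]
result <- result[order(result$idX), ]
rownames(result) <- NULL
result
```

The cross join builds nrow(x) * nrow(y) rows before filtering, so this is fine for small tables like the example but will get slow and memory-hungry on large ones.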