longest common substring in R finding non-contiguous matches between the two strings

r lcs
IAMTubby picture IAMTubby · Feb 1, 2015 · Viewed 8.9k times · Source

I have a question regarding finding the longest common substring in R. While searching through a few posts on StackOverflow, I got to know about the qualV package. However, I see that the LCS function in this package actually finds all characters from string1 which are present in string2, even if they are not contiguous.

To explain, if the strings are string1 : "hello" string2 : "hel12345lo" I expect the output to be hel, however I get the output as hello. I must be doing something wrong. Please see my code below.

library(qualV)
a= "hello"
b="hel123l5678o" 
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

I have also tried the Rlibstree method but I still get substrings which are not contiguous. Also, the length of the substring is also off from my expectation.s Please see below.

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21

Answer

Rich Scriven picture Rich Scriven · Feb 1, 2015

Here are three possible solutions.

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"

There are also adist() and agrep() in base R, and the stringdist package has a few functions that run the LCS method. Here's a look at stringsidt. It returns the number of unpaired characters.

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0 

Now that I've explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE we get a sequence of Matches, Insertions, etc. So if you give that to stri_locate() we can use that matrix to get the matches from a to b.

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"

So the M values denote straight across matches. We can go and get the substrings with stri_sub()

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o" 

Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.