R: producing a list of near matches with stringdist and stringdistmatrix

vielmetti · Jul 18, 2015

I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular, I have a set of words and I want to print out near matches, where "near match" is defined by some metric such as the Levenshtein distance.

I have working but extremely slow code in a shell script, and I was able to load stringdist and produce a matrix of distances. Now I want to boil that matrix down to a smaller one that contains only the near matches, e.g. where the metric is non-zero but less than some threshold.

library(stringdist)
kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
kpm <- stringdistmatrix(kp,useNames="strings",method="lv")
> kpm
                     leaflet leafletr lego levenshtein-distance
leafletr                   1                                   
lego                       5        6                          
levenshtein-distance      16       16   18                     
logo                       6        7    1                   19
m = as.matrix(kpm)
close = apply(m, 1, function(x) x>0 & x<5)
>  close
                     leaflet leafletr  lego levenshtein-distance  logo
 leaflet                FALSE     TRUE FALSE                FALSE FALSE
 leafletr                TRUE    FALSE FALSE                FALSE FALSE
 lego                   FALSE    FALSE FALSE                FALSE  TRUE
 levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
 logo                   FALSE    FALSE  TRUE                FALSE FALSE

OK, now I have a (big) dist. How do I reduce it back to a list where the output would be something like

leafletr,leaflet,1
logo,lego,1

only for cases where the metric is non-zero and less than n=5? I found "apply()", which lets me do the test; now I need to sort out how to use the result.

The problem is not specific to stringdist and stringdistmatrix and is very elementary R, but still I'm stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.

Answer

tumultous_rooster · Jul 18, 2015

Set up your data:

library('stringdist')
library('dplyr')
library('reshape2')  # provides melt(), used below
kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
kpm <- stringdistmatrix(kp,useNames="strings",method="lv")

Here's where we can change kpm into a dataframe:

kpm <- data.frame(as.matrix(kpm))
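
One side effect of this conversion is worth knowing about: data.frame() sanitizes column names by default, which is why the column appears as levenshtein.distance in the output below while the row name keeps its hyphen. A minimal sketch of that behavior, using a small hand-built matrix (the object `m` is hypothetical and only stands in for as.matrix(kpm)):

```r
# Hypothetical 2x2 distance matrix standing in for as.matrix(kpm).
m <- matrix(c(0, 16, 16, 0), nrow = 2,
            dimnames = list(c('leaflet', 'levenshtein-distance'),
                            c('leaflet', 'levenshtein-distance')))

# Default: column names are passed through make.names(), so '-' becomes '.'.
mangled <- names(data.frame(m))

# check.names = FALSE keeps the column names exactly as given.
kept <- names(data.frame(m, check.names = FALSE))

print(mangled)
print(kept)
```

If the original strings matter downstream, passing check.names = FALSE to data.frame() is one way to keep rows and columns consistent.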

This is a way to get a matrix (apply() returns a matrix, not a dataframe) that has a '1' to mark where words are close enough:

idx <- apply(kpm, 2, function(x) x >0 & x<5)
idx <- ifelse(idx, 1, NA)  # TRUE becomes 1, FALSE becomes NA
#> idx
#                     leaflet leafletr lego levenshtein.distance logo
#  leaflet                   NA        1   NA                   NA   NA
#  leafletr                   1       NA   NA                   NA   NA
#  lego                      NA       NA   NA                   NA    1
#  levenshtein-distance      NA       NA   NA                   NA   NA
#  logo                      NA       NA    1                   NA   NA

To make things easy, melt the matrix into long format (melt() comes from the reshape2 package), keep the rows marked 1, and drop the value column:

final <- melt(idx) %>%
        filter(value==1) %>%
        select(Var1, Var2)

Don't forget to turn everything back into characters, not factors! (It's like a broken record in R sometimes...)

final[] <- lapply(final, as.character)
#> final
#      Var1     Var2
#  leafletr  leaflet
#   leaflet leafletr
#      logo     lego
#      lego     logo

Now we get rid of the duplicates:

final <- final[!duplicated(data.frame(list(do.call(pmin,final),do.call(pmax,final)))),]
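
That one-liner is dense, so here is a small sketch of the idea behind it. pmin() and pmax() work element-wise on character vectors, so the pairs (a, b) and (b, a) both reduce to the same ordered key, and duplicated() can then flag the second copy. The names `pairs`, `key`, and `unique_pairs` are made up for this illustration:

```r
# Both orderings of each near-match pair, as produced by the melt/filter step.
pairs <- data.frame(
  Var1 = c('leafletr', 'leaflet', 'logo', 'lego'),
  Var2 = c('leaflet', 'leafletr', 'lego', 'logo'),
  stringsAsFactors = FALSE
)

# Element-wise min and max give an order-independent key for each row.
key <- data.frame(lo = pmin(pairs$Var1, pairs$Var2),
                  hi = pmax(pairs$Var1, pairs$Var2))

# duplicated() marks the second occurrence of each key; keep the first.
unique_pairs <- pairs[!duplicated(key), ]
print(unique_pairs)
```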

Tack on some good names and you are good to go.

names(final) <- c('string 1', 'string 2')
#> final
# string 1 string 2
# leafletr  leaflet
#     logo     lego

(Although you requested a list, this is a dataframe. From here it's easy to convert it into whatever you need, e.g. write it to a CSV file.)
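
If you also want the distance in the output, as in the question's leafletr,leaflet,1 example, a shorter route may be to index the matrix directly with which(..., arr.ind = TRUE) and restrict it to the lower triangle so each pair shows up once. This is only a sketch, not part of the steps above. To keep it runnable without extra packages it assumes base R's adist() as a stand-in for stringdistmatrix(kp, method = "lv"), and the names `threshold`, `hit`, and `near` are made up here:

```r
# Words from the question.
kp <- c('leaflet', 'leafletr', 'lego', 'levenshtein-distance', 'logo')

# Assumption: adist() (utils, attached by default) replaces
# stringdistmatrix(); with default costs it is plain Levenshtein distance.
m <- adist(kp)
dimnames(m) <- list(kp, kp)

# Lower triangle only, so each pair is reported once; then keep cells
# that are non-zero and below the threshold.
threshold <- 5
hit <- which(lower.tri(m) & m > 0 & m < threshold, arr.ind = TRUE)

near <- data.frame(
  string1 = kp[hit[, "row"]],
  string2 = kp[hit[, "col"]],
  distance = m[hit],
  stringsAsFactors = FALSE
)

# Comma-separated lines without header, row names, or quotes.
write.table(near, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

With stringdist loaded, as.matrix(stringdistmatrix(kp, useNames = "strings", method = "lv")) should be usable in place of the adist() call and the dimnames() line.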