I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular I have a set of words, and I want to print out near-matches, where a "near match" is defined by a metric such as the Levenshtein distance.
I have extremely slow working code in a shell script, and I was able to load stringdist and produce a matrix of metrics. Now I want to boil that matrix down into a smaller one that only has the near matches, i.e. where the metric is non-zero but less than some threshold.
kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
kpm <- stringdistmatrix(kp,useNames="strings",method="lv")
> kpm
                     leaflet leafletr lego levenshtein-distance
leafletr                   1
lego                       5        6
levenshtein-distance      16       16   18
logo                       6        7    1                   19
m = as.matrix(kpm)
close = apply(m, 1, function(x) x>0 & x<5)
> close
                     leaflet leafletr  lego levenshtein-distance  logo
leaflet                FALSE     TRUE FALSE                FALSE FALSE
leafletr                TRUE    FALSE FALSE                FALSE FALSE
lego                   FALSE    FALSE FALSE                FALSE  TRUE
levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
logo                   FALSE    FALSE  TRUE                FALSE FALSE
OK, now I have a (big) dist, how do I reduce it back to a list where the output would be something like
leafletr,leaflet,1
logo,lego,1
for cases only where the metric is non-zero and less than n=5? I found "apply()" which lets me do the test, now I need to sort out how to use it.
The problem is not specific to stringdist and stringdistmatrix and is very elementary R, but still I'm stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.
Set up your data:
library('stringdist')
library('dplyr')     # filter(), select() and %>%
library('reshape2')  # melt(), used below
kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
kpm <- stringdistmatrix(kp,useNames="strings",method="lv")
Here's where we can change kpm into a dataframe:
# check.names = FALSE keeps 'levenshtein-distance' from becoming 'levenshtein.distance'
kpm <- data.frame(as.matrix(kpm), check.names = FALSE)
This is a way to get a matrix that has a '1' to mark where words are close enough:
idx <- apply(kpm, 2, function(x) x > 0 & x < 5)
idx <- apply(idx, 1:2, function(x) if (isTRUE(x)) 1 else NA)
#> idx
#                      leaflet leafletr lego levenshtein-distance logo
# leaflet                   NA        1   NA                   NA   NA
# leafletr                   1       NA   NA                   NA   NA
# lego                      NA       NA   NA                   NA    1
# levenshtein-distance      NA       NA   NA                   NA   NA
# logo                      NA       NA    1                   NA   NA
To make things easy, melt the matrix into long form (melt() comes from the reshape2 package), filter it and get rid of the last column:
final <- melt(idx) %>%
  filter(value == 1) %>%
  select(Var1, Var2)
Don't forget to turn everything back into characters, not factors! (It's like a broken record in R sometimes...)
final[] <- lapply(final, as.character)
#> final
#       Var1     Var2
#   leafletr  leaflet
#    leaflet leafletr
#       logo     lego
#       lego     logo
Now we get rid of the duplicates:
final <- final[!duplicated(data.frame(list(do.call(pmin,final),do.call(pmax,final)))),]
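That one-liner packs several steps together. Unpacked, the idea is roughly the following sketch. The toy data frame `pairs` and the names `lo`, `hi`, `keep` and `deduped` are made up here for illustration; they stand in for `final` and are not part of the original code.

```r
# Toy data standing in for the melted, filtered pairs
pairs <- data.frame(
  Var1 = c('leafletr', 'leaflet', 'logo', 'lego'),
  Var2 = c('leaflet', 'leafletr', 'lego', 'logo'),
  stringsAsFactors = FALSE
)

# Row-wise alphabetical min and max put every pair into one canonical order,
# so (leafletr, leaflet) and (leaflet, leafletr) become the same (lo, hi) row
lo <- pmin(pairs$Var1, pairs$Var2)
hi <- pmax(pairs$Var1, pairs$Var2)

# duplicated() flags the second occurrence of each canonical pair
keep <- !duplicated(data.frame(lo, hi))

# Keep the first occurrence of each pair, in its original orientation
deduped <- pairs[keep, ]
```

This only works once the columns are character vectors, which is why the factor conversion has to come first.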
Tack on some good names and you are good to go.
names(final) <- c('string 1', 'string 2')
#> final
#   string 1 string 2
#   leafletr  leaflet
#       logo     lego
(Although you requested a list, this is a dataframe. From here it's pretty easy to convert it into whatever you need, e.g. write it to a csv.)
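The question's sample output also carried the distance itself as a third field. If that is wanted, a more direct route might be to skip the melt step and index the distance matrix with which(..., arr.ind = TRUE). The following is only a sketch: `near_matches()` is a hypothetical helper, not a stringdist function, and base R's adist() is swapped in for stringdistmatrix() so the snippet runs without extra packages. Passing as.matrix() of the stringdistmatrix() result (before it is turned into a dataframe) should work the same way.

```r
# Hypothetical helper (not part of stringdist): list the pairs whose
# distance is non-zero and below `threshold`, together with that distance.
near_matches <- function(m, threshold = 5) {
  # Restrict to the lower triangle so each pair shows up once
  # and the zero diagonal is skipped
  hit <- which(lower.tri(m) & m > 0 & m < threshold, arr.ind = TRUE)
  data.frame(
    string1  = rownames(m)[hit[, "row"]],
    string2  = colnames(m)[hit[, "col"]],
    distance = m[hit],
    stringsAsFactors = FALSE
  )
}

kp <- c('leaflet', 'leafletr', 'lego', 'levenshtein-distance', 'logo')

# Assumption: adist() (base R, Levenshtein by default) stands in for
# as.matrix(stringdistmatrix(kp, method = "lv"))
m <- adist(kp)
dimnames(m) <- list(kp, kp)

res <- near_matches(m, threshold = 5)

# Print as comma-separated lines, similar to the format asked for
write.csv(res, row.names = FALSE, quote = FALSE)
```

Because only the lower triangle is inspected, no separate de-duplication step is needed.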