R: Replacing foreign characters in a string

krishnan · Jul 8, 2013 · Viewed 8.8k times

I'm dealing with a large amount of data, mostly names with non-English characters. My goal is to match these names against some information on them collected in the USA.

For example, I might want to match the name 'Sølvsten' (from some list of names) to 'Soelvsten' (the name as stored in some American database). Here is a function I wrote to do this. It's clearly clunky and somewhat arbitrary, so I wonder whether there is a simple R function that translates these foreign characters to their nearest English equivalents. I understand that there might not be a standard way to do this conversion, but I'm curious whether one exists and whether it can be done with an R function.

# a function to replace foreign characters
replaceforeignchars <- function(x)
{
    # gsub() is part of base R, so no extra package (such as gsubfn) is needed
    x <- gsub("š","s",x)
    x <- gsub("œ","oe",x)
    x <- gsub("ž","z",x)
    x <- gsub("ß","ss",x)
    x <- gsub("þ","y",x)
    x <- gsub("à","a",x)
    x <- gsub("á","a",x)
    x <- gsub("â","a",x)
    x <- gsub("ã","a",x)
    x <- gsub("ä","a",x)
    x <- gsub("å","a",x)
    x <- gsub("æ","ae",x)
    x <- gsub("ç","c",x)
    x <- gsub("è","e",x)
    x <- gsub("é","e",x)
    x <- gsub("ê","e",x)
    x <- gsub("ë","e",x)
    x <- gsub("ì","i",x)
    x <- gsub("í","i",x)
    x <- gsub("î","i",x)
    x <- gsub("ï","i",x)
    x <- gsub("ð","d",x)
    x <- gsub("ñ","n",x)
    x <- gsub("ò","o",x)
    x <- gsub("ó","o",x)
    x <- gsub("ô","o",x)
    x <- gsub("õ","o",x)
    x <- gsub("ö","o",x)
    x <- gsub("ø","oe",x)
    x <- gsub("ù","u",x)
    x <- gsub("ú","u",x)
    x <- gsub("û","u",x)
    x <- gsub("ü","u",x)
    x <- gsub("ý","y",x)
    x <- gsub("ÿ","y",x)
    x <- gsub("ğ","g",x)

    return(x)
}

Note: I know there are approximate name-matching methods such as Jaro-Winkler distance, but I'd rather do exact matches.

Answer

G. Grothendieck · Jul 8, 2013

Try the base R function chartr for the one-to-one character substitutions (which should be quite fast), then finish with a gsub call for each of the one-to-two character substitutions (presumably slower, but there are not many of them).

to.plain <- function(s) {

   # 1 character substitutions
   old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý"
   new1 <- "szyaaaaaaceeeeiiiidnooooouuuuy"
   s1 <- chartr(old1, new1, s)

   # 2 character substitutions
   old2 <- c("œ", "ß", "æ", "ø")
   new2 <- c("oe", "ss", "ae", "oe")
   s2 <- s1
   for(i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)

   s2
}

Add to old1, new1, old2 and new2 as needed.
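
As a rough sketch of what "adding as needed" might look like, here is a variant that also covers ÿ and ğ from the question's list. The name to.plain2 and the named-vector layout for the two-character cases are my own choices for illustration, not part of the original function, and the snippet assumes the script is saved and read as UTF-8.

```r
# Hypothetical variant of to.plain(): same two-step idea, extended tables
to.plain2 <- function(s) {
  # one-to-one substitutions; ÿ and ğ appended from the question's list
  old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüýÿğ"
  new1 <- "szyaaaaaaceeeeiiiidnooooouuuuyyg"
  # chartr() fails if 'old' is longer than 'new', so check the tables line up
  stopifnot(nchar(old1) == nchar(new1))
  s <- chartr(old1, new1, s)

  # one-to-two substitutions kept as a named vector: name = old, value = new
  two <- c("œ" = "oe", "ß" = "ss", "æ" = "ae", "ø" = "oe")
  for (ch in names(two)) s <- gsub(ch, two[[ch]], s, fixed = TRUE)
  s
}

to.plain2(c("Sølvsten", "Erdoğan", "Straße"))
# [1] "Soelvsten" "Erdogan"   "Strasse"
```

Keeping each pair side by side in the named vector makes it harder for the two-character tables to drift out of step when new entries are added. Note that the tables only list lowercase letters, so uppercase accented characters pass through unchanged unless you add them too.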

Here is a test:

> s <- "æxš"
> to.plain(s)
[1] "aexs"
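
To tie this back to the exact-matching goal in the question, here is a minimal sketch of how the converted names might be looked up with match(). The data (foreign, us_db and the id column) are made up for illustration, and to.plain is repeated in compact form only so the snippet runs on its own.

```r
# Compact copy of to.plain() from above, repeated so this snippet is standalone
to.plain <- function(s) {
  s <- chartr("šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý",
              "szyaaaaaaceeeeiiiidnooooouuuuy", s)
  old2 <- c("œ", "ß", "æ", "ø")
  new2 <- c("oe", "ss", "ae", "oe")
  for (i in seq_along(old2)) s <- gsub(old2[i], new2[i], s, fixed = TRUE)
  s
}

# Hypothetical data: names as collected, and names as stored in a US database
foreign <- c("Sølvsten", "Müller", "Nuñez")
us_db   <- data.frame(name = c("Muller", "Soelvsten", "Smith"),
                      id   = c(101, 102, 103))

# match() does exact matching; NA marks a name with no counterpart in us_db
idx <- match(to.plain(foreign), us_db$name)
data.frame(foreign, plain = to.plain(foreign), id = us_db$id[idx])
```

Here 'Sølvsten' and 'Müller' find their rows and 'Nuñez' comes back as NA. Whether a match is found depends entirely on the substitution tables agreeing with the spelling used in the database (for instance ü becomes u here, not ue), so the tables may need adjusting per source.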

UPDATE: corrected variable names in chartr.