Stemming with R Text Analysis

RUser · Jun 27, 2014

I am doing a lot of analysis with the tm package. One of my biggest problems is stemming and stemming-like transformations.

Let's say I have several accounting-related terms (I am aware of the spelling issues).
After stemming we have:

accounts   -> account  
account    -> account  
accounting -> account  
acounting  -> acount  
acount     -> acount  
acounts    -> acount  
accounnt   -> accounnt  

Result: 3 terms (account, acount, accounnt) where I would have liked 1 (account), as all of these relate to the same term.

1) Correcting the spelling is a possibility, but I have never attempted that in R. Is that even possible?

2) The other option is to make a reference list, i.e. account = (accounts, account, accounting, acounting, acount, acounts, accounnt), and then replace all occurrences with the master term. How would I do this in R?

Once again, any help/suggestions would be greatly appreciated.

Answer

MrFlick · Jun 27, 2014

We could set up a list of synonyms and replace those values. For example:

synonyms <- list(
    list(word="account", syns=c("acount", "accounnt"))
)

This says we want to replace "acount" and "accounnt" with "account" (I'm assuming we're doing this after stemming). Now let's create test data.

raw<-c("accounts", "account", "accounting", "acounting", 
     "acount", "acounts", "accounnt")

And now let's define a transformation function that will replace the words in our list with the primary synonym.

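library(tm)
# Wrap the replacement in content_transformer so tm_map can apply it to each document.
replaceSynonyms <- content_transformer(function(x, syn=NULL) { 
    # Fold over the synonym list: each entry b rewrites the text a in turn.
    Reduce(function(a,b) {
        # Match any of b$syns as a whole word and replace it with b$word.
        gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word, a)}, syn, x)   
})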

Here we use the content_transformer function to define a custom transformation. Reduce steps through the synonym list, and for each entry a gsub replaces any of its synonyms, matched as whole words, with the master word. We can then use this on a corpus:

tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)

and we can see all these values are transformed into "account" as desired. To add other synonyms, just add additional lists to the main synonyms list. Each sub-list should have the names "word" and "syns".
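
As a rough illustration of adding a second sub-list, here is a minimal base R sketch that applies the same Reduce/gsub logic to a plain character vector, without tm. The "payment" entry, its misspellings, and the helper name replace_syns are made up for this example, and the input is assumed to be already stemmed.

```r
# Synonym list with two entries. The "payment" entry is hypothetical,
# added only to show how a second sub-list is declared.
synonyms <- list(
    list(word = "account", syns = c("acount", "accounnt")),
    list(word = "payment", syns = c("payemnt", "paymnt"))
)

# Hypothetical helper: same Reduce/gsub logic as replaceSynonyms,
# but working directly on a character vector instead of a corpus.
replace_syns <- function(x, syn) {
    Reduce(function(a, b) {
        # Build a whole-word alternation such as \b(acount|accounnt)\b
        pattern <- paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b")
        gsub(pattern, b$word, a)
    }, syn, x)
}

# Assumed to be stemmed tokens already.
stems <- c("account", "acount", "accounnt", "payemnt", "paymnt")
result <- replace_syns(stems, synonyms)
print(result)
```

Each token should come back as the master word of the sub-list it belongs to. Because the pattern is anchored with \\b, only whole words are replaced, so a longer word that merely contains "acount" is left alone.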