stemCompletion is not working

tm
Sunil · Aug 8, 2014

I am using the tm package for text analysis of repair data. I read the data into a data frame, converted it to a Corpus object, and cleaned it with various transformations such as tolower, stripWhitespace, stop-word removal, and so on.

I kept a copy of the Corpus object at that point to use as the dictionary for stemCompletion.

I then ran stemDocument through tm_map; the words in my corpus were stemmed and the results were as expected.

When I run stemCompletion through tm_map, however, it does not work and I get the error below:

Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"

I ran traceback() and got the following:

> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)

How can I resolve this error?

Answer

cdxsza · Aug 19, 2014

I received the same error when using tm v0.6. I suspect this occurs because stemCompletion is not in the default transformations for this version of the tm package:

>  getTransformations
function () 
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument", 
    "stripWhitespace")
<environment: namespace:tm>

Now, the tolower function has the same problem, but can be made operational by using the content_transformer function. I tried a similar approach for stemCompletion but was not successful.

Note, even though stemCompletion isn't a default transformation, it still works when manually fed stemmed words:

> stemCompletion("compani",dictCorpus)
    compani 
"companies" 

So that I could continue with my work, I manually split each document in the corpus on single spaces, fed the stems through stemCompletion, and concatenated the results back together with the following (clunky and not graceful!) function:

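stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split on single spaces, complete each stem, then re-join into one document
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}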

where dictCorpus is just a copy of the cleaned corpus, taken before it is stemmed. The extra stripWhitespace is specific to my corpus, but is likely harmless for a general corpus. You may want to change the type option from "shortest" as needed.


For a full example, let's set up a dummy corpus using the crude data in the tm package:

> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)

> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}

> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter

> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today 
made light fall oil product price weak crude oil market compani spokeswoman said diamond 
latest line us oil compani cut contract post price last two day cite weak oil market reuter

> # Stem-completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today 
made light fall oil product price weak crude oil market companies spokeswoman said diamond 
latest line us oil companies cut contract posted price last two day cited weak oil market reuter

Note: This example is odd, since the misspelled word "copany" in the original text is stemmed to "copani", which matches nothing in the dictionary, so stemCompletion returns NA for it. Not sure how to correct this...
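
One possible way to avoid the NA, as a minimal sketch rather than a tested fix: keep the original stem whenever stemCompletion finds no match. The names fill_missing_completions and stemCompletion_keep are hypothetical helpers, not part of tm, and the extra check for an empty string is an assumption in case a given tm version returns "" instead of NA for an unmatched stem.

```r
library(tm)

# Hypothetical helper: wherever a completion is missing (NA or ""),
# fall back to the stem that was passed in.
fill_missing_completions <- function(stems, completed) {
  completed <- unname(as.character(completed))
  missing <- is.na(completed) | completed == ""
  completed[missing] <- stems[missing]
  completed
}

# Hypothetical variant of stemCompletion_mod that keeps unmatched stems.
stemCompletion_keep <- function(x, dict, type = "shortest") {
  stems <- unlist(strsplit(as.character(x), " ", fixed = TRUE))
  stems <- stems[nzchar(stems)]  # drop empty tokens left by repeated spaces
  completed <- stemCompletion(stems, dictionary = dict, type = type)
  PlainTextDocument(paste(fill_missing_completions(stems, completed),
                          collapse = " "))
}
```

With this variant an unmatched stem such as "copani" would be left as it is instead of being replaced by NA. The misspelling itself would still have to be corrected in the source text.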

To run stemCompletion_mod over the entire corpus, I just use sapply (or parSapply from the snow package).
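
Applying the function to every document might look roughly like the sketch below. It uses a tiny made-up corpus (the two sentences in raw are illustrative data, not from the crude set), and the step that rebuilds a corpus from the completed text is an assumption about what you would want to do next. lapply is used here instead of sapply only so that the result stays a plain list of documents.

```r
library(tm)  # stemDocument additionally needs the SnowballC package installed

# Illustrative stand-in data, not from the crude corpus
raw <- c("oil companies cut prices", "the company posted prices")
dictCorpus <- VCorpus(VectorSource(raw))   # unstemmed copy used as dictionary
docs <- tm_map(dictCorpus, stemDocument)   # stemmed corpus

# Same logic as the stemCompletion_mod function defined in this answer
stemCompletion_mod <- function(x, dict = dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(
    stemCompletion(unlist(strsplit(as.character(x), " ")),
                   dictionary = dict, type = "shortest"),
    collapse = " ")))
}

# Complete every document, then rebuild a corpus from the resulting text
completed <- lapply(docs, stemCompletion_mod, dict = dictCorpus)
texts <- vapply(completed,
                function(d) paste(as.character(d), collapse = " "),
                character(1))
docs_completed <- VCorpus(VectorSource(texts))
```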

Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.