I am using the tm package for text analysis of repair data. I read the data into a data frame, converted it to a Corpus object, and applied various cleaning methods (lower, stripWhitespace, removestopwords and so on).
I kept a backup copy of the Corpus object to use for stemCompletion.
I then ran stemDocument using the tm_map function; the words in my object were stemmed
and the results were as expected.
When I run stemCompletion using the tm_map function, it does not work and I get the error below:
Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"
I ran traceback() and got the following steps:
> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)
How can I resolve this error?
I received the same error when using tm v0.6. I suspect this occurs because stemCompletion is not in the default transformations for this version of the tm package:
> getTransformations
function ()
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument",
"stripWhitespace")
<environment: namespace:tm>
Now, the tolower function has the same problem, but can be made operational by using the content_transformer function. I tried a similar approach for stemCompletion but was not successful.
Note that even though stemCompletion isn't a default transformation, it still works when manually fed stemmed words:
> stemCompletion("compani",dictCorpus)
compani
"companies"
So that I could continue with my work, I manually split each document in the corpus on single spaces, fed the resulting words through stemCompletion, and concatenated them back together with the following (clunky and not graceful!) function:
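stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the document into words, complete each stem, then re-join
  words <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(words, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}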
where dictCorpus is just a copy of the cleaned corpus, taken before it is stemmed. The extra stripWhitespace is specific to my corpus, but is likely benign for a general corpus. You may want to change the type option from "shortest" as needed.
For a full example, let's set up a dummy corpus using the crude data in the tm package:
> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)
> # Define modified stemCompletion function
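> stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the document into words, complete each stem, then re-join
  words <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(words, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}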
> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter
> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today
made light fall oil product price weak crude oil market compani spokeswoman said diamond
latest line us oil compani cut contract post price last two day cite weak oil market reuter
> # Stem completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today
made light fall oil product price weak crude oil market companies spokeswoman said diamond
latest line us oil companies cut contract posted price last two day cited weak oil market reuter
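Note: this example has one oddity. The misspelled word "copany" in the original text is stemmed to "copani", and stemCompletion then returns NA for it, presumably because no word in the dictionary starts with "copani" (the dictionary only contains the misspelling "copany"). I am not sure how best to correct this.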
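One possible workaround for the NA case is to fall back to the stem itself whenever no completion is found. The following is only a minimal base R sketch of that idea, not something tm provides: keep_stem_if_na is a hypothetical helper name, and the completed vector is a hand-written stand-in for what stemCompletion appears to return in the output above (NA for stems without a match).

```r
# Hypothetical helper (not part of tm): keep the stem itself
# whenever the completion step returned NA for it.
keep_stem_if_na <- function(stems, completed) {
  completed <- unname(completed)
  ifelse(is.na(completed), stems, completed)
}

# Stand-in values for illustration; in practice `completed` would come from
# stemCompletion(stems, dictionary = dictCorpus, type = "shortest").
stems     <- c("reduct", "copani", "compani")
completed <- c(reduct = "reduction", copani = NA, compani = "companies")

result <- keep_stem_if_na(stems, completed)
print(result)
```

With this fallback the misspelled word would stay as "copani" in the output instead of turning into the string "NA". Whether a bare stem is preferable to NA depends on what you do with the text afterwards.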
To run stemCompletion_mod over the entire corpus, I just use sapply (or parSapply with the snow package).
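To make the per-document pattern concrete without depending on tm, here is a rough base R sketch of the same split, complete, re-join approach applied across several documents with sapply. Everything in it is an assumption for illustration: complete_stems is a simplified stand-in for stemCompletion with type="shortest" (it is not tm's implementation), complete_doc plays the role of stemCompletion_mod, and the toy documents and dictionary are made up.

```r
# Stand-in for tm's stemCompletion (hypothetical, base R only): return the
# shortest dictionary word that starts with each stem, or NA if none does.
complete_stems <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) NA_character_ else hits[which.min(nchar(hits))]
  }, character(1), USE.NAMES = FALSE)
}

# Per-document wrapper in the spirit of stemCompletion_mod:
# split on single spaces, complete each stem, re-join.
complete_doc <- function(doc, dictionary) {
  stems <- unlist(strsplit(doc, " ", fixed = TRUE))
  paste(complete_stems(stems, dictionary), collapse = " ")
}

# Toy stemmed documents and dictionary (made up for illustration).
stemmed_docs <- c(doc1 = "compani cut price", doc2 = "oil compani post price")
dictionary   <- c("companies", "company", "cut", "price", "prices", "oil", "posted")

# Apply the wrapper to every document; names of the input are kept.
completed_docs <- sapply(stemmed_docs, complete_doc, dictionary = dictionary)
print(completed_docs)
```

With the real corpus, the sapply call would take docs and stemCompletion_mod in place of the toy vector and complete_doc.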
Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.