Removing stopwords from a user-defined corpus in R

StatsSorceress · May 30, 2016

I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert to a Corpus object:

library(tm) #Corpus, VectorSource, tm_map, removeWords and stopwords come from tm
documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But this last line results in the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here but an answer was not given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit)

Answer

Mhairi McNeill · May 30, 2016

When I run into tm problems I often end up just editing the original text.

For removing words it's a little awkward, but you can paste together a regex from tm's stopword list.

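library(tm)  # for stopwords()
# Build one alternation, wrapping every stopword in \b word boundaries
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
# documents is the plain character vector here, not a Corpus object
documents = stringr::str_replace_all(documents, stopwords_regex, '')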

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
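
The removed words leave stray spaces behind, as the output above shows. If that matters, one possible follow-up is to collapse the whitespace afterwards. This is only a minimal sketch in base R: `stop_list`, `docs` and `cleaned` are names made up for the illustration, and the short `stop_list` is a hypothetical stand-in for tm's much longer stopwords('en') so that the snippet runs without any packages.

```r
# Hypothetical stand-in for tm::stopwords('en'); the real list is much longer.
stop_list <- c("she", "had", "for", "the", "this", "was")

# Two of the documents, already lower-cased and stripped of punctuation.
docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# Same construction as above: wrap every word in \b word boundaries.
stopwords_regex <- paste0("\\b", paste(stop_list, collapse = "\\b|\\b"), "\\b")

# Remove the stopwords (base gsub instead of stringr, an assumption of this sketch).
cleaned <- gsub(stopwords_regex, "", docs, perl = TRUE)

# Collapse runs of whitespace to one space, then trim both ends.
cleaned <- gsub("\\s+", " ", cleaned)
cleaned <- gsub("^\\s+|\\s+$", "", cleaned)

print(cleaned)
# [1] "toast breakfast"          "coffee morning excellent"
```

The trimming is done with gsub rather than trimws() because trimws() only exists from R 3.2.0 onwards, and the question was asked on R 3.0.2.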