R's tm package for word count

monarque13 · Oct 22, 2014 · Viewed 17.7k times

I have a corpus with over 5000 text files. I would like to get an individual word count for each file after pre-processing each one (converting to lowercase, removing stopwords, etc.). I haven't had any luck getting the word count for the individual text files. Any help would be appreciated.

library(tm)
revs<-Corpus(DirSource("data/")) 
revs<-tm_map(revs,tolower) 
revs<-tm_map(revs,removeWords, stopwords("english")) 
revs<-tm_map(revs,removePunctuation) 
revs<-tm_map(revs,removeNumbers) 
revs<-tm_map(revs,stripWhitespace) 
dtm<-DocumentTermMatrix(revs) 

Answer

Ben · Nov 2, 2014

As Tyler notes, your question is incomplete without a reproducible example. Here's how to make a reproducible example for this kind of question: use the data that comes built-in with the package.

library("tm") # version 0.6, you seem to be using an older version
data(crude)
revs <- tm_map(crude, content_transformer(tolower)) 
revs <- tm_map(revs, removeWords, stopwords("english")) 
revs <- tm_map(revs, removePunctuation) 
revs <- tm_map(revs, removeNumbers) 
revs <- tm_map(revs, stripWhitespace) 
dtm <- DocumentTermMatrix(revs)
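
Before counting, it can help to sanity-check the matrix. This is just a quick inspection step (not part of the original answer): `dtm` should have one row per document in the corpus, and `dim`/`inspect` let you confirm that.

```r
# Quick sanity check: one row per document, one column per distinct term
dim(dtm)                 # e.g. 20 documents x number of terms for the crude data
inspect(dtm[1:3, 1:5])   # peek at a few document/term counts
```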

And here's how to get a word count per document. Each row of the dtm corresponds to one document, so summing across a row's columns gives that document's word count (note that this counts the tokens remaining after pre-processing, not the raw words in the original file):

# Word count per document
rowSums(as.matrix(dtm))
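
Since `rowSums` returns a vector named by document ID, one natural follow-up (a sketch, not part of the original answer) is to collect the counts into a data frame so you can sort or export them for all 5000+ files:

```r
# Collect per-document counts into a data frame, largest first
word_counts <- rowSums(as.matrix(dtm))
counts_df <- data.frame(doc = names(word_counts),
                        words = word_counts,
                        row.names = NULL)
head(counts_df[order(-counts_df$words), ])
```

For a very large corpus, `as.matrix` can be memory-hungry; `slam::row_sums(dtm)` computes the same sums directly on the sparse matrix.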