R: find the most frequent groups of words in a corpus

Ollaws · May 14, 2014 · Viewed 8.4k times

Is there an easy way to find not only the most frequent terms, but also the most frequent expressions (groups of more than one word) in a text corpus in R?

Using the tm package, I can find most frequent terms like this:

tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=3, highfreq=Inf)

I can find words associated with the most frequent words using the findAssocs() function, so I could group these words manually. But how can I find the number of occurrences of these groups of words in the corpus?

Thx

Answer

knb · May 16, 2014

If I remember correctly, you can construct a TermDocumentMatrix of bigrams (pairs of adjacent words) using the RWeka package, and then process it as needed:

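library("tm")     # text mining
library("RWeka")  # tokenizers that go beyond single words (requires Java)

# Split each document into bigrams (n-grams of exactly 2 words)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# corpus is the tm corpus from the question; a VCorpus is assumed here,
# since a SimpleCorpus in recent tm versions reportedly ignores custom tokenizers
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# process tdm
# findFreqTerms(tdm, lowfreq=3, highfreq=Inf)
# ...

# Drop very sparse bigrams (those missing from 99% or more of the documents)
tdm <- removeSparseTerms(tdm, 0.99)
print("----")
print("tdm properties")
str(tdm)
# topN_percentage_wanted was not defined in the original snippet;
# 10 is an arbitrary example value
topN_percentage_wanted <- 10
# Number of rows (terms) that correspond to the top N percent
tdm_top_N_percent <- tdm$nrow / 100 * topN_percentage_wanted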
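
The question asks for the number of occurrences of each word group, which the row sums of the term-document matrix provide. The following is a minimal sketch rather than a definitive recipe: the three-document corpus is invented for illustration, and it assumes that tm and RWeka (with a working Java installation) are available.

```r
library(tm)     # corpus handling and term-document matrices
library(RWeka)  # NGramTokenizer (needs a working Java installation)

# Toy corpus, invented for illustration only
docs <- c("new york is large",
          "she moved to new york",
          "new york has many parks")
corpus <- VCorpus(VectorSource(docs))

# Same bigram tokenizer as in the answer
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Total occurrences of each bigram across all documents, most frequent first
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 5)
```

For a large corpus, as.matrix() can exhaust memory. In that case slam::row_sums(tdm) should give the same totals while keeping the matrix sparse.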

Alternatively,

# word groups that are at least 1 word and at most 5 words long
# (min and max set the n-gram length, not how often a group occurs)
wmin <- 1
wmax <- 5

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = wmin, max = wmax))
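
To make the effect of the min and max settings concrete, here is a small base-R sketch of what an n-gram range produces. The helper ngrams_in_range is hypothetical (it is not part of tm or RWeka) and only illustrates the idea; it is not how Weka implements its tokenizer.

```r
# Hypothetical helper, not part of tm or RWeka:
# returns every n-gram whose length lies between n_min and n_max words
ngrams_in_range <- function(words, n_min = 1, n_max = 5) {
  stopifnot(n_min >= 1, n_min <= n_max)
  out <- character(0)
  for (n in n_min:n_max) {
    if (n > length(words)) break  # the text is too short for longer n-grams
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

words <- strsplit("the quick brown fox", " ", fixed = TRUE)[[1]]
ngrams_in_range(words, n_min = 1, n_max = 2)
```

With n_min = 1 and n_max = 2 the result contains the four single words followed by the three adjacent word pairs. A wider range grows the vocabulary quickly, so a tight range is usually easier to work with.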

Sometimes it helps to stem the words first, so that inflected forms such as "connection" and "connections" are counted as the same term and you get "better" word groups.
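
A possible way to apply stemming before building the term-document matrix is sketched below. It assumes that the SnowballC package is installed (tm's stemDocument relies on it), and the one-line corpus is made up for illustration.

```r
library(tm)         # tm_map, stemDocument, content_transformer
library(SnowballC)  # stemmer backend that stemDocument relies on

# Toy corpus, invented for illustration only
docs <- c("connected connections connecting")
corpus <- VCorpus(VectorSource(docs))

# Lower-case first, then reduce each word to its stem
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)

# Inspect the stemmed text of the first document
content(corpus[[1]])
```

The stemmed corpus can then be passed to TermDocumentMatrix with an n-gram tokenizer as before, so that variants of the same phrase are counted together.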