Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.
Web Search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this and found several relevant solutions.
Background: Of these, I preferred the solution that uses NGramTokenizer in the RWeka package, but I ran into a problem. In the example code below, I create three documents and place them in a corpus. Note that Docs 1 and 2 each contain two words, while Doc 3 contains only one word. My dictionary keywords are two bigrams and a unigram.
Problem: The NGramTokenizer solution from the links above does not correctly count the unigram keyword in Doc 3.
library(tm)
library(RWeka)

# Three documents: Docs 1 and 2 have two words each; Doc 3 has one.
my.docs = c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus = Corpus(VectorSource(my.docs))

# Dictionary: two bigrams and one unigram.
my.dict = c('jedi master', 'jedi grandmaster', 'jedi')

# Tokenize into both unigrams and bigrams.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

inspect(DocumentTermMatrix(my.corpus,
                           control = list(tokenize = BigramTokenizer,
                                          dictionary = my.dict)))
# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs jedi jedi grandmaster jedi master
#    1    1                0           1
#    2    1                1           0
#    3    0                0           0
I was expecting the row for Doc 3 to give 1 for jedi and 0 for the other two terms. Is there something I am misunderstanding?
I ran into the same problem and found that the token-counting functions in the tm package rely on an option called wordLengths, a vector of two numbers giving the minimum and maximum length of tokens to keep. By default, tm uses a minimum word length of 3 characters (wordLengths = c(3, Inf)). You can override this by adding it to the control list in the call to DocumentTermMatrix:
DocumentTermMatrix(my.corpus,
                   control = list(tokenize = BigramTokenizer,
                                  wordLengths = c(1, Inf)))
However, your 'jedi' word is more than 3 characters long, so wordLengths alone may not explain your case. Still, you may have changed the option's value earlier while experimenting with n-gram counting, so it is worth trying. Also look at the bounds option, which tells tm to discard terms that appear less or more frequently than the specified limits.
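For reference, here is a sketch of how both options can be combined in one call; the c(1, Inf) values are illustrative (they keep every term of any length that appears in at least one document), not a recommendation for your data:

```r
# Sketch: pass wordLengths and bounds together in the control list.
# wordLengths = c(1, Inf) keeps tokens of any length;
# bounds = list(global = c(1, Inf)) keeps terms appearing in at least
# one document -- raise the lower limit to drop rare terms.
inspect(DocumentTermMatrix(my.corpus,
                           control = list(tokenize = BigramTokenizer,
                                          dictionary = my.dict,
                                          wordLengths = c(1, Inf),
                                          bounds = list(global = c(1, Inf)))))
```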