Trying to get tf-idf weighting working in R

cforster · Feb 11, 2013

I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). My working directory contains a couple of text files (the first containing the first three episodes of Ulysses, the second containing the next three episodes, if you must know).

R version: 2.15.1. sessionInfo() reports this about tm: [1] tm_0.5-8.3

Relevant bit of code:

library('tm')
corpus <- Corpus(DirSource('.'))
dtm <- DocumentTermMatrix(corpus,control=list(weight=weightTfIdf))

str(dtm)
List of 6
 $ i       : int [1:12456] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:12456] 2 10 12 17 20 24 29 30 32 34 ...
 $ v       : num [1:12456] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 2
 $ ncol    : int 10646
 $ dimnames:List of 2
  ..$ Docs : chr [1:2] "bloom.txt" "telemachiad.txt"
  ..$ Terms: chr [1:10646] "_--c'est" "_--et" "_--for" "_--goodbye," ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Note that the weighting still appears to be the default term frequency (tf) rather than the tf-idf scores that I'd like.

Apologies if I'm missing something obvious, but based on the documentation I've read, this should work. The fault, no doubt, lies not in the stars...

Answer

juba · Feb 11, 2013

If you look at the DocumentTermMatrix help page, and at the example there, you will see that the control argument is specified this way:

data(crude)
dtm <- DocumentTermMatrix(crude,
           control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                          stopwords = TRUE))

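So the weighting is specified with the list element named weighting, not weight. An element named weight is not one the function looks for, so it appears to have been ignored without any warning, which is why your str() output still reports term frequency. You can specify the weighting by passing a function name or a custom function, as in the example. The following works too: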

data(crude)
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf))
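To see why a misspelt control name can go unnoticed, here is a minimal base R sketch. It does not use tm at all: toy_tf, toy_tfidf and toy_dtm are hypothetical stand-ins written for this illustration, and the tf-idf formula (term frequency divided by document length, times log2 of the number of documents over the document frequency) is an assumed common definition that may differ in detail from what weightTfIdf does. The sketch only shows the general pattern of looking up a named list element and falling back to a default when it is missing.

```r
# Toy stand-ins, NOT the tm implementation: every name here is hypothetical.

# Default weighting: leave the raw counts untouched.
toy_tf <- function(counts) counts

# tf-idf on a documents-by-terms count matrix (assumed definition):
# term frequency normalised by document length, multiplied by
# log2(number of documents / number of documents containing the term).
toy_tfidf <- function(counts) {
  tf  <- counts / rowSums(counts)
  idf <- log2(nrow(counts) / colSums(counts > 0))
  sweep(tf, 2, idf, `*`)
}

# Looks up control$weighting and falls back to raw term frequency
# when that element is absent. Other list names are never inspected.
toy_dtm <- function(counts, control = list()) {
  weighting <- control$weighting
  if (is.null(weighting)) weighting <- toy_tf
  weighting(counts)
}

# Made-up counts for two documents and three terms.
counts <- matrix(c(2, 1, 0,
                   1, 0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Docs  = c("a.txt", "b.txt"),
                                 Terms = c("bloom", "kidney", "tower")))

misspelt <- toy_dtm(counts, control = list(weight = toy_tfidf))
correct  <- toy_dtm(counts, control = list(weighting = toy_tfidf))

print(misspelt)  # raw counts: the `weight` entry was never looked up
print(correct)   # tf-idf weights; "bloom" occurs in both documents, so it scores 0
```

With the misspelt name the result is identical to the input counts and no error is raised, which mirrors the symptom in the question. On a real DocumentTermMatrix, inspecting the weighting attribute shown at the bottom of the str() output is a quick way to confirm which weighting was actually applied.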