Frequency Per Term - R TM DocumentTermMatrix

Question 1

r tm term-document-matrix

user1994952 · Jan 20, 2013 · Viewed 9.1k times

Answer

It appears to be a sparse (triplet) representation of the matrix. The frequencies seem to be stored in the "v" element: each entry of "v" belongs to the term whose position is given by the matching entry of "j", which indexes into dimnames(dtm)$Terms. Why not provide dput(head(results, 30)) so your code (and your SO audience) will have something to work on? After playing around with the examples in the package, I suspect you actually want something along the lines of:

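    library(tm)
    tdm <- TermDocumentMatrix(x)  # x is the corpus built in the question
    # as.matrix() gives the full dense counts; inspect() may only print a sample in recent tm versions
    z <- as.matrix(tdm[c("the", "is", "a"), ])
    rowSums(z)  # total frequency of each term across all documents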
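
To make the "v"/"j"/Terms lookup described above concrete, here is a minimal base-R sketch. It does not use tm at all: `toy` is a hypothetical hand-built list with the same field names as the str(dtm) output in the question, and its counts are made up for illustration. The idea is simply to sum "v" grouped by the term index "j".

```r
# Toy stand-in for a DocumentTermMatrix: same i/j/v triplet fields as in
# the str(dtm) output, but with made-up values (not the asker's data).
toy <- list(
  i = c(1L, 1L, 2L, 2L, 3L),   # document (row) index of each non-zero cell
  j = c(1L, 3L, 1L, 2L, 3L),   # term (column) index of each non-zero cell
  v = c(2, 1, 1, 4, 3),        # count stored in that cell
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Docs  = c("1", "2", "3"),
                  Terms = c("aardvark", "aaron", "abbie"))
)

# Sum the counts in v grouped by term index j. Giving factor() explicit
# levels keeps every term in the result, even one with no non-zero cell.
totals <- tapply(toy$v, factor(toy$j, levels = seq_len(toy$ncol)), sum)
totals[is.na(totals)] <- 0   # terms with no entries come back as NA

# Pair each total with its term name to get a Term / Count table.
freq <- data.frame(Term  = toy$dimnames$Terms,
                   Count = as.vector(totals),
                   stringsAsFactors = FALSE)

# Most frequent terms first
freq <- freq[order(-freq$Count), ]
print(freq)
```

On a real tm object, slam::col_sums(dtm) should give the same per-term totals without converting to a dense matrix, assuming the slam package (which tm builds on) is installed.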

Question 2

I'm very new to R and cannot quite wrap my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package. It contains the terms and their frequencies, but I cannot figure out how to access them.

Ideally, I would like:

    Term  # 
    "the" 200 
    "is"  400 
    "a"   200

Currently my code is:

    library(tm)
    common.words <- c("amp","@RT","I","http","https", stopwords("english"), "you")
    x <- Corpus(VectorSource(results)) 
    x <- tm_map(x, stripWhitespace) 
    x <- tm_map(x, removeNumbers) 
    x <- tm_map(x, removePunctuation) 
    x <- tm_map(x, stripWhitespace)

    dtm <- DocumentTermMatrix(x)
    for (i in 1:length(common.words)) {
        dtm <- dtm[, !colnames(dtm) %in% c(common.words[i])]
    }

This is the output from str(dtm)

    List of 6
     $ i       : int [1:9769] 1 1 1 1 1 1 1 1 2 2 ...
     $ j       : int [1:9769] 1596 1684 1858 2112 2175 2490 2714 2814 873 961 ...
     $ v       : num [1:9769] 1 1 2 1 1 2 1 1 1 1 ...
     $ nrow    : int 1477
     $ ncol    : int 3201
     $ dimnames:List of 2
      ..$ Docs : chr [1:1477] "1" "2" "3" "4" ...
      ..$ Terms: chr [1:3201] "\u0093\u0085a" "aardvark" "aaron" "abbie" ...
     - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
     - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Thank you,

-A
