In R I used the [tm package][1]
for building a term-document matrix from a corpus of documents.
My goal is to extract word-associations from all bigrams in the term document matrix and return for each the top three or some. Therefore I'm looking for a variable that holds all row.names from the matrix so the function findAssocs()
can do his job.
This is my code so far:
library(tm)
library(RWeka)
txtData <- read.csv("file.csv", header = T, sep = ",")
txtCorpus <- Corpus(VectorSource(txtData$text))
...further preprocessing
#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(txtCorpus, control = list(tokenize = BigramTokenizer))
#term argument holds two words since the BigramTokenizer extracted all pairs from txtCorpus
findAssocs(txtTdmBi, "cat shop", 0.5)
cat cabi cat scratch ...
0.96 0.91
I tried to define a variable with all the row.names from txtTdmBi
and feed it to the findAssocs()
function. However, with the following result:
allRows <- c(row.names(txtTdmBi))
findAssocs(txtTdmBi, allRows, 0.5)
Error in which(x[term, ] > corlimit) : subscript out of bounds
In addition: Warning message:
In term == Terms(x) :
longer object length is not a multiple of shorter object length
Because extracting associations for a term spent over multiple term-document matrices is already explained here, I guess it would be possible to find the associations for multiple terms in a single term-document matrix. Except how?
I hope someone can clarify me how to solve this. Thanks in advance for any support.
If I understand correctly, an lapply
solution is probably the way to answer your question. This is the same approach as the answer that you link to, but here's a self-contained example that might be closer to your use case:
Load libraries and reproducible data (please include these in your future questions here)
library(tm)
library(RWeka)
data(crude)
Your bigram tokenizer...
#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
Check that it worked by inspecting a random sample...
inspect(txtTdmBi[1000:1005, 10:15])
A term-document matrix (6 terms, 6 documents)
Non-/sparse entries: 1/35
Sparsity : 97%
Maximal term length: 18
Weighting : term frequency (tf)
Docs
Terms 248 273 349 352 353 368
for their 0 0 0 0 0 0
for west 0 0 0 0 0 0
forced it 0 0 0 0 0 0
forced to 0 0 0 0 0 0
forces trying 1 0 0 0 0 0
foreign investment 0 0 0 0 0 0
Here is the answer to your question:
Now use a lapply
function to calculate the associated words for every item in the vector of terms in the term-document matrix. The vector of terms is most simply accessed with txtTdmBi$dimnames$Terms
. For example txtTdmBi$dimnames$Terms[[1005]]
is "foreign investment".
Here I've used llply
from the plyr
package so we can have a progress bar (comforting for big jobs), but it's basically the same as the base lapply
function.
library(plyr)
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5), .progress = "text" )
The output is a list where each item in the list is a vector of named numbers where the name is the term and the number is the correlation value. For example, to see the terms associated with "foreign investment", we can access the list like so:
dat[[1005]]
and here are the terms associated with that term (I've just pasted in the top few)
168 million 1986 was 1987 early 300 mln 31 pct
1.00 1.00 1.00 1.00 1.00
a bit a crossroads a leading a political a population
1.00 1.00 1.00 1.00 1.00
a reduced a series a slightly about zero activity continues
1.00 1.00 1.00 1.00 1.00
advisers are agricultural sector agriculture the all such also reviews
1.00 1.00 1.00 1.00 1.00
and advisers and attract and imports and liberalised and steel
1.00 1.00 1.00 1.00 1.00
and trade and virtual announced since appears to are equally
1.00 1.00 1.00 1.00 1.00
are recommending areas for areas of as it as steps
1.00 1.00 1.00 1.00 1.00
asia with asian member assesses indonesia attract new balance of
1.00 1.00 1.00 1.00 1.00
Is that what you want to do?
Incidentally, if your term-document matrix is very large, you may want to try this version of findAssocs
:
# u is a term document matrix
# term is your term
# corlimit is a value -1 to 1
findAssocsBig <- function(u, term, corlimit){
suppressWarnings(x.cor <- gamlr::corr(t(u[ !u$dimnames$Terms == term, ]),
as.matrix(t(u[ u$dimnames$Terms == term, ])) ))
x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
return(x)
}
This can be used like so:
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5), .progress = "text" )
The advantage of this is that it uses a different method of converting the TDM to a matrix tm:findAssocs
. This different method uses memory more efficiently and so prevents this kind of message: Error: cannot allocate vector of size 1.9 Gb
from occurring.
Quick benchmarking shows that both findAssocs
functions are about the same speed, so the main difference is in the use of memory:
library(microbenchmark)
microbenchmark(
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)),
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)),
times = 10)
Unit: seconds
expr min lq median
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)) 10.82369 11.03968 11.25492
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)) 10.70980 10.85640 11.14156
uq max neval
11.39326 11.89754 10
11.18877 11.97978 10