My file has over 4M rows, and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier.
Consider the following code:
library(tm)
GetCorpus <- function(textVector)
{
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)                            # lowercase
  doc.corpus <- tm_map(doc.corpus, removeNumbers)                      # strip digits
  doc.corpus <- tm_map(doc.corpus, removePunctuation)                  # strip punctuation
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))  # remove stopwords
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")            # stem
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)                    # collapse whitespace
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}
data <- data.frame(
  c("Let the big dogs hunt", "No holds barred", "My child is an honor student"),
  stringsAsFactors = FALSE)
corp <- GetCorpus(data[,1])
inspect(corp)
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
The output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
              Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1
My question is: what can I use to create a corpus and DTM faster? It seems to be extremely slow if I use over 300k rows.
I have heard that I could use data.table, but I am not sure how. I have also looked at the qdap package, but it gives me an error when trying to load it, and I don't even know whether it would work anyway.
data.table is definitely the right way to go. Regex operations are slow, although the ones in stringi are much faster (in addition to being much better). Anything with heavy regex processing on millions of texts, like the tm pipeline above, will dominate the runtime.
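For instance, here is a small illustrative comparison of my own (assuming the stringi package is installed; exact timings will of course vary by machine):
# Illustrative only: base R's gsub() vs. the stringi equivalent on a
# large character vector; stri_replace_all_regex() is typically much faster.
library(stringi)
x <- rep("My child is an honor student, no. 1!", 1e5)
system.time(gsub("[0-9]+", "", x))                    # base R regex
system.time(stri_replace_all_regex(x, "[0-9]+", ""))  # stringi equivalent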
I went through many iterations of solving this problem when creating quanteda::dfm() for my quanteda package (see the GitHub repo here). The fastest solution, by far, involves using the data.table and Matrix packages to index the documents and tokenised features, count the features within documents, and plug the result straight into a sparse matrix.
In the code below, I've used as an example the texts that come with the quanteda package, which you can (and should!) install from CRAN, or as the development version from GitHub:
devtools::install_github("kbenoit/quanteda")
I'd be very interested to see how it works on your 4M documents. Based on my experience working with corpora of that size, it should work pretty well (if you have enough memory).
Note that in all my profiling, I could not improve the speed of the data.table operations through any sort of parallelisation, because of the way they are written in C++.
The dfm() function

Here are the bare bones of the data.table-based source code, in case anyone wants to have a go at improving it. It takes as input a list of character vectors representing the tokenized texts. In the quanteda package, the full-featured dfm() works directly on character vectors of documents or on corpus objects, and implements lowercasing, removal of numbers, and removal of spacing by default (though all of these can be modified if wished).
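For reference, the input it expects is just a list of character vectors, one per document; a hand-tokenized version of the question's three texts, purely for illustration, would look like this:
x <- list(text1 = c("let", "the", "big", "dogs", "hunt"),
          text2 = c("no", "holds", "barred"),
          text3 = c("my", "child", "is", "an", "honor", "student"))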
require(data.table)
require(Matrix)

dfm_quanteda <- function(x) {
    # index the documents, using supplied names if available
    docIndex <- 1:length(x)
    if (is.null(names(x)))
        names(docIndex) <- factor(paste("text", 1:length(x), sep="")) else
        names(docIndex) <- names(x)

    # one row per token, tagged with the index of its document
    alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # drop any "blank" features
    alltokens[, "n" := 1L]
    # count the occurrences of each feature within each document
    alltokens <- alltokens[, sum(n), by = list(docIndex, features)]

    # index the unique features, sorted alphabetically
    uniqueFeatures <- sort(unique(alltokens$features))
    featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                               features = uniqueFeatures)

    # join the per-document counts to the feature index
    setkey(alltokens, features)
    setkey(featureTable, features)
    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]

    # plug the (doc, feature, count) triplets straight into a sparse matrix
    sparseMatrix(i = alltokens$docIndex,
                 j = alltokens$featureIndex,
                 x = alltokens$V1,
                 dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
##    user  system elapsed
##   0.060   0.005   0.064
That's just a snippet of course, but the full source code is easily found on the GitHub repo (dfm-main.R).
How's this for simplicity?
require(quanteda)
mytext <- c("Let the big dogs hunt",
            "No holds barred",
            "My child is an honor student")
dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing 3 documents
# ... shaping tokens into data.table, found 14 total tokens
# ... stemming the tokens (english)
# ... ignoring 174 feature types, discarding 5 total features (35.7%)
# ... summing tokens by document
# ... indexing 9 feature types
# ... building sparse matrix
# ... created a 3 x 9 sparse dfm
# ... complete. Elapsed time: 0.023 seconds.
# Document-feature matrix of: 3 documents, 9 features.
# 3 x 9 sparse Matrix of class "dfmSparse"
#         features
# docs     bar big child dog hold honor hunt let student
#   text1    0   1     0   1    0     0    1   1       0
#   text2    1   0     0   0    1     0    0   0       0
#   text3    0   0     1   0    0     1    0   0       1
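And to close the loop on the original goal of passing this to a Bayesian classifier: a dfm is just a sparse Matrix object, so it hands off easily. A minimal sketch, assuming the e1071 package (not part of quanteda) and made-up class labels:
# Hypothetical handoff to Naive Bayes via e1071. Caveat: naiveBayes()
# wants a dense matrix, so for 4M documents you would need a
# sparse-aware Naive Bayes implementation instead; this is only a toy.
require(e1071)
myDfm <- dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
y <- factor(c("a", "b", "a"))    # made-up labels; real data needs many docs per class
nb <- naiveBayes(x = as.matrix(myDfm), y = y)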