More efficient means of creating a corpus and DTM with 4M rows

user1477388 · Aug 15, 2014

My file has over 4M rows, and I need a more efficient way to convert my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier.

Consider the following code:

library(tm)

GetCorpus <-function(textVector)
{
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}

data <- data.frame(
  c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)

corp <- GetCorpus(data[,1])

inspect(corp)

dtm <- DocumentTermMatrix(corp)

inspect(dtm)

The output:

> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt

[[2]]
<<PlainTextDocument (metadata: 7)>>
 holds bar

[[3]]
<<PlainTextDocument (metadata: 7)>>
 child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity           : 67%
Maximal term length: 5
Weighting          : term frequency (tf)

              Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1

My question is, what can I use to create a corpus and DTM faster? It seems to be extremely slow if I use over 300k rows.

I have heard that I could use data.table but I am not sure how.

I have also looked at the qdap package, but it gives me an error when trying to load the package, plus I don't even know if it will work.

Ref. http://cran.r-project.org/web/packages/qdap/qdap.pdf

Answer

Ken Benoit · Jul 9, 2015

Which approach?

data.table is definitely the right way to go. Regex operations are slow, although the ones in stringi are much faster (in addition to being much better). Anything with

I went through many iterations of solving this problem while creating quanteda::dfm() for my quanteda package (see the GitHub repo here). The fastest solution, by far, uses the data.table and Matrix packages to index the documents and tokenised features, count the features within documents, and plug the result straight into a sparse matrix.

In the code below, I've taken as an example the texts that ship with the quanteda package, which you can (and should!) install from CRAN, or install the development version with

devtools::install_github("kbenoit/quanteda")

I'd be very interested to see how it works on your 4M documents. Based on my experience working with corpora of that size, it will work pretty well (if you have enough memory).

Note that in all my profiling, I could not improve the speed of the data.table operations through any sort of parallelisation, because of the way they are written in C.

Core of the quanteda dfm() function

Here are the bare bones of the data.table-based source code, in case anyone wants to have a go at improving it. It takes as input a list of character vectors representing the tokenized texts. In the quanteda package, the full-featured dfm() works directly on character vectors of documents, or on corpus objects, and by default implements lowercasing, removal of numbers, and removal of spacing (but these can all be modified if wished).

require(data.table)
require(Matrix)

dfm_quanteda <- function(x) {
    # x: a list of character vectors, one vector of tokens per document
    docIndex <- seq_along(x)
    if (is.null(names(x))) 
        names(docIndex) <- paste0("text", seq_along(x)) else
            names(docIndex) <- names(x)

    # long format: one row per token, tagged with its document index
    alltokens <- data.table(docIndex = rep(docIndex, lengths(x)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # if there are any "blank" features
    alltokens[, "n":=1L]
    # count each feature within each document; the count ends up in column V1
    alltokens <- alltokens[, by=list(docIndex,features), sum(n)]

    uniqueFeatures <- unique(alltokens$features)
    uniqueFeatures <- sort(uniqueFeatures)

    # give every feature a column index, then attach it with a keyed join
    featureTable <- data.table(featureIndex = seq_along(uniqueFeatures),
                               features = uniqueFeatures)
    setkey(alltokens, features)
    setkey(featureTable, features)

    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    # any feature left without a document gets a zero count in document 1
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1L, 0L)]

    # (document, feature, count) triplets go straight into a sparse matrix
    sparseMatrix(i = alltokens$docIndex, 
                 j = alltokens$featureIndex, 
                 x = alltokens$V1, 
                 dimnames=list(docs=names(docIndex), features=uniqueFeatures))
}

require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
##  user  system elapsed 
## 0.060   0.005   0.064 

That's just a snippet, of course, but the full source code is easily found on the GitHub repo (dfm-main.R).
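If you would rather not depend on quanteda's tokenizer, the tokenised input can also be built with stringi, which I mentioned above as the faster option for regex work. The following is only a minimal sketch under my own assumptions, not quanteda's implementation: the names texts, clean, tokens, long, counts, feats and m are made up for illustration, it does no stopword removal or stemming, and the counting step is a condensed variant of the data.table and Matrix approach shown above.

```r
library(stringi)
library(data.table)
library(Matrix)

# Illustrative input: the three sentences from the question.
texts <- c(text1 = "Let the big dogs hunt",
           text2 = "No holds barred",
           text3 = "My child is an honor student")

# Lowercase and strip digits, then extract words. stri_extract_all_words()
# returns one character vector per text and drops punctuation and whitespace.
clean  <- stri_replace_all_regex(stri_trans_tolower(texts), "[0-9]+", "")
tokens <- stri_extract_all_words(clean, omit_no_match = TRUE)
names(tokens) <- names(texts)  # set names explicitly rather than rely on stringi

# Long table with one row per token, then count tokens within each document.
long <- data.table(doc     = rep(seq_along(tokens), lengths(tokens)),
                   feature = unlist(tokens, use.names = FALSE))
counts <- long[, .(n = .N), by = .(doc, feature)]

# Map each feature to a column index and fill the sparse matrix directly.
feats <- sort(unique(counts$feature))
m <- sparseMatrix(i = counts$doc,
                  j = match(counts$feature, feats),
                  x = counts$n,
                  dims = c(length(tokens), length(feats)),
                  dimnames = list(docs = names(tokens), features = feats))
```

The list tokens has the shape that dfm_quanteda() above expects, so dfm_quanteda(tokens) should work as well. Passing dims explicitly keeps the row count right even when the last document has no tokens.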

quanteda on your example

How's this for simplicity?

require(quanteda)
mytext <- c("Let the big dogs hunt",
            "No holds barred",
            "My child is an honor student")
dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing 3 documents
# ... shaping tokens into data.table, found 14 total tokens
# ... stemming the tokens (english)
# ... ignoring 174 feature types, discarding 5 total features (35.7%)
# ... summing tokens by document
# ... indexing 9 feature types
# ... building sparse matrix
# ... created a 3 x 9 sparse dfm
# ... complete. Elapsed time: 0.023 seconds.

# Document-feature matrix of: 3 documents, 9 features.
# 3 x 9 sparse Matrix of class "dfmSparse"
#        features
# docs    bar big child dog hold honor hunt let student
#   text1   0   1     0   1    0     0    1   1       0
#   text2   1   0     0   0    1     0    0   0       0
#   text3   0   0     1   0    0     1    0   0       1