R tm package vcorpus: Error in converting corpus to data frame

lmcshane · Jul 11, 2014

I am using the tm package to clean up some data using the following code:

mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus, removePunctuation)

I then want to convert the corpus back into a data frame in order to export a text file that contains the data in the original format of a data frame. I have tried the following:

dataframe <- as.data.frame(mycorpus)

But this returns an error:

Error in as.data.frame.default(mycorpus) : cannot coerce class "c("VCorpus", "Corpus")" to a data.frame

How can I convert a corpus into a data frame?

Answer

MrFlick · Jul 11, 2014

Your corpus is really just a character vector with some extra attributes, so it's best to convert it to character and then store that in a data.frame, like so:

library(tm)
x <- c("Hello. Sir!","Tacos? On Tuesday?!?")
mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus, removePunctuation)

dataframe <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")),
    stringsAsFactors=FALSE)

which returns

              text
1        Hello Sir
2 Tacos On Tuesday
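
To see why `as.data.frame()` has nothing to work with, it can help to look at what the corpus holds before converting it. The snippet below is only an inspection sketch: it assumes a `VCorpus` built explicitly with `VCorpus()` (the class named in the error message) and that `content()` from the NLP package is attached along with tm.

```r
library(tm)

x <- c("Hello. Sir!", "Tacos? On Tuesday?!?")
# Assumption: build a VCorpus explicitly so the structure does not
# depend on what Corpus() returns in the installed tm version.
mycorpus <- VCorpus(VectorSource(x))

class(mycorpus)            # corpus classes, none of which has an as.data.frame method
first_doc <- mycorpus[[1]] # a single document from the corpus
class(first_doc)           # class of that document
content(first_doc)         # the document's text
as.character(first_doc)    # the same text obtained by coercion
```

Each document carries its text in a content slot, which is what the `sapply()` call above pulls out.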

UPDATE: With newer versions of tm, they seem to have updated the as.list.SimpleCorpus method, which really messes with using sapply and lapply. Now I guess you'd have to use

dataframe <- data.frame(text=sapply(mycorpus, identity),
    stringsAsFactors=FALSE)
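
Because the behaviour differs between tm versions, one possible workaround is to coerce each document with `as.character()`, which should work whether iterating over the corpus yields document objects or plain strings. This is a sketch, not the package's documented conversion route: the helper name `corpus_to_df` and the output file name are hypothetical, and it assumes every document is plain text. It also writes the result to a text file, which is what the question was ultimately after.

```r
library(tm)

# Hypothetical helper: returns one row per document.
corpus_to_df <- function(corpus) {
  text <- vapply(
    corpus,
    # Collapse multi-line documents into a single string per document.
    function(doc) paste(as.character(doc), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(text = text, stringsAsFactors = FALSE)
}

x <- c("Hello. Sir!", "Tacos? On Tuesday?!?")
# Assumption: VCorpus() is used here instead of Corpus().
mycorpus <- tm_map(VCorpus(VectorSource(x)), removePunctuation)
dataframe <- corpus_to_df(mycorpus)

# Export as a tab-separated text file (file name is an assumption).
out_file <- file.path(tempdir(), "corpus_export.txt")
write.table(dataframe, file = out_file, sep = "\t",
            row.names = FALSE, quote = FALSE)
```

Using `vapply()` with `character(1)` makes the helper fail loudly if a document does not reduce to a single string, rather than silently producing a ragged result.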