R, tm-error of transformation drops documents

Julie · Aug 21, 2018 · Viewed 14.6k times

I want to build a network based on the weights of keywords extracted from text. When I run the code that calls tm_map, I get an error:

library(tm)
library(NLP)
library(openNLP)

text = c('.......')
corp <- Corpus(VectorSource(text))
corp <- tm_map(corp, stripWhitespace)

Warning message:
In tm_map.SimpleCorpus(corp, stripWhitespace) :
transformation drops documents

corp <- tm_map(corp, tolower)

Warning message:
In tm_map.SimpleCorpus(corp, tolower) : transformation drops documents

The code worked two months ago, but now that I'm trying it on new data it no longer works. Can anyone show me where I went wrong? Thank you. I also tried the command below, but it doesn't work either.

corp <- tm_map(corp, content_transformer(stripWhitespace))

Answer

phiver · Aug 21, 2018

The code should still work. You are getting a warning, not an error. This warning only appears when you build a corpus from a VectorSource using Corpus instead of VCorpus.

The reason is a check in the underlying code that compares the number of names of the corpus content with the length of the corpus content. When the text is read in as a vector there are no document names, so the two numbers differ and this warning pops up. It is only a warning; no documents have been dropped.
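To make that comparison concrete, here is a rough base R sketch of the kind of check described. It is a simplified illustration, not tm's actual source, and the helper name `warns_about_dropped_docs` is made up for this example.

```r
# Simplified illustration of the check described above; NOT tm's real code.
# `warns_about_dropped_docs` is a hypothetical helper name.
warns_about_dropped_docs <- function(content) {
  n <- names(content)            # NULL when the vector has no names
  length(content) != length(n)   # TRUE means the counts disagree
}

unnamed <- c("this is my text with some other text and some more")
named   <- c(doc1 = "this is my text with some other text and some more")

warns_about_dropped_docs(unnamed)  # 1 document vs 0 names
warns_about_dropped_docs(named)    # 1 document vs 1 name
```

With an unnamed vector the counts disagree even though every document is still present, which is why the message is misleading here.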

See the difference between the two examples:

library(tm)

text <- c("this is my text with some other text and some more")

# warning based on Corpus and VectorSource
text_corpus <- Corpus(VectorSource(text))

# warning appears running following line
tm_map(text_corpus, content_transformer(tolower))
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1
Warning message:
In tm_map.SimpleCorpus(text_corpus, content_transformer(tolower)) :
  transformation drops documents

# Using VCorpus
text_corpus <- VCorpus(VectorSource(text))

# warning doesn't appear
tm_map(text_corpus, content_transformer(tolower))
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
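If you want to convince yourself that nothing was actually dropped, a quick check along these lines should do it. This is a minimal sketch that assumes the tm package is installed; the variable name `lowered` is only for illustration.

```r
library(tm)

text <- c("this is my text with some other text and some more")
text_corpus <- Corpus(VectorSource(text))

# Silence the warning for this one call so the comparison output stays readable.
lowered <- suppressWarnings(tm_map(text_corpus, content_transformer(tolower)))

# Compare the number of documents before and after the transformation.
length(text_corpus) == length(lowered)

# Inspect the transformed text to confirm it is still there.
content(lowered)
```

If the first comparison returns TRUE, the corpus has the same number of documents after the transformation, so the warning can safely be ignored in this situation.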