Suppose I have text-based training data and testing data. To be more specific, I have two data sets - training and testing - and both of them have one column which contains the text of interest for the job at hand.
I used the tm package in R to process the text column in the training data set. After removing white space, punctuation, and stop words, I stemmed the corpus and finally created a document term matrix of 1-grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count greater than 50.
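For reference, a minimal sketch of that preprocessing in tm, assuming the text lives in a column called text of a data frame called train (both names are hypothetical):

library(tm)

# build a corpus from the (hypothetical) text column of the training data
train.corpus <- VCorpus(VectorSource(train$text))
train.corpus <- tm_map(train.corpus, content_transformer(tolower))
train.corpus <- tm_map(train.corpus, removePunctuation)
train.corpus <- tm_map(train.corpus, removeWords, stopwords("english"))
train.corpus <- tm_map(train.corpus, stripWhitespace)
train.corpus <- tm_map(train.corpus, stemDocument)

# 1-gram document term matrix with raw counts
train.dtm <- DocumentTermMatrix(train.corpus)

# keep only terms occurring strictly more than 50 times in total (the cut-off)
frequent <- findFreqTerms(train.dtm, lowfreq = 51)
train.dtm <- train.dtm[, frequent]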
Following this, I train, say, a GLMNET model using the DTM and the dependent variable (which was present in the training data). Everything runs smoothly up to this point.
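A hedged sketch of that modelling step with glmnet, assuming a binary outcome in a hypothetical train$label column; the DTM (a slam triplet matrix) is converted to the sparse format glmnet accepts:

library(glmnet)
library(Matrix)

# convert the DTM's triplet representation to a dgCMatrix
x.train <- sparseMatrix(i = train.dtm$i, j = train.dtm$j, x = train.dtm$v,
    dims = dim(train.dtm), dimnames = dimnames(train.dtm))

# cross-validated logistic regression with elastic-net regularisation
fit <- cv.glmnet(x.train, y = train$label, family = "binomial")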
However, how do I proceed when I want to score/predict the model on the testing data or any new data that might come in the future?
Specifically, what I am trying to find out is how to create the exact same DTM on new data.
If the new data set does not contain any of the words in the original training data, then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.
Any ideas/thoughts?
tm has so many pitfalls... See the much more efficient text2vec package and its vectorization vignette, which fully answers this question.
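A rough sketch of the text2vec workflow (column and object names are assumptions): the vectorizer built from the training vocabulary is simply reused on new text, which yields a DTM with identical columns.

library(text2vec)

it.train <- itoken(train$text, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it.train)
vocab <- prune_vocabulary(vocab, term_count_min = 51)  # same idea as the 50 cut-off
vectorizer <- vocab_vectorizer(vocab)
dtm.train <- create_dtm(it.train, vectorizer)

# the same vectorizer applied to new data gives a structurally identical DTM
it.test <- itoken(test$text, preprocessor = tolower, tokenizer = word_tokenizer)
dtm.test <- create_dtm(it.test, vectorizer)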
For tm, here is probably one more simple way to reconstruct a DTM for a second corpus:
crude2.dtm <- DocumentTermMatrix(crude2,
    control = list(dictionary = Terms(crude1.dtm), wordLengths = c(3, 10)))
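Because the dictionary argument restricts the new DTM to exactly the training terms (terms absent from the new corpus simply get zero counts), the result should line up column-for-column with the training matrix. Applied to the question's setup, a hedged sketch continuing the hypothetical objects from above:

# rebuild the corpus for the new data and repeat the same tm_map
# preprocessing chain used for the training corpus, then:
test.corpus <- VCorpus(VectorSource(test$text))
# ... same tm_map steps as for training ...

# dictionary = training terms gives identical columns; unseen terms are zero
test.dtm <- DocumentTermMatrix(test.corpus,
    control = list(dictionary = Terms(train.dtm)))

# convert and score with the (hypothetical) cv.glmnet fit from earlier
x.test <- Matrix::sparseMatrix(i = test.dtm$i, j = test.dtm$j, x = test.dtm$v,
    dims = dim(test.dtm), dimnames = dimnames(test.dtm))
preds <- predict(fit, newx = x.test, s = "lambda.min", type = "response")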