I have two sets of data:
a set of tags (single words like php, html, etc.)
a set of texts
I now want to build a term-document matrix giving the number of occurrences of each tag in each text.
I have looked into the R library tm and its TermDocumentMatrix function, but I do not see a way to specify the tags as input.
Is there a way to do that?
I am open to any tool (R, Python, other), although using R would be great.
Let's set the data as:
TagSet <- data.frame(c("c","java","php","javascript","android"))
colnames(TagSet)[1] <- "tag"
TextSet <- data.frame(c("How to check if a java file is a javascript script java blah","blah blah php"))
colnames(TextSet)[1] <- "text"
Now I'd like to have the TermDocumentMatrix of TextSet restricted to the tags in TagSet.
I tried this:
library(tm)
myCorpus <- Corpus(VectorSource(TextSet$text))
tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
> inspect(tdm)
A term-document matrix (7 terms, 2 documents)
Non-/sparse entries: 8/6
Sparsity           : 43%
Maximal term length: 10
Weighting          : term frequency (tf)

            Docs
Terms        1 2
  blah       1 2
  check      1 0
  file       1 0
  java       2 0
  javascript 1 0
  php        0 1
  script     1 0
But that counts the words that happen to occur in the texts, whereas I want to count occurrences of the already defined tags.
You can subset the matrix by row name:
tdm.onlytags <- tdm[rownames(tdm) %in% TagSet$tag, ]
This keeps only the rows for your specified words; you can then proceed with your analysis.
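Note that subsetting afterwards can only keep tags that made it into the matrix in the first place: tags that never occur in any text get no row, and with tm's default minimum word length of 3 a one-letter tag such as "c" is dropped during tokenization. As an alternative, the control list also accepts a dictionary entry that restricts the terms to a given vector. The following is a minimal sketch, assuming a reasonably recent tm version; the name tdm.tags, the use of VCorpus, and the wordLengths setting are my own choices and not part of the question.

```r
library(tm)

# Same data as in the question; stringsAsFactors = FALSE keeps the
# columns as plain character vectors on older R versions as well.
TagSet <- data.frame(tag = c("c", "java", "php", "javascript", "android"),
                     stringsAsFactors = FALSE)
TextSet <- data.frame(text = c("How to check if a java file is a javascript script java blah",
                               "blah blah php"),
                      stringsAsFactors = FALSE)

myCorpus <- VCorpus(VectorSource(TextSet$text))

# dictionary: tabulate only these terms (tags with no hits get a zero row).
# wordLengths: lower the default minimum length of 3 so that "c" is kept.
# Caveat: removePunctuation would turn tokens like "c++" or "c#" into "c".
tdm.tags <- TermDocumentMatrix(
  myCorpus,
  control = list(
    tolower           = TRUE,
    removePunctuation = TRUE,
    dictionary        = as.character(TagSet$tag),
    wordLengths       = c(1, Inf)
  )
)

# Dense matrix: one row per tag, one column per text.
m <- as.matrix(tdm.tags)
print(m)
```

The as.matrix() call is convenient for a small example like this one; for a large corpus it is probably better to keep working with the sparse tdm.tags object.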