What algorithm is used for finding ngrams?
Supposing my input data is an array of words and the size of the n-grams I want to find, what algorithm should I use?
I'm asking for code, with a preference for R. The data is stored in a database, so it could also be a PL/pgSQL function. Java is the language I know best, so I can "translate" an answer from another language if needed.
I'm not being lazy; I'm only asking for code because I don't want to reinvent the wheel by writing an algorithm that already exists.
Edit: it's important to know how many times each n-gram appears.
Edit 2: is there an R package for n-grams?
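For concreteness, the brute-force approach I have in mind is a simple sliding window over the word array (a rough base-R sketch, not tested against my real data; `words` and `n` here are just placeholder inputs):

words <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")

count_ngrams <- function(words, n) {
  if (length(words) < n) return(table(character(0)))
  # slide a window of length n over the word vector, pasting each window into one string
  grams <- vapply(
    seq_len(length(words) - n + 1),
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  # table() gives the frequency of each distinct n-gram
  sort(table(grams), decreasing = TRUE)
}

count_ngrams(words, 2)

Is there something better (or already packaged) than this?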
If you want to use R to identify n-grams, you can use the tm package together with the RWeka package. It will tell you how many times each n-gram occurs in your documents, like so:
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
A term-document matrix (6 terms, 10 documents)

Non-/sparse entries: 4/56
Sparsity           : 93%
Maximal term length: 13
Weighting          : term frequency (tf)

               Docs
Terms           127 144 191 194 211 236 237 242 246 248
  and said        0   0   0   0   0   0   0   0   0   0
  and security    0   0   0   0   0   0   0   0   1   0
  and set         0   1   0   0   0   0   0   0   0   0
  and six-month   0   0   0   0   0   0   0   1   0   0
  and some        0   0   0   0   0   0   0   0   0   0
  and stabilise   0   0   0   0   0   0   0   0   0   1
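Since you said the total number of occurrences matters, you can also sum each row of the term-document matrix to get corpus-wide counts per n-gram (a small sketch; note that as.matrix densifies the sparse matrix, so this is only sensible for modest corpora):

# total frequency of each bigram across all documents, most frequent first
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freqs)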