This question is a possible duplicate of Lemmatizer in R or python (am, are, is -> be?), but I'm adding it again since the previous one was closed saying it was too broad and the only answer it has is not efficient (as it accesses an external website for this, which is too slow as I have very large corpus to find the lemmas for). So a part of this question will be similar to the above mentioned question.
According to Wikipedia, lemmatization is defined as:
Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
A simple Google search for lemmatization in R will only point to the package wordnet
of R. When I tried this package expecting that a character vector c("run", "ran", "running")
input to the lemmatization function would result in c("run", "run", "run")
, I saw that this package only provides functionality similar to grepl
function through various filter names and a dictionary.
An example code from wordnet
package, which gives maximum of 5 words starting with "car", as the filter name explains itself:
filter <- getTermFilter("StartsWithFilter", "car", TRUE)
terms <- getIndexTerms("NOUN", 5, filter)
sapply(terms, getLemma)
The above is NOT the lemmatization that I'm looking for. What I'm looking for is, using R
I want to find true roots of the words: (For e.g. from c("run", "ran", "running")
to c("run", "run", "run")
).
Hello you can try package koRpus
which allow to use Treetagger :
tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
TT.tknz=FALSE , lang="en",
TT.options=list(path="./TreeTagger", preset="en"))
[email protected]
## token tag lemma lttr wclass desc stop stem
## 1 run NN run 3 noun Noun, singular or mass NA NA
## 2 ran VVD run 3 verb Verb, past tense NA NA
## 3 running VVG run 7 verb Verb, gerund or present participle NA NA
See the lemma
column for the result you're asking for.