Stemming algorithm that produces real words

Dave picture Dave · Oct 10, 2008 · Viewed 34.8k times · Source

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straight forward. However I need some help now stemming the resulting word list to avoid duplicates. Example: Community / Communities

I've used an implementation of Porter Stemmer algorithm (I'm writing in PHP by the way):

http://tartarus.org/~martin/PorterStemmer/php.txt

This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".

I've tried "Snowball" (suggested within another Stack Overflow thread).

http://snowball.tartarus.org/demo.php

For my example (community / communities), Snowball stems to "communiti".

Question

Are there any other stemming algorithms that will do this? Has anyone else solved this problem?

My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter to be the actual word to display.

Answer

Kaarel picture Kaarel · Mar 5, 2009

If I understand correctly, then what you need is not a stemmer but a lemmatizer. Lemmatizer is a tool with knowledge about endings like -ies, -ed, etc., and exceptional wordforms like written, etc. Lemmatizer maps the input wordform to its lemma, which is guaranteed to be a "real" word.

There are many lemmatizers for English, I've only used morpha though. Morpha is just a big lex-file which you can compile into an executable. Usage example:

$ cat test.txt 
Community
Communities
$ cat test.txt | ./morpha -uc
Community
Community

You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html