I need to take a paragraph of text and extract a list of "tags" from it. Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities
I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way):
http://tartarus.org/~martin/PorterStemmer/php.txt
This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".
I've tried "Snowball" (suggested within another Stack Overflow thread).
http://snowball.tartarus.org/demo.php
For my example (community / communities), Snowball stems to "communiti".
Question
Are there any other stemming algorithms that return real words? Has anyone else solved this problem?
My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter to be the actual word to display.
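A minimal sketch of that idea, written in Python for brevity (the same logic maps onto a PHP associative array keyed by stem). Note that `crude_stem` is a hypothetical stand-in for a real stemmer such as the Porter implementation linked above, and `dedupe_tags` is an illustrative name, not part of any library:

```python
def crude_stem(word):
    """Hypothetical toy stemmer; swap in a real Porter/Snowball stemmer here."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "i"      # communities -> communiti
    if word.endswith("y"):
        return word[:-1] + "i"      # community   -> communiti
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]            # tags        -> tag
    return word


def dedupe_tags(words, stem=crude_stem):
    """Group words by stem and keep the shortest surface form of each group."""
    shortest = {}
    for word in words:
        key = stem(word)
        # First word seen for a stem wins ties; a strictly shorter word replaces it.
        if key not in shortest or len(word) < len(shortest[key]):
            shortest[key] = word
    # Dicts preserve insertion order, so tags come out in first-seen order.
    return list(shortest.values())


print(dedupe_tags(["Communities", "Community", "tags", "tag"]))
```

The stem is only used as a grouping key and is never displayed, so it does not matter that it is not a real word. The weakness is that the shortest form is not always the base form, and the result is only as good as the stemmer's grouping.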
If I understand correctly, then what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool with knowledge about endings like -ies, -ed, etc., and irregular word forms like "written". It maps an input word form to its lemma, which is guaranteed to be a "real" word.
There are many lemmatizers for English, though I've only used morpha.
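To illustrate the principle only, here is a toy sketch in Python that combines a table of irregular forms with a few suffix rules. The names `EXCEPTIONS`, `SUFFIX_RULES` and `toy_lemmatize` are made up for this example, and the rules are deliberately tiny. This is not how morpha is implemented, and unlike a real lemmatizer it will still produce non-words for some inputs (for instance "making" becomes "mak"):

```python
# Irregular forms that no suffix rule can handle; real tools ship far larger tables.
EXCEPTIONS = {"written": "write", "went": "go", "children": "child"}

# (suffix, replacement) pairs tried in order; intentionally incomplete.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_lemmatize(word):
    """Return a lowercase lemma using the exception table first, then suffix rules."""
    lower = word.lower()
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]
    for suffix, replacement in SUFFIX_RULES:
        # Require a few characters before the suffix so short words are left alone.
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[: len(lower) - len(suffix)] + replacement
    return lower


for w in ["Community", "Communities", "written", "walked"]:
    print(w, "->", toy_lemmatize(w))
```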
Morpha is just a big lex-file which you can compile into an executable.
Usage example:
$ cat test.txt
Community
Communities
$ cat test.txt | ./morpha -uc
Community
Community
You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html