Python: Semantic similarity score for Strings

user8472 picture user8472 · Jun 10, 2013 · Viewed 54.7k times · Source

Are there any libraries for computing semantic similarity scores for a pair of sentences ?

I'm aware of WordNet's semantic database, and how I can generate the score for 2 words, but I'm looking for libraries that do all pre-processing tasks like port-stemming, stop word removal, etc, on whole sentences and outputs a score for how related the two sentences are.

I found a work in progress that's written using the .NET framework that computes the score using an array of pre-processing steps. Is there any project that does this in python?

I'm not looking for the sequence of operations that would help me find the score (as is asked for here)
I'd love to implement each stage on my own, or glue functions from different libraries so that it works for sentence pairs, but I need this mostly as a tool to test inferences on data.


EDIT: I was considering using NLTK and computing the score for every pair of words iterated over the two sentences, and then draw inferences from the standard deviation of the results, but I don't know if that's a legitimate estimate of similarity. Plus, that'll take a LOT of time for long strings.
Again, I'm looking for projects/libraries that already implement this intelligently. Something that lets me do this:

import amazing_semsim_package
str1='Birthday party ruined as cake explodes'
str2='Grandma mistakenly bakes cake using gunpowder'

>>similarity(str1,str2)
>>0.889

Answer

Justin Muller picture Justin Muller · Jun 19, 2013

The best package I've seen for this is Gensim, found at the Gensim Homepage. I've used it many times, and overall been very happy with it's ease of use; it is written in Python, and has an easy to follow tutorial to get you started, which compares 9 strings. It can be installed via pip, so you won't have a lot of hassle getting it installed I hope.

Which scoring algorithm you use depends heavily on the context of your problem, but I'd suggest starting of with the LSI functionality if you want something basic. (That's what the tutorial walks you through.)

If you go through the tutorial for gensim, it will walk you through comparing two strings, using the Similarities function. This will allow you to see how your stings compare to each other, or to some other sting, on the basis of the text they contain.

If you're interested in the science behind how it works, check out this paper.