Fuzzy Text Matching C#

Myles McDonnell picture Myles McDonnell · Nov 21, 2011 · Viewed 37.6k times · Source

I'm writing a desktop UI (.Net WinForms) to assist a photographer clean up his image meta data. There is a list of 66k+ phrases. Can anyone suggest a good open source/free .NET component I can use that employs some sort of algorithm to identify potential candiates for consolidation? For example there may be two or more entries which are actually the same word or phrase that only differ by whitespace or punctuation or even slight mis-spelling. The application will ultimately rely on the user to action the consolidation of phrases but having an effective way to automatically find potential candidates will prove invaluable.

Answer

Fosco picture Fosco · Nov 21, 2011

Let me introduce you to the Levenshtein distance formula. It is awesome:

http://en.wikipedia.org/wiki/Levenshtein_distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.

Personally I used this in a healthcare setting, where Provider names were checked for duplicates. Using the Levenshtein process, we gave them a confidence rating and allowed them to determine if it was a true duplicate or something unique.