Word comparison algorithm

disc0dancer picture disc0dancer · Jan 23, 2009 · Viewed 22.2k times · Source

I am doing a CSV Import tool for the project I'm working on. The client needs to be able to enter the data in excel, export them as CSV and upload them to the database. For example I have this CSV record:

   1,   John Doe,     ACME Comapny   (the typo is on purpose)

Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting. I plan to do this by comparing the company names in the database with the company names in the CSV. the comparison should return 0 if the strings are exactly the same, and return some value that gets bigger as the strings get more different, but strcmp doesn't cut it here because:

"Acme Company" and "Acme Comapny" should have a very small difference index, but "Acme Company" and "Cmea Mpnyaco" should have a very big difference index Or "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different. Also, "Acme Company" and "Company Acme" should return 0.

So if the client makes a type while entering data, i could prompt him to choose the name he most probably wanted to insert.

Is there a known algorithm to do this, or maybe we can invent one :) ?

Answer

MattK picture MattK · Jan 23, 2009

You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.

This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.