I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS" should all match "Canon PowerShot A20 IS". I've worked with levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes etc), which works to some extent, but not well enough unfortunately.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (ok, soundex might help to disinguish the T and R in this case, but the names might as well be 400T and 400R), the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be a 100% precise, my goal is to automatically match around 80% of the names with a high confidence.
Any ideas or references is much appreciated
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?