Here's a puzzle...
I have two databases of the same 50000+ electronic products and I want to match products in one database to those in the other. However, the product names are not always identical. I've tried using the Levenshtein distance for measuring the string similarity however this hasn't worked. For example,
-LG 42CS560 42-Inch 1080p 60Hz LCD HDTV
-LG 42 Inch 1080p LCD HDTV
These items are the same, yet their product names vary quite a lot.
On the other hand...
-LG 42 Inch 1080p LCD HDTV
-LG 50 Inch 1080p LCD HDTV
These are different products with very similar product names.
How should I tackle this problem?
My first thought is to try to parse the names into a description of features (company LG
, size 42 Inch
, resolution 1080p
, type LCD HDTV
). Then you can match these descriptions against each other for compatibility; it's okay to omit a product number but bad to have different sizes. Simple are-the-common-attributes-compatible might be enough, or you might have to write / learn rules about how much different attributes are allowed to differ and so on.
Depending on how many different kinds of products you have and how different the listed names are, I might actually start by manually defining a set of attributes and possibly even just adding specific words / regexes to match them, iteratively seeing what isn't been parsed so far and adding rules for that. I'd imagine there's not a lot of ambiguity in terms of one vocabulary item possibly belonging to multiple attributes, though without seeing your database I guess I don't know.
If that's not going to be feasible, this extraction is kind of analogous to semi-supervised part-of-speech tagging. It's somewhat different, though, in that I imagine the vocabulary is much more limited than typical parsing, and in that the space of product names is more heirarchical: the resolution
tag only applies to certain kinds of products. I'm not very familiar with that literature; there might be some ideas you could use.