Fuzzy Search in Solr

Ravi · May 20, 2013

I am working on a fuzzy query in Solr that runs over a repository of data which may contain misspelled or abbreviated words. For example, the repository could have a name containing the word "Hlth" (an abbreviated form of "Health").

  1. If I do a fuzzy search for Name:'Health'~0.35 I get results with word 'Health' but not 'Hlth'.
  2. If I do a fuzzy search for Name:'Hlth'~0.35 I get records with names 'Health' and 'Hlth'.

I would like to get the first query to work. In my business use case, I have to use the clean data to query for all the misspelled or abbreviated words.

Could someone please shed some light on why fuzzy search #1 is not working, and whether there are other ways of achieving the same result?

Answer

Mysterion · Jun 4, 2014

You are using the fuzzy query incorrectly.

According to Mike McCandless (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html):

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
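To make the edit-distance idea in the quote concrete, here is a minimal sketch of plain Levenshtein distance in Python. It is not Lucene's implementation (Lucene uses automata and may also count a transposition as a single edit, which this sketch does not); the function names `edit_distance` and `fuzzy_matches` are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def fuzzy_matches(base: str, terms: list[str], max_edits: int) -> list[str]:
    """Return the terms within max_edits of base (illustrative helper)."""
    return [t for t in terms if edit_distance(base, t) <= max_edits]


# The examples from the quote are each one edit away from the base term.
print(edit_distance("mcandless", "mccandless"))  # -> 1 (insert c)
print(edit_distance("mcandless", "mcandles"))    # -> 1 (remove s)
print(edit_distance("mcandless", "mkandless"))   # -> 1 (replace c with k)

# The abbreviation from the question is two deletions away.
print(edit_distance("health", "hlth"))           # -> 2
```

Under this measure the distance is symmetric, so "health" to "hlth" and "hlth" to "health" are both 2.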

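So you need to write the query with an integer edit distance, like this: Health~2. "Hlth" is two deletions ("e" and "a") away from "Health", so a maximum edit distance of 2 covers it.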
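As a rough illustration of sending such a query over HTTP, the sketch below builds a request URL for Solr's `/select` handler. The host, port, and core name (`http://localhost:8983/solr/mycore`) and the helper name `fuzzy_select_url` are assumptions for the example; the field name `Name` comes from the question.

```python
from urllib.parse import urlencode


def fuzzy_select_url(base_url: str, field: str, term: str, max_edits: int = 2) -> str:
    """Build a Solr /select URL for a fuzzy term query (illustrative helper)."""
    # Lucene's fuzzy matching supports a maximum edit distance of 2.
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits must be 0, 1 or 2")
    query = f"{field}:{term}~{max_edits}"  # e.g. Name:Health~2
    params = urlencode({"q": query, "wt": "json"})
    return f"{base_url}/select?{params}"


# Hypothetical core URL; adjust to your own Solr instance.
url = fuzzy_select_url("http://localhost:8983/solr/mycore", "Name", "Health")
print(url)
```

The term is left unquoted here because, in the standard query parser, a double-quoted phrase followed by `~N` is a proximity search rather than a fuzzy one.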