Lucene query: bla~* (match words that start with something fuzzy), how?

Pimin Konstantin Kefaloukos picture Pimin Konstantin Kefaloukos · Apr 13, 2010 · Viewed 13.8k times · Source

In the Lucene query syntax I'd like to combine * and ~ in a valid query similar to: bla~* //invalid query

Meaning: Please match words that begin with "bla" or something similar to "bla".

Update: What I do now, works for small input, is use the following (snippet of SOLR schema):

<fieldtype name="text_ngrams" class="solr.TextField">
  <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

In case you don't use SOLR, this does the following.

Indextime: Index data by creating a field containing all prefixes of my (short) input.

Searchtime: only use the ~ operator, as prefixes are explicitly present in the index.

Answer

Robert Muir picture Robert Muir · May 8, 2010

in the development trunk of lucene (not yet a release), there is code to support use cases like this, via AutomatonQuery. Warning: the APIs might/will change before its released, but it gives you the idea.

Here is an example for your case:

// a term representative of the query, containing the field. 
// the term text is not so important and only used for toString() and such
Term term = new Term("yourfield", "bla~*");

// builds a DFA that accepts all strings within an edit distance of 2 from "bla"
Automaton fuzzy = new LevenshteinAutomata("bla").toAutomaton(2);

// concatenate this DFA with another DFA equivalent to the "*" operator
Automaton fuzzyPrefix = BasicOperations.concatenate(fuzzy, BasicAutomata.makeAnyString());

// build a query, search with it to get results.
AutomatonQuery query = new AutomatonQuery(term, fuzzyPrefix);