Symbols in query-string for elasticsearch

Arnob picture Arnob · Apr 25, 2013 · Viewed 12k times · Source

I have "documents" (activerecords) with an attribute called deviations. The attribute has values like "Bin X" "Bin $" "Bin q" "Bin %" etc.

I am trying to use tire/elasticsearch to search the attribute. I am using the whitespace analyzer to index the deviation attribute. Here is my code for creating the indexes:

settings :analysis => {
    :filter  => {
      :ngram_filter => {
        :type => "nGram",
        :min_gram => 2,
        :max_gram => 255
      },
      :deviation_filter => {
        :type => "word_delimiter",
        :type_table => ['$ => ALPHA']
      }
    },
    :analyzer => {
      :ngram_analyzer => {
        :type  => "custom",
        :tokenizer  => "standard",
        :filter  => ["lowercase", "ngram_filter"]
      },
      :deviation_analyzer => {
        :type => "custom",
        :tokenizer => "whitespace",
        :filter => ["lowercase"]
      }
    }
  } do
    mapping do
      indexes :id, :type => 'integer'
      [:equipment, :step, :recipe, :details, :description].each do |attribute|
        indexes attribute, :type => 'string', :analyzer => 'ngram_analyzer'
      end
      indexes :deviation, :analyzer => 'whitespace'
    end
  end

The search seems to work fine when the query string contains no special characters. For example Bin X will return only those records that have the words Bin AND X in them. However, searching for something like Bin $ or Bin % shows all results that have the word Bin almost ignoring the symbol (results with the symbol do show up higher in the search that results without).

Here is the search method I have created

def self.search(params)
    tire.search(load: true) do
      query { string "#{params[:term].downcase}:#{params[:query]}", default_operator: "AND" }
        size 1000
    end
end

and here is how I am building the search form:

<div>
    <%= form_tag issues_path, :class=> "formtastic issue", method: :get do %>
        <fieldset class="inputs">
        <ol>
            <li class="string input medium search query optional stringish inline">
                <% opts = ["Description", "Detail","Deviation","Equipment","Recipe", "Step"] %>
                <%= select_tag :term, options_for_select(opts, params[:term]) %>
                <%= text_field_tag :query, params[:query] %>
                <%= submit_tag "Search", name: nil, class: "btn" %>
            </li>
        </ol>
        </fieldset>
    <% end %>
</div>

Answer

Robert Kajic picture Robert Kajic · May 8, 2013

You can sanitize your query string. Here is a sanitizer that works for everything that I've tried throwing at it:

def sanitize_string_for_elasticsearch_string_query(str)
  # Escape special characters
  # http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html#Escaping Special Characters
  escaped_characters = Regexp.escape('\\/+-&|!(){}[]^~*?:')
  str = str.gsub(/([#{escaped_characters}])/, '\\\\\1')

  # AND, OR and NOT are used by lucene as logical operators. We need
  # to escape them
  ['AND', 'OR', 'NOT'].each do |word|
    escaped_word = word.split('').map {|char| "\\#{char}" }.join('')
    str = str.gsub(/\s*\b(#{word.upcase})\b\s*/, " #{escaped_word} ")
  end

  # Escape odd quotes
  quote_count = str.count '"'
  str = str.gsub(/(.*)"(.*)/, '\1\"\3') if quote_count % 2 == 1

  str
end

params[:query] = sanitize_string_for_elasticsearch_string_query(params[:query])