"to_tsquery" on tsvector yields different results when using "simple" and "english"?

phatskat · May 25, 2012 · Viewed 8.7k times

I've been enlisted to help on a project and I'm diving back into PostgreSQL after not working with it for several years. Lack of use aside, I've never used tsvector fields before and now find myself facing a bug based on them. I read the documentation on the field type and its purpose, but I'm having a hard time digging up documentation on how 'simple' differs from 'english' as the first parameter to to_tsquery().

Example

> SELECT to_tsvector('mortgag') @@ to_tsquery('simple', 'mortgage')
?column? 
----------
 f
(1 row)

> SELECT to_tsvector('mortgag') @@ to_tsquery('english', 'mortgage')
?column? 
----------
 t
(1 row)

I would think they should both return true, but obviously the first does not - why?

Answer

mu is too short · May 25, 2012

Full text search (FTS) uses dictionaries to normalize the text:

12.6. Dictionaries

Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.

So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so city and cities, for example, will match even though they're different words.
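You can probe a dictionary directly with ts_lexize to watch this normalization happen; for instance, city and cities both reduce to the same lexeme under the english_stem dictionary (output from a psql session, which should look roughly like this):

=> select ts_lexize('english_stem', 'city'), ts_lexize('english_stem', 'cities');
 ts_lexize | ts_lexize 
-----------+-----------
 {citi}    | {citi}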

Let us look at some output from ts_debug and see what's going on with the dictionaries:

=> select * from ts_debug('english', 'mortgage');
   alias   |   description   |  token   |  dictionaries  |  dictionary  |  lexemes  
-----------+-----------------+----------+----------------+--------------+-----------
 asciiword | Word, all ASCII | mortgage | {english_stem} | english_stem | {mortgag}

=> select * from ts_debug('simple', 'mortgage');
   alias   |   description   |  token   | dictionaries | dictionary |  lexemes   
-----------+-----------------+----------+--------------+------------+------------
 asciiword | Word, all ASCII | mortgage | {simple}     | simple     | {mortgage}

Notice that simple uses the simple dictionary whereas english uses the english_stem dictionary.

The simple dictionary:

operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.

The simple dictionary just throws out stop words, downcases, and that's about it. We can see its simplicity ourselves:

=> select to_tsquery('simple', 'Mortgage'), to_tsquery('simple', 'Mortgages');
 to_tsquery | to_tsquery  
------------+-------------
 'mortgage' | 'mortgages'

The simple dictionary is too simple to even handle simple plurals.
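Stop-word handling differs between the two configurations as well. The built-in simple configuration ships without a stop-word file, so common words like "the" survive, while the english configuration discards them (and stems the rest). A quick check in psql should show something like:

=> select to_tsvector('simple', 'The Mortgage'), to_tsvector('english', 'The Mortgage');
     to_tsvector      | to_tsvector 
----------------------+-------------
 'mortgage':2 'the':1 | 'mortgag':2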

So what is this english_stem dictionary all about? The "stem" suffix is a giveaway: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:

12.6.6. Snowball Dictionary

The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. [...] Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language.

And just below that we see the english_stem dictionary:

CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

So the english_stem dictionary stems words and we can see that happen:

=> select to_tsquery('english', 'Mortgage'), to_tsquery('english', 'Mortgages');
 to_tsquery | to_tsquery 
------------+------------
 'mortgag'  | 'mortgag'

Executive summary: 'simple' does simple-minded literal matching, while 'english' applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag, and that gives you your match.
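One way to avoid this kind of mismatch is to pass the same configuration to both to_tsvector and to_tsquery; your original to_tsvector('mortgag') call used the database's default configuration (presumably english). With matching configurations, both sides stem to the same lexeme:

=> select to_tsvector('english', 'mortgage') @@ to_tsquery('english', 'mortgage');
 ?column? 
----------
 t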