How does a full text search server like Sphinx work?

0x4a6f4672 picture 0x4a6f4672 · Apr 24, 2012 · Viewed 8.4k times · Source

Can anyone explain in simple words how a full text server like Sphinx works? In plain SQL, one would use SQL queries like this to search for certain keywords in texts:

select * from items where name like '%keyword%';

But in the configuration files generated by various Sphinx plugins I can not see any queries like this at all. They contain instead SQL statements like the following, which seem to divide the search into distinct ID groups:

SELECT (items.id * 5 + 1) AS id, ... 
       WHERE items.id >= $start AND items.id <= $end 
       GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)

It it possible to explain in simple words how these queries work and how they are generated?

Answer

Yavar picture Yavar · Apr 24, 2012

Inverted Index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index

Now when you run a sql query through sphinx, it fetches the data from the database and constructs the inverted index which in Sphinx is like a hashtable where the key is a 32 bit integer which is calculated using crc32(word) and the value is the list of documentID's having that word.

This makes it super fast.

Now you can argue that even a database can create a similar structure for making the searches superfast. However the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [From MySQL Performance Blog]. Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form" but the index will be completely be de-normalized and contain mostly just the data needed to be searched.

Another possible reason is generally databases suffer from internal fragmentation, they need to perform too much semi-random I/O tasks on huge requests.

What that means is, for example, considering the index architecture of a databases, the query leads to the indexes which in turn lead to the data. If the data to recover is widely spread, the result will take long and that seems to be what happens in databases.

EDIT: Also please see the source code in cpp files like searchd.cpp etc for the real internal implementation, I think you are just seeing the PHP wrappers.