SQLite: Efficient substring search in large table

Aletheios picture Aletheios · Jul 4, 2012 · Viewed 7.6k times · Source

I'm developing an Android application that has to perform substring search in a large table (about 500'000 entries with street and location names, so just a few words per entry).

CREATE TABLE Elements (elementID INTEGER, type INTEGER, name TEXT, data BLOB)

Note that only 20% of all entries contain strings in the "name" column.

Performing the following query almost takes 2 minutes:

SELECT elementID, name FROM Elements WHERE name LIKE %foo%

I now tried to use FTS3 in order to speed up the query. That was quite successful, query time decreased to 1 minute (surprisingly the database file size increased by only 5%, which is also quite good for my purpose).

The problem is, FTS3 seemingly doesn't support substring search, i.e. if I want to find "bar" in "foo bar" and "foobar", I only get "foo bar", although I need both results.

So actually I have two questions:

  1. Is it possible to further speed up the query? My goal is 30 seconds for the query, but I don't know if that's realistic...

  2. How can I get real substring search using FTS3?

Answer

Hai Feng Kao picture Hai Feng Kao · Apr 2, 2013

Solution 1: If you can make every character in your database as an individual word, you can use phrase queries to search the substring.

For example, assume "my_table" contains a single column "person":

person
------
John Doe
Jane Doe

you can change it to

person
------
J o h n D o e
J a n e D o e

To search the substring "ohn", use phrase query:

SELECT * FROM my_table WHERE person MATCH '"o h n"'

Beware that "JohnD" will match "John Doe", which may not be desired. To fix it, change the space character in the original string into something else.

For example, you can replace the space character with "$":

person
------
J o h n $ D o e
J a n e $ D o e

Solution 2: Following the idea of solution 1, you can make every character as an individual word with a custom tokenizer and use phrase queries to query substrings.

The advantage over solution 1 is that you don't have to add spaces in your data, which can unnecessarily increase the size of database.

The disadvantage is that you have to implement the custom tokenizer. Fortunately, I have one ready for you. The code is in C, so you have to figure out how to integrate it with your Java code.