Following on from this answer I want to know what the best way to use PostgreSQL's built-in full text search is if I want to sort by rank, and limit to only matching queries.
Let's assume a very simple table.
CREATE TABLE pictures (
id SERIAL PRIMARY KEY,
title varchar(300),
...
)
or whatever. Now I want to search the title
field. First I create an index:
CREATE INDEX pictures_title ON pictures
USING gin(to_tsvector('english', title));
Now I want to search for 'small dog'
. This works:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), 'small dog'
) AS score
FROM pictures
ORDER BY score DESC
But what I really want is this:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures
WHERE to_tsvector('english', pictures.title) @@ to_tsquery('small dog')
ORDER BY score DESC
Or alternatively this (which doesn't work - can't use score
in the WHERE
clause):
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures WHERE score > 0
ORDER BY score DESC
What's the best way to do this? My questions are many-fold:
to_tsvector(...)
will it call that twice, or is it smart enough to cache the results somehow?to_ts...
function calls?score
in the WHERE
clause at all?score > 0
or use the @@
thing?The use of the @@
operator will utilize the full text GIN index, while the test for score > 0
would not.
I created a table as in the Question, but added a column named title_tsv
:
CREATE TABLE test_pictures (
id BIGSERIAL,
title text,
title_tsv tsvector
);
CREATE INDEX ix_pictures_title_tsv ON test_pictures
USING gin(title_tsv);
I populated the table with some test data:
INSERT INTO test_pictures(title, title_tsv)
SELECT T.data, to_tsvector(T.data)
FROM some_table T;
Then I ran the previously accepted answer with explain analyze
:
EXPLAIN ANALYZE
SELECT score, id, title
FROM (
SELECT ts_rank_cd(P.title_tsv, to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
) S
WHERE score > 0
ORDER BY score DESC;
And got the following. Note the execution time of 5,015 ms
QUERY PLAN |
----------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=274895.48..323298.03 rows=414850 width=60) (actual time=5010.844..5011.330 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=273895.46..274414.02 rows=207425 width=60) (actual time=4994.539..4994.555 rows=492 loops=3) |
Sort Key: (ts_rank_cd(p.title_tsv, to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 131kB |
-> Parallel Seq Scan on test_pictures p (cost=0.00..247776.02 rows=207425 width=60) (actual time=17.672..4993.997 rows=492 loops=3) |
Filter: (ts_rank_cd(title_tsv, to_tsquery('address & shipping'::text)) > '0'::double precision) |
Rows Removed by Filter: 497296 |
Planning time: 0.159 ms |
Execution time: 5015.664 ms |
Now compare that with the @@
operator:
EXPLAIN ANALYZE
SELECT ts_rank_cd(to_tsvector(P.title), to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
WHERE P.title_tsv @@ to_tsquery('address & shipping')
ORDER BY score DESC;
And the results coming in with an execution time of about 29 ms:
QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=13884.42..14288.35 rows=3462 width=60) (actual time=26.472..26.942 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=12884.40..12888.73 rows=1731 width=60) (actual time=17.507..17.524 rows=492 loops=3) |
Sort Key: (ts_rank_cd(to_tsvector(title), to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 171kB |
-> Parallel Bitmap Heap Scan on test_pictures p (cost=72.45..12791.29 rows=1731 width=60) (actual time=1.781..17.268 rows=492 loops=3) |
Recheck Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Heap Blocks: exact=625 |
-> Bitmap Index Scan on ix_pictures_title_tsv (cost=0.00..71.41 rows=4155 width=0) (actual time=3.765..3.765 rows=1477 loops=1) |
Index Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Planning time: 0.214 ms |
Execution time: 28.995 ms |
As you can see in the execution plan, the index ix_pictures_title_tsv
was used in the second query, but not in the first one, making the query with the @@
operator a whopping 172 times faster!