Finding similar strings with PostgreSQL quickly

Question 1

Finding similar strings with PostgreSQL quickly

sql postgresql text similarity postgresql-performance

cdarwin · Jun 28, 2012 · Viewed 37.3k times · Source

Answer

Answer

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.

Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module.

The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 12). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.

Try instead:

SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later
-- SELECT set_limit(0.8);               -- for older versions

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;

Faster by orders of magnitude, but still slow.

pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option. See:

Query a parameter (postgresql.conf setting) like "max_connections"

You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).

As to your subsidiary question:

WHERE ... sim > 0.8

Does not work because you cannot refer to output columns in WHERE or HAVING clauses. That's according to the (a bit confusing, granted) SQL standard - which is handled rather loosely by certain other RDBMS.

On the other hand:

ORDER BY sim DESC

Works because output columns can be used in GROUP BY and ORDER BY. Details:

PostgreSQL reusing computation result in select query

Test case

I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).

CREATE TEMP table t AS 
SELECT some_col AS name FROM some_table LIMIT 1000;  -- real life test strings

First round of tests with GIN index:

CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops);  -- round1: with GIN index

Second round of tests with GIST index:

DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);

New query:

SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1
JOIN   t n2 ON n1.name <> n2.name
           AND n1.name % n2.name
ORDER  BY sim DESC;

GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms

Old query:

SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM   t n1, t n2
WHERE  n1.name != n2.name
AND    similarity(n1.name, n2.name) > 0.8
ORDER  BY sim DESC;

GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms

Otherwise identical results. Advice is good. And this is for just 1000 rows!

GIN or GiST?

GIN often provides superior read performance:

Difference between GiST and GIN index

But not in this particular case:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

Multicolumn index on 3 fields with heterogenous data types

Question 2

I need to create a ranking of similar strings in a table.

I have the following table

create table names (
name character varying(255)
);

Currently, I'm using pg_trgm module which offers the similarity function, but I have an efficiency problem. I created an index like the Postgres manual suggests:

CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);

and I'm executing the following query:

select (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
from names n1, names n2
where n1.name != n2.name and similarity(n1.name, n2.name) > .8
order by sim desc;

The query works, but is really slow when you have hundreds of names. Moreover, maybe I forgot a bit of SQL, but I don't understand why I cannot use the condition and sim > .8 without getting a "column sim doesn't exist" error.

I'd like any hint to make the query faster.

Finding similar strings with PostgreSQL quickly

Answer

Test case

GIN or GiST?

Related questions