CREATE TABLE product (
product_id SERIAL,
factory_key VARCHAR(60),
relevant BOOLEAN
);
Indexes:
"product_factory_key_key" btree (factory_key);
"product_factory_key_relevant_key" btree (factory_key, relevant) WHERE relevant = false;
"product_relevant_key" btree (relevant);
This is the problem query:
SELECT * FROM product WHERE factory_key='some_product_key' AND relevant=false LIMIT 10;
Explain analyze:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..23.06 rows=10 width=188) (actual time=2709.654..32252.961 rows=10 loops=1)
-> Seq Scan on product (cost=0.00..7366785.34 rows=3194759 width=188) (actual time=2709.634..32252.904 rows=10 loops=1)
Filter: ((NOT relevant) AND ((product_key)::text = 'some_product_key'::text))
Rows Removed by Filter: 449486
Total runtime: 32253.150 ms
(5 rows)
Problem:
I believe the planner chose a seq scan because so many rows match this factory (about 3.2 million rows, or roughly 3% of the table).
However, only an EXTREMELY small number of those rows are NOT relevant, and I am looking for NOT relevant rows, so a seq scan ends up being extremely expensive.
I have already created a composite index, product_factory_key_relevant_key, but the query is not taking advantage of it.
EDIT:
I tried to force Postgres to use the composite index product_factory_key_relevant_key:
SET enable_seqscan = off;
It is now using an index scan, but it is actually still slower than the seq scan (so I guess the planner was correct to choose a seq scan).
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..34.03 rows=10 width=188) (actual time=8.088..469974.692 rows=10 loops=1)
-> Index Scan using product_factory_key_relevant_key on product (cost=0.57..10689307.49 rows=3194776 width=188) (actual time=8.083..469974.655 rows=10 loops=1)
Index Cond: (relevant = false)
Filter: ((NOT relevant) AND ((product_key)::text = 'some_product_key'::text))
Rows Removed by Filter: 2205295
Total runtime: 469974.820 ms
(6 rows)
You can't ever force PostgreSQL to use a particular index, or totally prevent it from doing a seqscan.
However, you can tell it to avoid certain scan types if it possibly can, by setting the relevant enable_* parameters to off. These parameters are really intended only for debugging.
For testing, try:
SET enable_seqscan = off;
If Pg can possibly use an index scan (or something else), it will.
You may also want to consider:
SET random_page_cost = 1.1;
i.e. tell PostgreSQL that random I/O is only slightly more expensive than sequential I/O. This is usually true on systems with SSDs, or where most of the DB is cached in RAM. It will be more likely to choose an index in this case.
Of course, if your system's random I/O is actually more expensive, then using an index is likely to be slower.
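To see whether a cost setting actually changes the plan without affecting other sessions, you can experiment inside a transaction with SET LOCAL. This is a sketch against the product table and query from the question:

```sql
BEGIN;
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK.
SET LOCAL random_page_cost = 1.1;
EXPLAIN ANALYZE
SELECT * FROM product
WHERE factory_key = 'some_product_key' AND relevant = false
LIMIT 10;
ROLLBACK;
```

Compare the plan and timings with and without the setting before changing it globally.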
What you should really do is follow the advice you've already been given. Create the index with the most selective column first - if relevant
is less common, use that. You can even go a step further and create a partial index:
CREATE INDEX idx_name_blah ON tbl_name_blah (factory_key) WHERE (NOT relevant);
This index only contains rows where relevant = 'f'. It can only be used for queries where the planner knows relevant will be false. On the other hand, it will be a much smaller, faster index.
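For the partial index to be usable, the query's WHERE clause must imply the index predicate. A sketch against the product table from the question (the index name here is made up):

```sql
CREATE INDEX product_factory_key_not_relevant_idx
    ON product (factory_key)
    WHERE (NOT relevant);

-- The planner can prove "relevant = false" implies "NOT relevant",
-- so a query written either way can use this partial index:
SELECT * FROM product
WHERE factory_key = 'some_product_key'
  AND NOT relevant
LIMIT 10;
```

A query without a condition on relevant (or with relevant = true) cannot use this index at all.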
You might also have inaccurate statistics, causing PostgreSQL to think value frequencies are different than they really are for your table. EXPLAIN ANALYZE will help show this.
You can also just run ANALYZE my_table in case the stats are simply out of date; if that fixes it, increase the frequency with which autovacuum runs, because it's not keeping up.
If the stats are up to date but the planner is still making stats-based mis-estimations, increasing the statistics target for the table (see the manual) and re-analyzing can help.
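Raising the per-column statistics target can be sketched like this (1000 is an arbitrary example value; the default is typically 100):

```sql
-- Collect a larger sample for the skewed column, then refresh stats.
ALTER TABLE product ALTER COLUMN relevant SET STATISTICS 1000;
ANALYZE product;
```

A larger target makes ANALYZE sample more rows, which can sharply improve frequency estimates for heavily skewed columns like this one.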
Older PostgreSQL versions tend to be less smart about cost estimation, query optimization, statistics, query execution methods, and pretty much everything else.
If you're not on the latest version, upgrade.
For example, 9.2's index-only scans would allow you to create a partial index on (product_id, factory_key) WHERE (NOT relevant) and then run a query:
SELECT product_id, factory_key FROM my_table WHERE NOT relevant;
that should only read the index, with no heap access at all.
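Putting those two pieces together, a sketch of the covering partial index and matching query (table and index names follow the answer's placeholders):

```sql
CREATE INDEX my_table_covering_idx
    ON my_table (product_id, factory_key)
    WHERE (NOT relevant);

-- With an up-to-date visibility map, this can be satisfied entirely
-- from the index; EXPLAIN should then show "Index Only Scan".
SELECT product_id, factory_key
FROM my_table
WHERE NOT relevant;
```

Note the SELECT list contains only indexed columns; adding any other column forces heap access and defeats the index-only scan.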