I'm trying to submit a query in Postgres that only returns distinct tuples. In my sample query, I do not want duplicate entries where an entry exists multiple times for a cluster_id/feed_id combination. If I do a simple:
select distinct on (cluster_info.cluster_id, feed_id)
cluster_info.cluster_id, num_docs, feed_id, url_time
from url_info
join cluster_info on (cluster_info.cluster_id = url_info.cluster_id)
where feed_id in (select pot_seeder from potentials)
and num_docs > 5 and url_time > '2012-04-16';
I get just that, but I'd also like to group according to num_docs
. So, when I do the following:
select distinct on (cluster_info.cluster_id, feed_id)
cluster_info.cluster_id, num_docs, feed_id, url_time
from url_info join cluster_info
on (cluster_info.cluster_id = url_info.cluster_id)
where feed_id in (select pot_seeder from potentials)
and num_docs > 5 and url_time > '2012-04-16'
order by num_docs desc;
I get the following error:
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on (cluster_info.cluster_id, feed_id) cluste...
I think I understand why I'm getting the error (cannot group by tuples unless I explicitly describe the group somehow) but how do I do that? Or if I am incorrect in my interpretation of the error, is there a way to accomplish my initial goal?
The leftmost ORDER BY
items cannot disagree with the items of the DISTINCT
clause. I quote the manual about DISTINCT
:
The
DISTINCT ON
expression(s) must match the leftmostORDER BY
expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within eachDISTINCT ON
group.
Try:
SELECT *
FROM (
SELECT DISTINCT ON (c.cluster_id, feed_id)
c.cluster_id, num_docs, feed_id, url_time
FROM url_info u
JOIN cluster_info c ON (c.cluster_id = u.cluster_id)
WHERE feed_id IN (SELECT pot_seeder FROM potentials)
AND num_docs > 5
AND url_time > '2012-04-16'
ORDER BY c.cluster_id, feed_id, num_docs, url_time
-- first columns match DISTINCT
-- the rest to pick certain values for dupes
-- or did you want to pick random values for dupes?
) x
ORDER BY num_docs DESC;
Or use GROUP BY
:
SELECT c.cluster_id
, num_docs
, feed_id
, url_time
FROM url_info u
JOIN cluster_info c ON (c.cluster_id = u.cluster_id)
WHERE feed_id IN (SELECT pot_seeder FROM potentials)
AND num_docs > 5
AND url_time > '2012-04-16'
GROUP BY c.cluster_id, feed_id
ORDER BY num_docs DESC;
If c.cluster_id, feed_id
are the primary key columns of all (both in this case) tables that you include columns from in the SELECT
list, then this just works with PostgreSQL 9.1 or later.
Else you need to GROUP BY
the rest of the columns or aggregate or provide more information.