DISTINCT ON in an aggregate function in postgres

Migwell picture Migwell · May 6, 2015 · Viewed 21.4k times · Source

For my problem, we have a schema whereby one photo has many tags and also many comments. So if I have a query where I want all the comments and tags, it will multiply the rows together. So if one photo has 2 tags and 13 comments, I get 26 rows for that one photo:

SELECT
        tag.name, 
        comment.comment_id
FROM
        photo
        LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
        LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
        LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id

enter image description here

That's fine for most things, but it means that if I GROUP BY and then json_agg(tag.*), I get 13 copies of the first tag, and 13 copies of the second tag.

SELECT json_agg(tag.name) as tags
FROM
        photo
        LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
        LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
        LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id

enter image description here

Instead I want an array that is only 'suburban' and 'city', like this:

 [
      {"tag_id":1,"name":"suburban"}, 
      {"tag_id":2,"name":"city"}
 ]

I could json_agg(DISTINCT tag.name), but this will only make an array of tag names, when I want the entire row as json. I would like to json_agg(DISTINCT ON(tag.name) tag.*), but that's not valid SQL apparently.

How then can I simulate DISTINCT ON inside an aggregate function in Postgres?

Answer

Paul A Jungwirth picture Paul A Jungwirth · Mar 12, 2016

Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregrate functions like COUNT and SUM if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:

WITH tags AS (
  SELECT  photo.photo_id, json_agg(row_to_json(tag.*)) AS tags
  FROM    photo
  LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id
  LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
  GROUP BY photo.photo_id
),
comments AS (
  SELECT  photo.photo_id, json_agg(row_to_json(comment.*)) AS comments
  FROM    photo
  LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
  GROUP BY photo.photo_id
)
SELECT  COALESCE(tags.photo_id, comments.photo_id) AS photo_id,
        tags.tags,
        comments.comments
FROM    tags
FULL OUTER JOIN comments
ON      tags.photo_id = comments.photo_id

EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:

SELECT  photo.photo_id,
        to_json(array_agg(DISTINCT tag.*)) AS tags,
        to_json(array_agg(DISTINCT comment.*)) AS comments
FROM    photo
LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id
LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id