Multiple DISTINCT ON clauses in PostgreSQL

Jivan picture Jivan · Feb 28, 2017 · Viewed 7.5k times · Source

Is it possible to select rows that are DISTINCT ON some separate, independent sets of columns?

Suppose I want all the rows which match the following conditions:

  • distinct on (name, birth)
  • distinct on (name, height)

So that, out of the following table, the rows marked with a red cross would not be distinct (with an indication of the failing clause):

name      birth    height
--------------------------
William    1976      1.82
James      1981      1.68
Mike       1976      1.68
Tom        1967      1.79
William    1976      1.74   ❌ (name, birth)
William    1981      1.82   ❌ (name, height)
Tom        1978      1.92
Mike       1963      1.68   ❌ (name, height)
Tom        1971      1.86
James      1981      1.77   ❌ (name, birth)
Tom        1971      1.89   ❌ (name, birth)

In the above example, if the DISTINCT ON clause had just been DISTINCT ON (name, birth, height), then all the rows would have been considered distinct.

Tried and didn't work:

  • SELECT DISTINCT ON (name, birth) (name, height) ...
  • SELECT DISTINCT ON (name, birth), (name, height) ...
  • SELECT DISTINCT ON ((name, birth), (name, height)) ...
  • SELECT DISTINCT ON (name, birth) AND (name, height) ...
  • SELECT DISTINCT ON (name, birth) AND ON (name, height) ...
  • SELECT DISTINCT ON (name, birth) DISTINCT ON (name, height) ...
  • SELECT DISTINCT ON (name, birth), DISTINCT ON (name, height) ...

Answer

Erwin Brandstetter picture Erwin Brandstetter · Feb 28, 2017

As commented, there is ambiguity in the question. The number of result rows can differ for every call. If you are satisfied with arbitrary results, @klin's solution is good enough. Else, you need to define requiremens more closely. Like:
distinct on (name, birth), pick smallest height first, then smallest ID as tiebreaker

Or:
distinct on (name, height), pick earliest birth first, then smallest ID as tiebreaker

Your table should have a primary key (or some way to identify rows uniquely):

CREATE TEMP TABLE tbl (
  tbl_id serial PRIMARY KEY
, name text
, birth int
, height numeric);

INSERT INTO tbl (name, birth, height)
VALUES
  ('William', 1976, 1.82)
, ('James',   1981, 1.68)
, ('Mike',    1976, 1.68)
, ('Tom',     1967, 1.79)
, ('William', 1976, 1.74)
, ('William', 1981, 1.82)
, ('Tom',     1978, 1.92)
, ('Mike',    1963, 1.68)
, ('Tom',     1971, 1.86)
, ('James',   1981, 1.77)
, ('Tom',     1971, 1.89);

Query:

SELECT DISTINCT ON (name, height) *
FROM  (
   SELECT DISTINCT ON (name, birth) *
   FROM   tbl
   ORDER  BY name, birth, height, tbl_id  -- pick smallest height, ID as tiebreaker
   ) sub
ORDER  BY name, height, birth, tbl_id;    -- pick earliest birth, ID as tiebreaker
 tbl_id |  name   | birth | height
--------+---------+-------+--------
      2 | James   |  1981 |   1.68
      8 | Mike    |  1963 |   1.68
      4 | Tom     |  1967 |   1.79
      9 | Tom     |  1971 |   1.86
      7 | Tom     |  1978 |   1.92
      5 | William |  1976 |   1.74
      6 | William |  1981 |   1.82
(7 rows)    -- !!!

A DISTINCT ON query without deterministic ORDER BY can return any arbitrary row from each set of dupes. Applied once, you still get a deterministic number of rows (with arbitrary picks). Applied repeatedly, the resulting number of rows is arbitrary, too. Related: