SQL: What does "DISTINCT ON (expression)" do?

JohnSmithy1266 picture JohnSmithy1266 · Oct 4, 2017 · Viewed 17.2k times · Source

I understand how DISTINCT works, but I don't understand DISTINCT ON (expression).

Take the first example from this screenshot:

enter image description here

How does the (a % 2) part affect everything? Is it saying that if a % 2 evaluates to true, then return it, then continue doing so for all other tuples but only return if the returned value is distinct?

Answer

MatBailie picture MatBailie · Oct 4, 2017

While the previous answer appears correct, I don't feel that it is particularly clear.

The snippet from the Official documentation for PostGreSQL is as follows...

DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).

The first point is that whatever you put in the ON (), must come first in the the ORDER BY, for reasons that will hopefully shortly become clear...

SELECT DISTINCT ON (a) a, b, c FROM a_table ORDER BY a, b

The results are then filtered, so that for each of the distinct entities, only the first row is actually returned.


For example...

CREATE TABLE example (
    id               INT,
    person_id        INT,
    address_id       INT,
    effective_date   DATE
);

INSERT INTO
    example (id, person_id, address_id, effective_date)
VALUES
    (1, 2, 1, '2000-01-01'),  -- Moved to first house
    (5, 2, 2, '2004-08-19'),  -- Went to uni
    (9, 2, 1, '2007-06-12'),  -- Moved back home

    (2, 4, 3, '2007-05-18'),  -- Moved to first house
    (3, 4, 4, '2016-02-09')   -- Moved to new house
;

SELECT DISTINCT ON (person_id)
    *
FROM
    example
ORDER BY
    person_id,
    effective_date DESC
;

This will order the results so that all the records for each person are contiguous, ordered from the most recent record to the oldest. Then, for each person, on the first record is returned. Thus, giving the most recent address for each person.

Step 1 : Apply the ORDER BY...

 id | person_id | address_id | effective_date
----+-----------+------------+----------------
  9 |      2    |      1     |  '2007-06-12'
  5 |      2    |      2     |  '2004-08-19'
  1 |      2    |      1     |  '2000-01-01'
  3 |      4    |      4     |  '2016-02-09'
  2 |      4    |      3     |  '2007-05-18'

Step 2 : filter to just the first row per person_id

 id | person_id | address_id | effective_date
----+-----------+------------+----------------
  9 |      2    |      1     |  '2007-06-12'
  3 |      4    |      4     |  '2016-02-09'


It is broadly equivalent to this following...

SELECT
    *
FROM
(
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY person_id
                               ORDER BY effective_date DESC)  AS person_address_ordinal
    FROM
        example
)
    AS sorted_example
WHERE
    person_address_ordinal = 1


As for the question about what (a % 2) does, it's just a mathematical calculation for MOD(a, 2), so you could do the following...

CREATE TABLE example (
    id               INT,
    score            INT
);

INSERT INTO
    example (id, score)
VALUES
    (1, 2),
    (2, 6),
    (3, 5),
    (4, 3),
    (5, 4),
;

SELECT DISTINCT ON (id % 2)
    *
FROM
    example
ORDER BY
    id % 2,
    score DESC
;

That would give the highest score for the even ids (where id % 2 equals 0), then the highest score the odd ids (where id % 2 equals 1).

Step 1 : Apply the ORDER BY...

 id | score
----+-------

  2 |   6     -- id % 2 = 0
  4 |   3     -- id % 2 = 0

  3 |   5     -- id % 2 = 1
  5 |   4     -- id % 2 = 1
  1 |   2     -- id % 2 = 1

Step 2 : filter to just the first row per `id % 2`

 id | score
----+-------
  2 |   6     -- id % 2 = 0
  3 |   5     -- id % 2 = 1