I understand how DISTINCT
works, but I don't understand DISTINCT ON (expression)
.
Take the first example from this screenshot:
How does the (a % 2)
part affect everything? Is it saying that if a % 2
evaluates to true, then return it, then continue doing so for all other tuples but only return if the returned value is distinct?
While the previous answer appears correct, I don't feel that it is particularly clear.
The snippet from the Official documentation for PostGreSQL is as follows...
DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).
The first point is that whatever you put in the ON ()
, must come first in the the ORDER BY
, for reasons that will hopefully shortly become clear...
SELECT DISTINCT ON (a) a, b, c FROM a_table ORDER BY a, b
The results are then filtered, so that for each of the distinct entities, only the first row is actually returned.
For example...
CREATE TABLE example (
id INT,
person_id INT,
address_id INT,
effective_date DATE
);
INSERT INTO
example (id, person_id, address_id, effective_date)
VALUES
(1, 2, 1, '2000-01-01'), -- Moved to first house
(5, 2, 2, '2004-08-19'), -- Went to uni
(9, 2, 1, '2007-06-12'), -- Moved back home
(2, 4, 3, '2007-05-18'), -- Moved to first house
(3, 4, 4, '2016-02-09') -- Moved to new house
;
SELECT DISTINCT ON (person_id)
*
FROM
example
ORDER BY
person_id,
effective_date DESC
;
This will order the results so that all the records for each person are contiguous, ordered from the most recent record to the oldest. Then, for each person, on the first record is returned. Thus, giving the most recent address for each person.
Step 1 : Apply the ORDER BY...
id | person_id | address_id | effective_date
----+-----------+------------+----------------
9 | 2 | 1 | '2007-06-12'
5 | 2 | 2 | '2004-08-19'
1 | 2 | 1 | '2000-01-01'
3 | 4 | 4 | '2016-02-09'
2 | 4 | 3 | '2007-05-18'
Step 2 : filter to just the first row per person_id
id | person_id | address_id | effective_date
----+-----------+------------+----------------
9 | 2 | 1 | '2007-06-12'
3 | 4 | 4 | '2016-02-09'
It is broadly equivalent to this following...
SELECT
*
FROM
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY person_id
ORDER BY effective_date DESC) AS person_address_ordinal
FROM
example
)
AS sorted_example
WHERE
person_address_ordinal = 1
As for the question about what (a % 2)
does, it's just a mathematical calculation for MOD(a, 2)
, so you could do the following...
CREATE TABLE example (
id INT,
score INT
);
INSERT INTO
example (id, score)
VALUES
(1, 2),
(2, 6),
(3, 5),
(4, 3),
(5, 4),
;
SELECT DISTINCT ON (id % 2)
*
FROM
example
ORDER BY
id % 2,
score DESC
;
That would give the highest score for the even id
s (where id % 2
equals 0
), then the highest score the odd id
s (where id % 2
equals 1
).
Step 1 : Apply the ORDER BY...
id | score
----+-------
2 | 6 -- id % 2 = 0
4 | 3 -- id % 2 = 0
3 | 5 -- id % 2 = 1
5 | 4 -- id % 2 = 1
1 | 2 -- id % 2 = 1
Step 2 : filter to just the first row per `id % 2`
id | score
----+-------
2 | 6 -- id % 2 = 0
3 | 5 -- id % 2 = 1