Find overlapping date ranges in PostgreSQL

aocferreira picture aocferreira · Dec 19, 2010 · Viewed 18.6k times · Source

Is this correct?

SELECT * 
FROM   contract 
JOIN   team USING (name_team) 
JOIN   player USING(name_player) 
WHERE  name_team = ? 
AND    DATE_PART('YEAR',date_join)>= ? 
AND    DATE_PART('YEAR',date_leave)<= ?

My table contract has the player name, team name and the dates when he joined and left the club.
I want to make a function listing all players that were on the team in specific years.
The above query doesn't seem to be working ...

Answer

Erwin Brandstetter picture Erwin Brandstetter · Mar 9, 2013

The currently accepted answer does not answer the question. And it is wrong in principle. a BETWEEN x AND y translates to:

a >= x AND a <= y

Including the upper bound, while people typically need to exclude it:

a >= x AND a < y

With dates you can easily adjust. For the year 2009 use '2009-12-31' as upper bound.
But it's not as simple with timestamps which allow fractional digits. Modern Postgres versions use an 8-byte integer internally to store up to 6 fractional seconds (µs resolution). Knowing this we could still make it work, but that's not intuitive and depends on an implementation detail. Bad idea.

Moreover, a BETWEEN x AND y does not find overlapping ranges. We need:

b >= x AND a < y

And players that never left are not considered, yet.

Proper answer

Assuming the year 2009, I'll rephrase the question without changing its meaning:

"Find all players of a given team who joined before 2010 and did not leave before 2009."

Basic query:

SELECT p.*
FROM   team     t
JOIN   contract c USING (name_team) 
JOIN   player   p USING (name_player) 
WHERE  t.name_team = ? 
AND    c.date_join  <  date '2010-01-01'
AND    c.date_leave >= date '2009-01-01';

But there is more:

If referential integrity is enforced with FK constraints, the table team itself is just noise in the query and can be removed.

While the same player can leave and rejoin the same team, we also need to fold possible duplicates, for instance with DISTINCT.

And we may need to provide for a special case: players that never left. Assuming those players have NULL in date_leave.

"A player that is not known to have left is assumed to be playing for the team to this day."

Refined query:

SELECT DISTINCT p.* 
FROM   contract c
JOIN   player   p USING (name_player) 
WHERE  c.name_team = ? 
AND    c.date_join  <  date '2010-01-01'
AND   (c.date_leave >= date '2009-01-01' OR c.date_leave IS NULL);

Operator precedence works against us, AND binds before OR. We need parentheses.

Related answer with optimized DISTINCT (if duplicates are common):

Typically, names of natural persons aren't unique and a surrogate primary key is used. But, obviously, name_player is the primary key of player. If all you need is player names we don't need the table player in the query, either:

SELECT DISTINCT name_player 
FROM   contract
WHERE  name_team = ? 
AND    date_join  <  date '2010-01-01'
AND   (date_leave >= date '2009-01-01' OR date_leave IS NULL);

SQL OVERLAPS operator

The manual:

OVERLAPS automatically takes the earlier value of the pair as the start. Each time period is considered to represent the half-open interval start <= time < end, unless start and end are equal in which case it represents that single time instant.

To take care of potential NULL values, COALESCE seems easiest:

SELECT DISTINCT name_player 
FROM   contract
WHERE  name_team = ? 
AND    (date_join, COALESCE(date_leave, CURRENT_DATE)) OVERLAPS
       (date '2009-01-01', date '2010-01-01');  -- upper bound excluded

Range type with index support

In Postgres 9.2 or later you can also operate with actual range types:

SELECT DISTINCT name_player 
FROM   contract
WHERE  name_team = ? 
AND    daterange(date_join, date_leave) &&
       daterange '[2009-01-01,2010-01-01)';  -- upper bound excluded

Range types add some overhead and occupy more space. 2 x date = 8 bytes; 1 x daterange = 14 bytes on disk or 17 bytes in RAM. But in combination with the overlap operator && the query can be supported with a GiST index.

Also, no need to special-case NULL values. NULL means "open range" in a range type - exactly what we need. The table definition doesn't even have to change: we can create the range type on the fly - and support the query with a matching expression index:

CREATE INDEX mv_stock_dr_idx ON mv_stock USING gist (daterange(date_join, date_leave));

Related: