I have a data set of email addresses and dates that those email addresses were added to a table. There can be multiple entries of an email address for various different dates. For example, if I have the data set below. I would be looking to get the date and count of distinct emails that we have between said date and 3 days ago.
Date | email
-------+----------------
1/1/12 | [email protected]
1/1/12 | [email protected]
1/1/12 | [email protected]
1/2/12 | [email protected]
1/2/12 | [email protected]
1/3/12 | [email protected]
1/4/12 | [email protected]
1/5/12 | [email protected]
1/5/12 | [email protected]
1/6/12 | [email protected]
1/6/12 | [email protected]
1/6/12 | [email protected]
Result set would look something like this if we use a date period of 3
date | count(distinct email)
-------+------
1/1/12 | 3
1/2/12 | 3
1/3/12 | 3
1/4/12 | 3
1/5/12 | 1
1/6/12 | 2
I can get a distinct count of a date range using the query below, but looking to get a count of a range by day so I do not have to manually update the range for hundreds of dates.
select test.date, count(distinct test.email)
from test_table as test
where test.date between '2012-01-01' and '2012-05-08'
group by test.date;
Test case:
CREATE TABLE tbl (date date, email text);
INSERT INTO tbl VALUES
('2012-01-01', '[email protected]')
, ('2012-01-01', '[email protected]')
, ('2012-01-01', '[email protected]')
, ('2012-01-02', '[email protected]')
, ('2012-01-02', '[email protected]')
, ('2012-01-03', '[email protected]')
, ('2012-01-04', '[email protected]')
, ('2012-01-05', '[email protected]')
, ('2012-01-05', '[email protected]')
, ('2012-01-06', '[email protected]')
, ('2012-01-06', '[email protected]')
, ('2012-01-06', '[email protected]`')
;
Query - returns only days where an entry exists in tbl
:
SELECT date
,(SELECT count(DISTINCT email)
FROM tbl
WHERE date BETWEEN t.date - 2 AND t.date -- period of 3 days
) AS dist_emails
FROM tbl t
WHERE date BETWEEN '2012-01-01' AND '2012-01-06'
GROUP BY 1
ORDER BY 1;
Or - return all days in the specified range, even if there are no rows for the day:
SELECT date
,(SELECT count(DISTINCT email)
FROM tbl
WHERE date BETWEEN g.date - 2 AND g.date
) AS dist_emails
FROM (SELECT generate_series(timestamp '2012-01-01'
, timestamp '2012-01-06'
, interval '1 day')::date) AS g(date);
db<>fiddle here
Result:
day | dist_emails
-----------+------------
2012-01-01 | 3
2012-01-02 | 3
2012-01-03 | 3
2012-01-04 | 3
2012-01-05 | 1
2012-01-06 | 2
This sounded like a job for window functions at first, but I did not find a way to define the suitable window frame. Also, per documentation:
Aggregate window functions, unlike normal aggregate functions, do not allow
DISTINCT
orORDER BY
to be used within the function argument list.
So I solved it with correlated subqueries instead. I guess that's the smartest way.
BTW, "between said date and 3 days ago" would be a period of 4 days. Your definition is contradictory there.
Slightly shorter, but slower for few days:
SELECT g.date, count(DISTINCT email) AS dist_emails
FROM (SELECT generate_series(timestamp '2012-01-01'
, timestamp '2012-01-06'
, interval '1 day')::date) AS g(date)
LEFT JOIN tbl t ON t.date BETWEEN g.date - 2 AND g.date
GROUP BY 1
ORDER BY 1;
Related: