Using outer query result in a subquery in postgresql

brad picture brad · Nov 16, 2011 · Viewed 9.7k times · Source

I have two tables points and contacts and I'm trying to get the average points.score per contact grouped on a monthly basis. Note that points and contacts aren't related, I just want the sum of points created in a month divided by the number of contacts that existed in that month.

So, I need to sum points grouped by the created_at month, and I need to take the count of contacts FOR THAT MONTH ONLY. It's that last part that's tricking me up. I'm not sure how I can use a column from an outer query in the subquery. I tried something like this:

SELECT SUM(score) AS points_sum,
  EXTRACT(month FROM created_at) AS month,
  date_trunc('MONTH', created_at) + INTERVAL '1 month' AS next_month,
  (SELECT COUNT(id) FROM contacts WHERE contacts.created_at <= next_month) as contact_count
FROM points
GROUP BY month, next_month
ORDER BY month

So, I'm extracting the actual month that my points are being summed, and at the same time, getting the beginning of the next_month so that I can say "Get me the count of contacts where their created at is < next_month"

But it complains that column next_month doesn't exist This is understandable as the subquery knows nothing about the outer query. Qualifying with points.next_month doesn't work either.

So can someone point me in the right direction of how to achieve this?

Tables:

Points

score | created_at
10    | "2011-11-15 21:44:00.363423"
11    | "2011-10-15 21:44:00.69667" 
12    | "2011-09-15 21:44:00.773289"
13    | "2011-08-15 21:44:00.848838"
14    | "2011-07-15 21:44:00.924152"

Contacts

id | created_at
6  | "2011-07-15 21:43:17.534777"
5  | "2011-08-15 21:43:17.520828"
4  | "2011-09-15 21:43:17.506452"
3  | "2011-10-15 21:43:17.491848"
1  | "2011-11-15 21:42:54.759225"

sum, month and next_month (without the subselect)

sum | month | next_month
14  | 7     | "2011-08-01 00:00:00"
13  | 8     | "2011-09-01 00:00:00"
12  | 9     | "2011-10-01 00:00:00"
11  | 10    | "2011-11-01 00:00:00"
10  | 11    | "2011-12-01 00:00:00"

Answer

Erwin Brandstetter picture Erwin Brandstetter · Nov 16, 2011

Edit

Now with running sum of contacts. My first draft used new contacts per month, which is obviously not what OP wants.

WITH c AS (
    SELECT created_at
          ,count(id) OVER (order BY created_at) AS ct
    FROM   contacts
    ), p AS (
    SELECT date_trunc('month', created_at) AS month
          ,sum(score) AS points_sum
    FROM   points
    GROUP  BY 1
    )
SELECT p.month
      ,EXTRACT(month FROM p.month) AS month_nr
      ,p.points_sum
      ,( SELECT c.ct
         FROM   c
         WHERE  c.created_at < (p.month + interval '1 month')
         ORDER  BY c.created_at DESC
         LIMIT  1) AS contacts
FROM   p
ORDER  BY 1
  • This works for any number of months across the years.
  • Assumes that no month is missing in the table points. If you want all months, including missing ones in points, generate a list of months with generate_series() and LEFT JOIN to it.
  • Build a running sum in a CTE with a window function.
  • Both CTE are not strictly necessary - for performance and simplification only.
  • Get contacts_count in a subselect.

Your original form of the query could work like this:

SELECT month
      ,EXTRACT(month FROM month) AS month_nr
      ,points_sum
      ,(SELECT count(*)
        FROM   contacts c
        WHERE  c.created_at < (p.month + interval '1 month')) AS contact_count
FROM   (
    SELECT date_trunc('MONTH', created_at) AS month
          ,sum(score) AS points_sum
    FROM   points p
    GROUP  BY 1
    ) p
ORDER  BY 1

The fix for the immediate cause of your error is to put the aggregate into a subquery. You were mixing levels in a way that is impossible.
I expect my variant to be slightly faster with big tables. Not sure about smaller tables. Would be great if you'd report back with test results.
Plus a minor fix: < instead of <=.