Here's my table 'tab_test':
year animal price
2000 kittens 79
2000 kittens 93
2000 kittens 100
2000 puppies 15
2000 puppies 32
2001 kittens 31
2001 kittens 17
2001 puppies 65
2001 puppies 48
2002 kittens 84
2002 kittens 86
2002 puppies 15
2002 puppies 95
2003 kittens 62
2003 kittens 24
2003 puppies 36
2003 puppies 41
2004 kittens 65
2004 kittens 85
2004 puppies 58
2004 puppies 95
2005 kittens 45
2005 kittens 25
2005 puppies 15
2005 puppies 35
2006 kittens 50
2006 kittens 80
2006 puppies 95
2006 puppies 49
2007 kittens 40
2007 kittens 19
2007 puppies 81
2007 puppies 38
2008 kittens 37
2008 kittens 51
2008 puppies 29
2008 puppies 72
2009 kittens 84
2009 kittens 26
2009 puppies 49
2009 puppies 34
2010 kittens 75
2010 kittens 96
2010 puppies 18
2010 puppies 26
2011 kittens 35
2011 kittens 21
2011 puppies 90
2011 puppies 18
2012 kittens 12
2012 kittens 23
2012 puppies 74
2012 puppies 79
Here's some code that pivots the rows into columns so I get an average for 'kittens' and 'puppies' per year:
SELECT
  year,
  AVG(CASE WHEN animal = 'kittens' THEN price END) AS "kittens",
  AVG(CASE WHEN animal = 'puppies' THEN price END) AS "puppies"
FROM tab_test
GROUP BY year
ORDER BY year;
The output for the code above is:
year kittens puppies
2000 90.6666666666667 23.5
2001 24.0 56.5
2002 85.0 55.0
2003 43.0 38.5
2004 75.0 76.5
2005 35.0 25.0
2006 65.0 72.0
2007 29.5 59.5
2008 44.0 50.5
2009 55.0 41.5
2010 85.5 22.0
2011 28.0 54.0
2012 17.5 76.5
What I'd like is a table like the second one, but containing only items that had a COUNT()
of at least 3 in the first table. In other words, the goal is to have this as output:
year kittens
2000 90.6666666666667
There were at least 3 instances of 'kittens' in the first table.
Is this possible in PostgreSQL?
CASE
If your case is as simple as demonstrated, a CASE expression will do:
SELECT year
     , sum(CASE WHEN animal = 'kittens' THEN price END) AS kittens
     , sum(CASE WHEN animal = 'puppies' THEN price END) AS puppies
FROM  (
   SELECT year, animal, avg(price) AS price
   FROM   tab_test
   GROUP  BY year, animal
   HAVING count(*) > 2
   ) t
GROUP  BY year
ORDER  BY year;
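With the sample data above, only (2000, 'kittens') has at least 3 rows, so the subquery returns a single row and the output is just:
year kittens puppies
2000 90.6666666666667
(puppies is NULL for 2000, since that group has only 2 rows and is filtered out.)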
It doesn't matter whether you use sum(), max() or min() as the aggregate function in the outer query; they all return the same value here, because the inner query leaves at most one row per (year, animal), so each CASE expression sees at most one non-null price per year.
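As an aside: on Postgres 9.4 or later, the same conditional aggregation can be written with the aggregate FILTER clause. A minimal sketch against the same tab_test (not used in the 9.1 benchmark below):
SELECT year
     , min(price) FILTER (WHERE animal = 'kittens') AS kittens
     , min(price) FILTER (WHERE animal = 'puppies') AS puppies
FROM  (
   SELECT year, animal, avg(price) AS price
   FROM   tab_test
   GROUP  BY year, animal
   HAVING count(*) > 2  -- keep only groups with at least 3 rows
   ) t
GROUP  BY year
ORDER  BY year;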
crosstab()
With more categories, a crosstab() query is simpler. It should also be faster for bigger tables.
You need to install the additional module tablefunc (once per database). Since Postgres 9.1 that's as simple as:
CREATE EXTENSION tablefunc;
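If you are unsure whether the module is already installed, a quick check against the system catalog (a sketch):
SELECT extname, extversion
FROM   pg_extension
WHERE  extname = 'tablefunc';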
Details in this related answer.
SELECT *
FROM   crosstab(
   -- 1st parameter: source query - one row per (year, animal) with at least 3 rows
   'SELECT year, animal, avg(price) AS price
    FROM   tab_test
    GROUP  BY animal, year
    HAVING count(*) > 2
    ORDER  BY 1, 2'
   -- 2nd parameter: category values; must match the output columns below, in the same order
  , $$VALUES ('kittens'::text), ('puppies')$$
   ) AS ct ("year" text, "kittens" numeric, "puppies" numeric);
No sqlfiddle for this one because the site doesn't allow additional modules.
To verify my claims I ran a quick benchmark with close-to-real data in my small test database (PostgreSQL 9.1.6). Tested with EXPLAIN ANALYZE, best of 10:
Test setup with 10020 rows:
CREATE TABLE tab_test (year int, animal text, price numeric);
-- years with lots of rows
INSERT INTO tab_test
SELECT 2000 + (g + random() * 300)::int / 1000
     , CASE WHEN (g + (random() * 1.5)::int) % 2 = 0 THEN 'kittens' ELSE 'puppies' END
     , (random() * 200)::numeric
FROM   generate_series(1, 10000) g;
-- ... and some years with only a few rows to include cases with count < 3
INSERT INTO tab_test
SELECT 2010 + (g + random() * 10)::int / 2
     , CASE WHEN (g + (random() * 1.5)::int) % 2 = 0 THEN 'kittens' ELSE 'puppies' END
     , (random() * 200)::numeric
FROM   generate_series(1, 20) g;
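Each candidate was timed by wrapping it in EXPLAIN ANALYZE and keeping the best of 10 runs; for example (a sketch, shown here for the crosstab() variant):
EXPLAIN ANALYZE
SELECT *
FROM   crosstab(
   'SELECT year, animal, avg(price) AS price
    FROM   tab_test
    GROUP  BY animal, year
    HAVING count(*) > 2
    ORDER  BY 1, 2'
  , $$VALUES ('kittens'::text), ('puppies')$$
   ) AS ct ("year" text, "kittens" numeric, "puppies" numeric);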
Results:
@bluefeet:
Total runtime: 95.401 ms
@wildplasser (different results, includes rows with count < 3):
Total runtime: 64.497 ms
@Andreiy (+ ORDER BY) & @Erwin1 - CASE (both perform about the same):
Total runtime: 39.105 ms
@Erwin2 - crosstab():
Total runtime: 17.644 ms
Largely proportional (but irrelevant) results with only 20 rows. Only @wildplasser's CTE has more overhead and spikes a little. With more than a handful of rows, crosstab() quickly takes the lead.
@Andreiy's query performs about the same as my simplified version; the aggregate function in the outer SELECT (min(), max(), sum()) makes no measurable difference (just two rows per group).
Everything as expected, no surprises. Take my setup and try it @home.