Has anyone used tablefunc
to pivot on multiple variables as opposed to only using row name? The documentation notes:
The "extra" columns are expected to be the same for all rows with the same row_name value.
I'm not sure how to do this without combining the columns that I want to pivot on (which I highly doubt will give me the speed I need). One possible way to do this would be to make the entity numeric and add it to the localt as milliseconds, but this seems like a shaky way to proceed.
I've edited the data used in a response to this question: PostgreSQL Crosstab Query.
CREATE TEMP TABLE t4 (
timeof timestamp
,entity character
,status integer
,ct integer);
INSERT INTO t4 VALUES
('2012-01-01', 'a', 1, 1)
,('2012-01-01', 'a', 0, 2)
,('2012-01-02', 'b', 1, 3)
,('2012-01-02', 'c', 0, 4);
SELECT * FROM crosstab(
'SELECT timeof, entity, status, ct
FROM t4
ORDER BY 1,2,3'
,$$VALUES (1::text), (0::text)$$)
AS ct ("Section" timestamp, "Attribute" character, "1" int, "0" int);
Returns:
Section | Attribute | 1 | 0 ---------------------------+-----------+---+--- 2012-01-01 00:00:00 | a | 1 | 2 2012-01-02 00:00:00 | b | 3 | 4
So as the documentation states, the extra column aka 'Attribute' is assumed to be the same for each row name aka 'Section'. Thus, it reports b for the second row even though 'entity' also has a 'c' value for that 'timeof' value.
Desired Output:
Section | Attribute | 1 | 0
--------------------------+-----------+---+---
2012-01-01 00:00:00 | a | 1 | 2
2012-01-02 00:00:00 | b | 3 |
2012-01-02 00:00:00 | c | | 4
Any thoughts or references?
A little more background: I potentially need to do this for billions of rows and I'm testing out storing this data in long and wide formats and seeing if I can use tablefunc
to go from long to wide format more efficiently than with regular aggregate functions.
I'll have about 100 measurements made every minute for around 300 entities. Often, we will need to compare the different measurements made for a given second for a given entity, so we will need to go to wide format very often. Also, the measurements made on a particular entity are highly variable.
EDIT: I found a resource on this: http://www.postgresonline.com/journal/categories/24-tablefunc.
The problem with your query is that b
and c
share the same timestamp 2012-01-02 00:00:00
, and you have the timestamp
column timeof
first in your query, so - even though you added bold emphasis - b
and c
are just extra columns that fall in the same group 2012-01-02 00:00:00
. Only the first (b
) is returned since (quoting the manual):
The
row_name
column must be first. Thecategory
andvalue
columns must be the last two columns, in that order. Any columns betweenrow_name
andcategory
are treated as "extra". The "extra" columns are expected to be the same for all rows with the samerow_name
value.
Bold emphasis mine.
Just revert the order of the first two columns to make entity
the row name and it works as desired:
SELECT * FROM crosstab(
'SELECT entity, timeof, status, ct
FROM t4
ORDER BY 1'
,'VALUES (1), (0)')
AS ct (
"Attribute" character
,"Section" timestamp
,"status_1" int
,"status_0" int);
entity
must be unique, of course.
row_name
firstextra
columns nextcategory
(as defined by the second parameter) and value
last.Extra columns are filled from the first row from each row_name
partition. Values from other rows are ignored, there is only one column per row_name
to fill. Typically those would be the same for every row of one row_name
, but that's up to you.
SELECT localt, entity
, msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05 -- , more?
FROM crosstab(
'SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
, localt, entity -- additional columns
, msrmnt, val
FROM test
-- WHERE ??? -- instead of LIMIT at the end
ORDER BY localt, entity, msrmnt
-- LIMIT ???' -- instead of LIMIT at the end
, $$SELECT generate_series(1,5)$$) -- more?
AS ct (row_name int, localt timestamp, entity int
, msrmnt01 float8, msrmnt02 float8, msrmnt03 float8, msrmnt04 float8, msrmnt05 float8 -- , more?
)
LIMIT 1000 -- ??!!
No wonder the queries in your test perform terribly. Your test setup has 14M rows and you process all of them before throwing most of it away with LIMIT 1000
. For a reduced result set add WHERE conditions or a LIMIT to the source query!
Plus, the array you work with is needlessly expensive on top of it. I generate a surrogate row name with dense_rank() instead.
db<>fiddle here - with a simpler test setup and fewer rows.