I need to add row numbers to a large (ca. billion rows) dataset in BigQuery. When I try:
SELECT
*
ROW_NUMBER() OVER (ORDER BY d_arf DESC) plarf
FROM [trigram.trigrams8]
I get "Resources exceeded during query execution.", because an analytic/window function needs to fit in one node.
How can I add row numbers to a large dataset in BigQuery?
You didn't give me a working query, so I had to create my own, so you'll need to translate it to your own problem space. Also I'm not sure why do you want to give a row number to each row in such a huge dataset, but challenge accepted:
SELECT a.enc, plarf, plarf+COALESCE(INTEGER(sumc), (0)) row_num
FROM (
SELECT STRING(year)+STRING(month)+STRING(mother_age)+state enc,
ROW_NUMBER() OVER (PARTITION BY year ORDER BY enc) plarf,
year
FROM [publicdata:samples.natality] ) a
LEFT JOIN (
SELECT COUNT(*) c, year+1 year, SUM(c) OVER(ORDER BY year) sumc
FROM [publicdata:samples.natality]
GROUP BY year
) b
ON a.year=b.year