Row number in BigQuery?

Question 1

Row number in BigQuery?

google-bigquery

Leo Stefa · Jun 15, 2012 · Viewed 56.2k times · Source

Answer

Answer

2018 update: If all you want is a unique id for each row

#standardSQL
SELECT GENERATE_UUID() uuid
 , * 
FROM table

2018 #standardSQL solution:

SELECT
  ROW_NUMBER() OVER() row_number, contributor_username,
  count
FROM (
  SELECT contributor_username, COUNT(*) count
  FROM `publicdata.samples.wikipedia`
  GROUP BY contributor_username
  ORDER BY COUNT DESC
  LIMIT 5)

But what about "Resources exceeded during query execution: The query could not be executed in the allotted memory. OVER() operator used too much memory.."

Ok, let's reproduce that error:

SELECT *, ROW_NUMBER() OVER() 
FROM `publicdata.samples.natality`

Yes - that happens because OVER() needs to fit all data into one VM - which you can solve with PARTITION:

SELECT *, ROW_NUMBER() OVER(PARTITION BY year, month) rn 
FROM `publicdata.samples.natality`

"But now many rows have the same row number and all I wanted was a different id for each row"

Ok, ok. Let's use partitions to give a row number to each row, and let's combine that row number with the partition fields to get an unique id per row:

SELECT *
  , FORMAT('%i-%i-%i', year, month, ROW_NUMBER() OVER(PARTITION BY year, month)) id
FROM `publicdata.samples.natality`

The original 2013 solution:

Good news: BigQuery now has a row_number function.

Simple example:

SELECT [field], ROW_NUMBER() OVER()
FROM [table]
GROUP BY [field]

More complex, working example:

SELECT
  ROW_NUMBER() OVER() row_number,
  contributor_username,
  count,
FROM (
  SELECT contributor_username, COUNT(*) count,
  FROM [publicdata:samples.wikipedia]
  GROUP BY contributor_username
  ORDER BY COUNT DESC
  LIMIT 5)

Question 2

Is there any way to get row number for each record in BigQuery? (From the specs, I haven't seen anything about it) There is a NTH() function, but that applies to repeated fields.

There are some scenarios where row number is not necessary in BigQuery, such as the use of TOP() or LIMIT function. However, I need it to simulate some analytical functions, such as a cumulative sum(). For that purpose I need to identify each record with a sequential number. Any workaround on this?

Thanks in advance for your help!

Leo

Row number in BigQuery?

Answer

Related questions