BigQuery COUNT(DISTINCT value) vs COUNT(value)

Balazs Gunics picture Balazs Gunics · May 17, 2013 · Viewed 55.7k times · Source

I found a glitch/bug in bigquery. We got a table based on Bank Statistic data under the starschema.net:clouddb:bank.Banks_token

If i run the following query:

SELECT count(*) as totalrow,
count(DISTINCT BankId ) as bankidcnt
FROM bank.Banks_token;

And i get the following result:

Row totalrow    bankidcnt    
1   9513    9903    

My problem is that if i have 9513row how could i get 9903row, which is 390row more than the rowcount in the table.

Answer

Jeremy Condit picture Jeremy Condit · May 19, 2013

In BigQuery, COUNT DISTINCT is a statistical approximation for all results greater than 1000.

You can provide an optional second argument to give the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual amount of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.

See the complete documentation here: https://developers.google.com/bigquery/docs/query-reference#aggfunctions


UPDATE 2017:

With BigQuery #standardSQL COUNT(DISTINCT) is always exact. For approximate results use APPROX_COUNT_DISTINCT(). Why would anyone use approx results? See this article.