BigQuery COUNT(DISTINCT value) vs COUNT(value)

Question 1

BigQuery COUNT(DISTINCT value) vs COUNT(value)

google-bigquery

Balazs Gunics · May 17, 2013 · Viewed 55.7k times · Source

Answer

Answer

In BigQuery, COUNT DISTINCT is a statistical approximation for all results greater than 1000.

You can provide an optional second argument to give the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual amount of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.

See the complete documentation here: https://developers.google.com/bigquery/docs/query-reference#aggfunctions

UPDATE 2017:

With BigQuery #standardSQL COUNT(DISTINCT) is always exact. For approximate results use APPROX_COUNT_DISTINCT(). Why would anyone use approx results? See this article.

Question 2

I found a glitch/bug in bigquery. We got a table based on Bank Statistic data under the starschema.net:clouddb:bank.Banks_token

If i run the following query:

SELECT count(*) as totalrow,
count(DISTINCT BankId ) as bankidcnt
FROM bank.Banks_token;

And i get the following result:

Row totalrow    bankidcnt    
1   9513    9903

My problem is that if i have 9513row how could i get 9903row, which is 390row more than the rowcount in the table.

BigQuery COUNT(DISTINCT value) vs COUNT(value)

Answer

Related questions