How to create dummy variable columns for thousands of categories in Google BigQuery?

Question 1

How to create dummy variable columns for thousands of categories in Google BigQuery?

mysql sql google-bigquery dummy-variable

wubr2000 · Dec 1, 2015 · Viewed 14k times · Source

Answer

Answer

You can use below "technic"

First run query #1. It produces the query (query #2) that you need to run to get result you need. Please, still consider Mosha's comments before going "wild" with thousands categories :o)

Query #1:

SELECT 'select UserID, ' + 
   GROUP_CONCAT_UNQUOTED(
    'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
   ) 
   + ' from YourTable group by UserID'
FROM (
  SELECT category 
  FROM YourTable  
  GROUP BY category
)

Resulted will be like below - Query #2

SELECT
  UserID,
  SUM(IF(category = "A", 1, 0)) AS A,
  SUM(IF(category = "B", 1, 0)) AS B,
  SUM(IF(category = "C", 1, 0)) AS C
FROM
  YourTable
GROUP BY
  UserID

of course for three categories - you could do it manually, but for thousands it will definitelly will make day for you!!

Result of query #2 will looks as you expect:

UserID  A   B   C    
1       1   1   0    
2       0   0   1    
3       1   1   1

Question 2

I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:

UserID   Category
------   --------
1         A
1         B
2         C
3         A
3         C
3         B

I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

UserID    A  B  C
------    -- -- --
1         1  1  0
2         0  0  1
3         1  1  1

My problem is that I have THOUSANDS of categories (not just 3 as in this example) and so this cannot be efficiently accomplished using CASE WHEN statement.

So my questions are:

1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements.

2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?

Thanks.

How to create dummy variable columns for thousands of categories in Google BigQuery?

Answer

Related questions