how to do subqueries in bigquery?

jas picture jas · Feb 22, 2016 · Viewed 10.3k times · Source

Im trying to play with the reddit data on bigquery and I want to see comments and replies in one row. I see bigquery supports subqueries, but I am unable to construct the query. I have to use a subquery to self join the same table because of the structure of the data, specifically i want to join id and parent_id together, but I need to modify id before I can join. Here is how im trying to do the query:

SELECT 
  p.subreddit, 
  p.body AS first_body,
  p.score AS first_score,
  CONCAT('t1_',p.id) AS first_id ,
  c.last_body,
  c.last_score,
  c.last_id 
FROM 
[fh-bigquery:reddit_comments.2016_01] p,
(
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
WHERE  p.first_id = c.parent_id  
AND p.score > 1 
AND  p.author != '[deleted]' 
AND p.body != '[deleted]';

The error I get is:

Field 'c.parent_id' not found in table 'fh-bigquery:reddit_comments.2016_01'; did you mean 'parent_id'?

Here is where you can run the query: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2016_01

Im not sure how to fix this. what is the proper way to join this and get this query to run?

Answer

Mikhail Berlyant picture Mikhail Berlyant · Feb 22, 2016

You might want to do something like below (just guess):

SELECT 
  p.subreddit, 
  p.body AS first_body,
  p.score AS first_score,
  CONCAT('t1_',p.id) AS first_id ,
  c.last_body,
  c.last_score,
  c.last_id 
FROM 
[fh-bigquery:reddit_comments.2016_01] p
JOIN (
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
ON  p.link_id = c.parent_id  
WHERE p.score > 1 
AND  p.author != '[deleted]' 
AND p.body != '[deleted]'
LIMIT 100

See more about JOINs

Please note, I just transformed your query to proper use JOINs, but logic of query is still for you to polish as you see needed

Added to address additional info in your comment:

SELECT 
  subreddit, 
  first_body,
  first_score,
  first_id ,
  last_body,
  last_score,
  last_id 
FROM (
  SELECT 
    subreddit, 
    body AS first_body,
    score AS first_score,
    CONCAT('t1_',id) AS first_id 
  FROM [fh-bigquery:reddit_comments.2016_01]
  WHERE score > 1 
  AND author != '[deleted]' 
  AND body != '[deleted]'
) p
JOIN (
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
ON  p.first_id = c.parent_id  
LIMIT 100