Im trying to play with the reddit data on bigquery and I want to see comments and replies in one row. I see bigquery supports subqueries, but I am unable to construct the query. I have to use a subquery to self join the same table because of the structure of the data, specifically i want to join id and parent_id together, but I need to modify id before I can join. Here is how im trying to do the query:
SELECT
p.subreddit,
p.body AS first_body,
p.score AS first_score,
CONCAT('t1_',p.id) AS first_id ,
c.last_body,
c.last_score,
c.last_id
FROM
[fh-bigquery:reddit_comments.2016_01] p,
(
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
WHERE p.first_id = c.parent_id
AND p.score > 1
AND p.author != '[deleted]'
AND p.body != '[deleted]';
The error I get is:
Field 'c.parent_id' not found in table 'fh-bigquery:reddit_comments.2016_01'; did you mean 'parent_id'?
Here is where you can run the query: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2016_01
Im not sure how to fix this. what is the proper way to join this and get this query to run?
You might want to do something like below (just guess):
SELECT
p.subreddit,
p.body AS first_body,
p.score AS first_score,
CONCAT('t1_',p.id) AS first_id ,
c.last_body,
c.last_score,
c.last_id
FROM
[fh-bigquery:reddit_comments.2016_01] p
JOIN (
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
ON p.link_id = c.parent_id
WHERE p.score > 1
AND p.author != '[deleted]'
AND p.body != '[deleted]'
LIMIT 100
See more about JOINs
Please note, I just transformed your query to proper use JOINs, but logic of query is still for you to polish as you see needed
Added to address additional info in your comment:
SELECT
subreddit,
first_body,
first_score,
first_id ,
last_body,
last_score,
last_id
FROM (
SELECT
subreddit,
body AS first_body,
score AS first_score,
CONCAT('t1_',id) AS first_id
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE score > 1
AND author != '[deleted]'
AND body != '[deleted]'
) p
JOIN (
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
ON p.first_id = c.parent_id
LIMIT 100