I am doing data mining on Reddit data on Google BigQuery and I wanna top 1000 posts ranked by the score for each subreddit for the whole 201704 data. I have tried different techniques but due to the limitation of BigQuery, the result is too large to return.
select body, score, subreddit from
(
select body, score, subreddit,row_number() over
(
partition by subreddit order by score desc
) mm
from [fh-bigquery:reddit_comments.2017_04]
)
where mm <= 1000 AND subreddit in
(
select subreddit from
(
select Count(subreddit) as counts, subreddit from
[fh-bigquery:reddit_comments.2017_04] GROUP BY subreddit ORDER BY counts DESC
LIMIT 10000
)
)
LIMIT 10000000
Is there any way to divide and conquer this problem since enabling large query results means could not do any complex query. Does Google provide payment option for large query resource?
I wanna top 1000 posts ranked by the score for each subreddit for the whole 201704 data
I just tested this query:
SELECT
subreddit,
ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY 1
It processed the whole dataset in 22s:
In your query it seems though that you want the posts and scores of the top 10000 most popular subreddits. I tried this query:
SELECT
subreddit,
ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
WHERE subreddit IN(
SELECT subreddit FROM(
SELECT
subreddit
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY subreddit
ORDER BY count(body) DESC
LIMIT 10000)
)
GROUP BY 1
And got results in 26s:
Hopefully these results are what you are looking for. Let me know if everything is correct.