Is there a more efficient way to read from BigQuery into Spark?

Mahmoud Hanafy · Jan 4, 2017

When using the BigQuery Connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads it into Spark in parallel. When reading a big table, the copying stage takes a very long time. Is there a more efficient way to read data from BigQuery into Spark?

Another question: reading from BigQuery consists of two stages (copying to GCS, then reading in parallel from GCS). Is the copying stage affected by the Spark cluster size, or does it take a fixed time?

Answer

Graham Polley · Jan 4, 2017

Maybe a Googler will correct me, but AFAIK that's the only way. This is because under the hood it also uses the BigQuery Connector for Hadoop, which according to the docs:

The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job.

As a side note, this is also true when using Dataflow - it too performs an export of BigQuery table(s) to GCS first and then reads them in parallel.

As for whether the copying stage (which is essentially a BigQuery export job) is influenced by your Spark cluster size: no. BigQuery uses its own resources for exporting to GCS, i.e. not your Spark cluster. It isn't a fixed time either: BigQuery export jobs are nondeterministic, so how long the export takes can vary from run to run.
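
To make the two stages concrete, here is a minimal PySpark sketch of how the Hadoop connector is typically wired up. The `mapred.bq.*` keys and the `JsonTextBigQueryInputFormat` class are the ones used in the connector's documented Spark example. The helper names (`bigquery_input_conf`, `read_table`) and the project, bucket, dataset, and table values are made up for illustration. The GCS path in `mapred.bq.temp.gcs.path` is where the export lands before Spark reads it.

```python
def bigquery_input_conf(project, bucket, temp_path, dataset, table,
                        input_project=None):
    """Build the Hadoop configuration dict read by the BigQuery connector.

    Hypothetical helper: only the configuration keys come from the
    connector, the function itself is illustrative.
    """
    return {
        # Project that runs (and is billed for) the export job.
        "mapred.bq.project.id": project,
        # Bucket and path that receive the exported table data.
        "mapred.bq.gcs.bucket": bucket,
        "mapred.bq.temp.gcs.path": temp_path,
        # Table to read; it may live in a different project.
        "mapred.bq.input.project.id": input_project or project,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,
    }


def read_table(sc, conf):
    """Return an RDD of (key, JSON string) records for the configured table.

    Assumes `sc` is a live SparkContext on a cluster that has the
    BigQuery connector jar on its classpath (e.g. Dataproc).
    """
    return sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )


# Placeholder values, not real resources.
conf = bigquery_input_conf(
    project="my-project",
    bucket="my-bucket",
    temp_path="gs://my-bucket/tmp/bq_export",
    dataset="my_dataset",
    table="my_table",
)
```

On a cluster you would then call `read_table(sc, conf)`. Nothing in this configuration refers to the cluster size, which matches the point above: the export runs on BigQuery's side, and only the subsequent read from GCS is parallelized across your Spark workers.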