Best Practice to migrate data from MySQL to BigQuery

Asked Jan 21, 2017 · Viewed 8.5k times

I tried several CSV formats (different escape characters, quoting and other settings) to export data from MySQL and import it into BigQuery, but I was not able to find a solution that works in every case.

Google Cloud SQL requires the following statement for importing/exporting from/to MySQL. Although Cloud SQL is not BigQuery, it is a good starting point:

SELECT * INTO OUTFILE 'filename.csv' CHARACTER SET 'utf8' 
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '' FROM table

At the moment I use the following command to import a compressed CSV into BigQuery:

bq --nosync load -F "," --null_marker "NULL" --format=csv \
    PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json

On the one hand, the bq command does not allow setting an escape character (" is escaped by another ", which seems to be a well-defined CSV format). On the other hand, using \" as the escape character for the MySQL export would lead to "N as the NULL value, which does not work either:

CSV table references column position 34, but line starting at position:0 contains only 34 columns. (error code: invalid)
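For illustration, take a hypothetical three-column table (id, name, comment). With OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"', MySQL writes rows like:

1,"Alice","said ""hi"""
2,"Bob","N

The "N in the second row is MySQL's escaped NULL marker; bq treats the " as the start of a quoted field, so the row no longer splits into the expected number of columns.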

So my question is: how do I write a (table-independent) export statement for MySQL such that the generated file can be loaded into BigQuery? Which escape character should be used, and how should NULL values be handled/set?

Answer

blmayer · Jun 9, 2017

I've been running into the same problem; here's my solution:

Exporting data from MySQL

First, export the data from MySQL this way:

SELECT * INTO OUTFILE 'filename.csv' CHARACTER SET 'utf8' 
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '' 
FROM <yourtable>

This is in reality a TSV file (tab-separated values), but you can import it as CSV nonetheless.
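Note that SELECT ... INTO OUTFILE writes the file on the MySQL server host and requires the FILE privilege. If you cannot write there, a minimal client-side sketch producing an equivalent tab-separated file could look like this (host, user and database names are placeholders):

mysql --host=<host> --user=<user> --password --batch --quick --skip-column-names \
    --execute="SELECT * FROM <yourtable>" <database> | gzip > data.tsv.gz

In batch mode the mysql client prints NULL as the literal string NULL and backslash-escapes tabs and newlines inside values, so such a file would have to be loaded with --null_marker="NULL" instead.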

Import into BigQuery

This way you should be able to import it into BigQuery with the following parameters:

bq load --field_delimiter="\t" --null_marker="\N" --quote="" \
PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json
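For reference, a minimal end-to-end sketch (file, bucket, project and dataset names are placeholders) that compresses the exported file, stages it in Cloud Storage and runs the load above:

gzip filename.csv                                  # BigQuery can load gzip-compressed CSV/TSV directly
gsutil cp filename.csv.gz gs://bucket/data.csv.gz  # stage the file in Cloud Storage
bq load --field_delimiter="\t" --null_marker="\N" --quote="" \
    PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json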

Notes

  1. If any field in your MySQL database contains a tab character (\t), it will break your columns. To prevent that, you can apply the SQL function REPLACE(<column>, '\t', ' ') to the affected columns, which converts tabs to spaces (see the sketch after this list).

  2. If you set the table schema in BigQuery's web interface, you won't need to specify it every time you load a CSV.
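Here is note 1 applied to the export statement, as a sketch for a hypothetical table with the columns id, name and comment, where only comment can contain tabs:

SELECT id, name, REPLACE(comment, '\t', ' ')
INTO OUTFILE 'filename.csv' CHARACTER SET 'utf8'
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY ''
FROM <yourtable>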

I hope this works for you.