What is the difference between --split-by and --boundary-query in SQOOP?

burakongun picture burakongun · Nov 28, 2016 · Viewed 18k times · Source

Assuming we don't have a column where values are equally distributed, let's say we have a command like this:

sqoop import \
...
--boundary-query "SELECT min(id), max(id) from some_table"
--split-by id
...

What's the point using --boundary-query here while --split-by does the same thing? Is there any other way to use --boundary-query? Or any other way to split data more efficiently when there is no key(unique) column?

Answer

Dev picture Dev · Nov 28, 2016

--split-by id will split your data uniformly on the basis of number of mappers (default 4).

Now boundary query by default is something like this.

--boundary-query "SELECT min(id), max(id) from some_table"

But if you know id starts from val1 and ends with val2. Then there is no point to calculate min() and max() operations. This will make sqoop command execution faster.

You can specify any arbitrary query returning val1 and val2.


Edit:

Right now (1.4.7) there is no way in sqoop to specify uneven partitions for splitting.

For example, you have data like:

1,2,3,51,52,191,192,193,194,195,196,197,198,199,200

If you defined 4 mappers in the command. It will check min and max which is 1 and 200 in our case.

Then it will split it into 4 parts:

1-50
51-100
101-150
151-200

Yes, in this 3rd mapper(101-150) will get nothing from the RDBMS table.

But there is no way to define custom partition like :

1-10
51-60
190-200

For large data (billions of rows), practically it is not suitable to find exact values like this or use another tool to find data pattern first and then prepare custom partitions.