Should the datatype of Split by column in sqoop import always be a number datatype (integer, bignint, numeric)? Can't it be a string?
Yes you can split on any non numeric datatype.
But this is not recommended.
For splitting data Sqoop fires
SELECT MIN(col1), MAX(col2) FROM TABLE
then divide it as per you number of mappers.
Now take an example of integer as --split-by
column
Table has some id
column having value 1 to 100 and you using 4 mappers (-m 4
in your sqoop command)
Sqoop get MIN and MAX value using:
SELECT MIN(id), MAX(id) FROM TABLE
OUTPUT:
1,100
Splitting on integer is easy. You will make 4 parts:
Now string as --split-by
column
Table has some name
column having value "dev" to "sam" and you using 4 mappers (-m 4
in your sqoop command)
Sqoop get MIN and MAX value using:
SELECT MIN(id), MAX(id) FROM TABLE
OUTPUT:
dev,sam
Now how will it be divided in 4 parts. As per sqoop docs,
/**
* This method needs to determine the splits between two user-provided
* strings. In the case where the user's strings are 'A' and 'Z', this is
* not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
* splits for strings beginning with each letter, etc.
*
* If a user has provided us with the strings "Ham" and "Haze", however, we
* need to create splits that differ in the third letter.
*
* The algorithm used is as follows:
* Since there are 2**16 unicode characters, we interpret characters as
* digits in base 65536. Given a string 's' containing characters s_0, s_1
* .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
* base 65536. Having mapped the low and high strings into floating-point
* values, we then use the BigDecimalSplitter to establish the even split
* points, then map the resulting floating point values back into strings.
*/
And you will see the warning in the code:
LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
+ "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");
In case of Integer example, all the mappers will get balanced load (all will fetch 25 records from RDBMS).
In case of string, there is less probability that data is sorted. So, it's difficult to give similar loads to all the mappers.
In a nutshell, Go for integer column as --split-by
column.