Sqoop Import Split by Column Data type

Bagavathi picture Bagavathi · Oct 14, 2016 · Viewed 7.3k times · Source

Should the datatype of Split by column in sqoop import always be a number datatype (integer, bignint, numeric)? Can't it be a string?

Answer

Dev picture Dev · Oct 14, 2016

Yes you can split on any non numeric datatype.

But this is not recommended.

WHY?

For splitting data Sqoop fires

SELECT MIN(col1), MAX(col2) FROM TABLE

then divide it as per you number of mappers.

Now take an example of integer as --split-by column

Table has some id column having value 1 to 100 and you using 4 mappers (-m 4 in your sqoop command)

Sqoop get MIN and MAX value using:

SELECT MIN(id), MAX(id) FROM TABLE

OUTPUT:

1,100

Splitting on integer is easy. You will make 4 parts:

  • 1-25
  • 25-50
  • 51-75
  • 76-100

Now string as --split-by column

Table has some name column having value "dev" to "sam" and you using 4 mappers (-m 4 in your sqoop command)

Sqoop get MIN and MAX value using:

SELECT MIN(id), MAX(id) FROM TABLE

OUTPUT:

dev,sam

Now how will it be divided in 4 parts. As per sqoop docs,

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */

And you will see the warning in the code:

LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
    + "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");

In case of Integer example, all the mappers will get balanced load (all will fetch 25 records from RDBMS).

In case of string, there is less probability that data is sorted. So, it's difficult to give similar loads to all the mappers.


In a nutshell, Go for integer column as --split-by column.