Java ETL: hard to find a suitable one

tpdi · Nov 23, 2010

I'm looking for an embeddable Java ETL, i.e., an Extract Transform Load engine that can be called from Java code.

I'm finding it surprisingly hard to find a suitable one.

I'm mainly looking at loading delimited text files into database tables, with some minor transforms along the way.

I'd like the following features:

  • the ability to specify the simple mappings externally, e.g., text column 5 to database column foo, in some XML mapping file
  • the ability to give the database node a javax.sql.DataSource

CloverETL allows mappings to be specified in XML, but database connections must be either JNDI names or a properties file specifying driverClass, url, dbusername, password, etc. Since I already have javax.sql.DataSource instances set up by my dependency-injection framework, properties files seem painful and non-robust, especially if I want this to work in several environments (dev, test, prod).
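In other words, each environment needs its own little file along these lines (key names as quoted above; the exact schema is whatever CloverETL's docs say):

driverClass=org.postgresql.Driver
url=jdbc:postgresql://localhost:5432/mydb
dbusername=etl_user
password=secret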

KETL tells me that "We are currently in the process of completely overhauling our documentation for KETL™. Because of this, only the installation guide has been updated." Honest, but not helpful.

Octopus is now "http://www.together.at/prod/database/tdt", which is "under construction".

Pentaho seems to use the same "specify driverClass" style that CloverETL does, rather than using a DataSource, and Pentaho's documentation for calling its engine from Java code is hard to find.

Basically I'd really like to be able to do this pseudo-code:

extractTransformLoad(
        getInputFile( "input.csv" ),
        getXMLMapping( "myMappingFile.xml" ),
        new DatabaseWriter( getDatasource() ) );

Any suggestions?

Answer

ejboy · Nov 7, 2012

Disclosure: I'm the author of Scriptella ETL, but I believe this tool might be useful for your case.

It's a lightweight open-source ETL with one-liner integration with Java. It also supports the Spring Framework and comes with built-in drivers for CSV, text, XML, Excel and other data sources.

Example of importing a CSV file into a table:

<!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
<etl>
  <connection id="in" driver="csv" url="data.csv" />
  <connection id="out" driver="oracle" url="jdbc:oracle:thin:@localhost:1521:ORCL" 
      classpath="ojdbc14.jar" user="scott" password="tiger" />
  <!-- Copy all CSV rows to a database table -->
  <query connection-id="in">
      <!-- Empty query means select all columns -->
      <script connection-id="out">
          INSERT INTO Table_Name VALUES (?id, ?priority, ?summary, ?status)
      </script>
  </query>
</etl>

Running from Java:

// Execute etl.xml file
EtlExecutor.newExecutor(new File("etl.xml")).execute();
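
Spelled out with imports and error handling, the same call looks like this (a minimal self-contained sketch; EtlExecutorException signals a failed run):

import java.io.File;

import scriptella.execution.EtlExecutor;
import scriptella.execution.EtlExecutorException;

public class EtlRunner {
    public static void main(String[] args) {
        try {
            // Parse etl.xml and execute the queries/scripts it defines
            EtlExecutor.newExecutor(new File("etl.xml")).execute();
        } catch (EtlExecutorException e) {
            // Raised on mapping or JDBC errors during the run
            e.printStackTrace();
        }
    }
}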

Running from command-line:

scriptella [file_name]

Integration with Spring:

  1. Use "spring" driver and the name of the bean to references data-sources. Example:

    <connection id="spring" driver="spring" url="datasourceBeanName" />
    
  2. Add EtlExecutorBean to the application context in order to execute the job:

    <bean id="createDb" class="scriptella.driver.spring.EtlExecutorBean">
        <property name="configLocation" value="create-db.etl.xml" />
        <property name="progressIndicator"><ref local="progress" /></property>
        <property name="autostart" value="true" /> <!-- Etl will be run during app context initialization -->
    </bean>
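
To trigger the job on demand rather than at startup, set autostart to false and call the executor yourself. A sketch, assuming the bean definition above lives in app-context.xml and the referenced progress bean is defined (EtlExecutorBean extends EtlExecutor, so execute() is inherited):

import org.springframework.context.support.ClassPathXmlApplicationContext;

import scriptella.driver.spring.EtlExecutorBean;

public class SpringEtlRunner {
    public static void main(String[] args) throws Exception {
        ClassPathXmlApplicationContext ctx =
                new ClassPathXmlApplicationContext("app-context.xml");
        try {
            // Look up the executor defined above and run create-db.etl.xml
            EtlExecutorBean etl = ctx.getBean("createDb", EtlExecutorBean.class);
            etl.execute();
        } finally {
            ctx.close();
        }
    }
}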
    

For additional details see the Spring example.