I'm looking for an embeddable Java ETL, i.e., an Extract Transform Load engine that can be called from Java code.
I'm finding it surprisingly hard to find a suitable one.
I'm mainly looking at loading delimited text files into database tables, with some minor transforms along the way.
I'd like the following features:
CloverETL allows mapping to be specified in XML, but database connections must be either JNDI names or a properties file specifying driverClass, url, dbusername, password, etc. Since I already have javax.sql.Datasource
s set up by my dependency injection framework, properties files seem painful and non-robust, especially if I want this to work in several environments (dev, test, prod).
KETL tells me that "We are currently in the process of completely overhauling our documentation for KETLâ„¢. Because of this, only the installation guide has been updated." Honest, but not helpful.
Octopus is now "http://www.together.at/prod/database/tdt", which is "under construction".
Pentaho seems to use the same "specify driverClass" style that CloverETL does, rather that using a datasource, but Pentaho's documentation for calling their engine from java code is just difficult to find.
Basically I'd really like to be able to do this pseudo-code:
extractTransformLoad(
getInputFile( "input.csv" ) ,
getXMLMapping( "myMappingFile.xml") ,
new DatabaseWriter( getDatasource() );
Any suggestions?
Disclosure: I'm the author of Scriptella ETL, but I believe this tool might be useful for your case.
It's a lightweight open source ETL with a one-liner integration with Java. It also supports Spring Framework and comes with built-in drivers for CSV, text, XML, Excel and other data-sources.
<!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
<etl>
<connection id="in" driver="csv" url="data.csv" />
<connection id="out" driver="oracle" url="jdbc:oracle:thin:@localhost:1521:ORCL"
classpath="ojdbc14.jar" user="scott" password="tiger" />
<!-- Copy all CSV rows to a database table -->
<query connection-id="in">
<!-- Empty query means select all columns -->
<script connection-id="out">
INSERT INTO Table_Name VALUES (?id,?priority, ?summary, ?status)
</script>
</query>
</etl>
// Execute etl.xml file
EtlExecutor.newExecutor(new File("etl.xml")).execute();
scriptella [file_name]
Use "spring"
driver and the name of the bean to references data-sources. Example:
<connection id="spring" driver="spring" url="datasourceBeanName" />
Add EtlExecutorBean
to the application context in order to execute the job:
<bean id="createDb" class="scriptella.driver.spring.EtlExecutorBean">
<property name="configLocation" value="create-db.etl.xml" />
<property name="progressIndicator"><ref local="progress" /></property>
<property name="autostart" value="true" /> <!-- Etl will be run during app context initialization -->
</bean>
For additional details see the Spring example.