I hear the term "ETL tool" used a lot lately and started digging in to learn more about them. I found a great list of open source ETL tools for Java here and started reading up on them.
But that made me really confused.
Most of these tools (CloverETL, Pentaho, etc.) are GUI tools. Some of them, such as Smooks, are pure Java frameworks. I guess this makes sense: some ETL users may be non-technical and/or would like to use a GUI tool to set up transformations. Other users will be developers who want to tap into the raw ETL power of these tools.
My questions: are there any benefits (additional features, etc.) that these GUI tools offer over the pure Java frameworks, or vice versa? Do the "major player" tools such as CloverETL and Pentaho, which bill themselves as GUI tools, also have Java APIs that let me accomplish the same things programmatically? Or are they pure GUI tools? I can't find Javadocs anywhere (for either one).
I would say no: there's no real advantage in using a non-GUI tool for ETL.
In most typical situations, a GUI approach is much more efficient for ETL jobs, because the tool should let you build data tasks rapidly and with almost no custom code. That's because an ETL platform is, by design, little more than a code generation platform: the task drawn on the canvas is translated (ideally, in the most suitable way) by the tool's engine into executable code that runs directly under the hood, without intermediate agents. The bigger platforms have a complex client-server architecture, but the basic idea stays the same.
How deeply this generated code is hidden depends on the platform. Some, like Pentaho or DataStage, make it essentially inaccessible to the user; others, like Talend (which produces Java code in a class that can easily be embedded in an application or executed directly) or SAS Data Integration Studio (which produces a .sas file), give the developer the possibility to dig into the generated code. But that is always an option left to the hardcore developer; the regular user will almost never go into the code to do her everyday job.
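To make the "job as an embeddable Java class" idea concrete, here is a minimal hand-written sketch of an extract/transform/load job in plain Java. It is not output from Talend or any other tool; the class name `MiniEtl`, the `name,amount` input layout, and the filter rule are all assumptions made up for illustration. It only shows the kind of logic that a drawn job boils down to, and that you would write yourself with a pure Java approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; not generated by any real ETL tool.
class MiniEtl {

    // Extract: split "name,amount" lines into fields, skipping the header row.
    static List<String[]> extract(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            rows.add(line.split(","));
        }
        return rows;
    }

    // Transform: normalize the name and keep only rows with a positive amount.
    static List<String> transform(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            double amount = Double.parseDouble(row[1].trim());
            if (amount > 0) {
                out.add(row[0].trim().toUpperCase() + "=" + amount);
            }
        }
        return out;
    }

    // Load: a real job would write to a database or file; here we just print.
    static void load(List<String> records) {
        records.forEach(System.out::println);
    }

    public static void main(String[] args) {
        // Assumed in-memory source standing in for a CSV file.
        List<String> source = List.of("name,amount", "alice, 10.5", "bob, -3", "carol, 7");
        load(transform(extract(source)));
    }
}
```

Because the steps are ordinary static methods, an application can call `MiniEtl.transform(MiniEtl.extract(lines))` directly, which is roughly the appeal of a tool that exposes its jobs as Java classes.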