My requirement is to
- Move data from Oracle to HDFS
- Process the data on HDFS
- Move processed data to Teradata.
This entire process needs to run every 15 minutes. The source data volume may be close to 50 GB per run, and the processed data may be of a similar size.
After searching a lot on the internet, I found that I could:
- Use ORAOOP to move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
- Do the large-scale processing with custom MapReduce, Hive, or Pig.
- Use Sqoop with the Teradata connector to move data from HDFS to Teradata (again, wrap the code in a shell script and schedule it).
Is this the right approach in the first place, and is it feasible within the required time window? (Please note that this is not a daily batch or the like.)
Other options I found are the following:
- Storm (for real-time data processing), but I am not able to find an Oracle spout or a Teradata bolt out of the box.
- Open-source ETL tools like Talend or Pentaho.
Please share your thoughts on these options as well and any other possibilities.
Looks like you have several questions, so let's try to break it down.
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in/out of HDFS, and can connect to various databases including Oracle natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
sqoop import --connect jdbc:oracle:thin:@//myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information, see the Sqoop documentation. Note that you can also import directly into a Hive table with Sqoop, which could be convenient for your analysis.
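To land the data straight into Hive, Sqoop's `--hive-import` flag can be added to the import. A minimal sketch; the connection URL, credentials, and table names below are placeholders, and the script only prints the assembled command so it can be reviewed before running:

```shell
#!/bin/sh
# Sketch: build a Sqoop import that loads directly into a Hive table.
# All connection details below are placeholders -- substitute your own.
ORACLE_URL="jdbc:oracle:thin:@//myhost:1521/db"

CMD="sqoop import \
 --connect $ORACLE_URL \
 --username xxx --password yyy \
 --table TBL \
 --hive-import --hive-table tbl \
 --num-mappers 8"

# Print the command for review; pipe to sh (or drop the echo) to run it.
echo "$CMD"
```

With `--hive-import`, Sqoop creates the Hive table (if needed) and moves the imported files into the Hive warehouse, so the data is queryable immediately after the import finishes.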
As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, as you are likely more familiar with SQL-like syntax. Pig is closer to pure relational algebra and its syntax is not SQL-like; it is largely a matter of preference, but both approaches should work fine.
Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.
In Hive you could run your query and tell it to write the results in HDFS:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
Cloudera released a Teradata connector for Sqoop last year, so you should take a look, as this sounds like exactly what you want. Here is how you would do it:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole pipeline is definitely doable in whatever time window you want; in the end, what will matter is the size of your cluster. If you need it to run faster, scale the cluster up. The good thing with Hive and Sqoop is that the processing will be distributed across your cluster, and you have total control over the schedule.
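To chain the three steps on a 15-minute schedule, a thin shell wrapper plus a cron entry is enough. A sketch, with all hosts, credentials, tables, and paths as placeholders; it defaults to a dry run (`RUN=echo` prints each command instead of executing it), and setting `RUN=` to an empty string runs the commands for real:

```shell
#!/bin/sh
# pipeline.sh -- sketch of the 15-minute Oracle -> HDFS -> Teradata batch.
# All hosts, credentials, tables, and paths below are placeholders.
# Cron entry (every 15 minutes):
#   */15 * * * * /path/to/pipeline.sh >> /var/log/pipeline.log 2>&1
set -e                   # abort the run if any step fails
RUN="${RUN-echo}"        # default: dry run that prints commands; RUN= executes

# Per-run directories so overlapping/failed runs don't clobber each other
STAMP=$(date +%Y%m%d%H%M)
IN_DIR="/data/in/$STAMP"
OUT_DIR="/data/out/$STAMP"

# 1. Oracle -> HDFS
$RUN sqoop import --connect jdbc:oracle:thin:@//myhost:1521/db \
  --username xxx --password yyy --table TBL --target-dir "$IN_DIR"

# 2. Process on HDFS with Hive
$RUN hive -e "insert overwrite directory '$OUT_DIR' select * from mytable"

# 3. HDFS -> Teradata
$RUN sqoop export --connect jdbc:teradata://tdhost/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx --table MY_DATA --export-dir "$OUT_DIR"
```

With `set -e`, a failed import stops the run before a half-processed dataset is exported; for a strict 15-minute cadence you would also want a lock file (e.g. via `flock`) so a slow run does not overlap the next one.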