Using Pentaho Kettle, how do I load multiple tables from a single table while keeping referential integrity?

blunders · Nov 14, 2010 · Viewed 14k times

I need to load data from a single file with 100,000+ records into multiple tables on MySQL while maintaining the relationships defined in the file/tables; that is, the relationships already match. The solution should work on the latest version of MySQL and must use the InnoDB engine; MyISAM does not support foreign keys.

I am completely new to using Pentaho Data Integration (aka Kettle), and any pointers would be appreciated.

I might add that it is a requirement that the foreign key constraints are NOT disabled. My understanding is that if something violates the database's referential integrity while the checks are off, MySQL will not re-verify referential integrity when the foreign key constraints are turned back on. SOURCE: 5.1.4. Server System Variables -- foreign_key_checks
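For illustration, the pitfall can be reproduced in miniature with SQLite, whose `PRAGMA foreign_keys` behaves like MySQL's `foreign_key_checks` in this respect (a sketch for demonstration only; the table names are made up, and this is not MySQL itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Foreign key enforcement is off (SQLite's default, like SET foreign_key_checks = 0)
cur.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE child (id INTEGER PRIMARY KEY, "
            "parent_id INTEGER REFERENCES parent(id))")
cur.execute("INSERT INTO child VALUES (1, 999)")  # orphan row: parent 999 does not exist
conn.commit()

# Turning enforcement back on does NOT re-validate existing rows;
# the orphan silently survives and only an explicit check reveals it.
cur.execute("PRAGMA foreign_keys = ON")
orphans = cur.execute("PRAGMA foreign_key_check").fetchall()
print(orphans)  # one violation reported for table 'child'
```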

All approaches should include some form of validation and a rollback strategy should an insert fail or fail to maintain referential integrity.
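To make "validation plus rollback" concrete: wrap the whole batch in one transaction and roll everything back if any row fails. This toy example uses Python's sqlite3 (not Kettle or MySQL) purely to illustrate the pattern, with a deliberately invalid foreign key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE Employee2Office (
    employee_id INTEGER NOT NULL REFERENCES Employee(id),
    office_id INTEGER NOT NULL)""")

try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("INSERT INTO Employee VALUES (?, ?)", (1, "John Smith"))
        # this violates the FK: employee 42 was never inserted
        conn.execute("INSERT INTO Employee2Office VALUES (42, 1)")
except sqlite3.IntegrityError:
    pass  # the whole batch was rolled back

count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(count)  # 0 -- the valid insert was rolled back along with the bad one
```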

Again, I'm completely new to this and doing my best to provide as much information as possible. If you have any questions or requests for clarification, just let me know.

If you are able to post the XML from the kjb and ktr files (jobs/transformations) that would be SUPER. I might even hunt down every comment/answer you've ever made anywhere and upvote them... :-) ...really, it's really important to me to find an answer for this.

Thanks!


SAMPLE DATA: To better elaborate with an example, let's assume I am trying to load a tab-separated file containing employee names, the offices they have occupied in the past, and their job title history.

File:

EmployeeName<tab>OfficeHistory<tab>JobLevelHistory
John Smith<tab>501<tab>Engineer
John Smith<tab>601<tab>Senior Engineer
John Smith<tab>701<tab>Manager
Alex Button<tab>601<tab>Senior Assistant
Alex Button<tab>454<tab>Manager

NOTE: The single-table database is completely normalized (as much as a single table can be) -- for example, in the case of "John Smith" there is only one John Smith, meaning there are no duplicates that would lead to conflicts in referential integrity.

The MyOffice database schema has the following tables:

Employee (nId, name)
Office (nId, number)
JobTitle (nId, titleName)
Employee2Office (nEmpID, nOfficeId)
Employee2JobTitle (nEmpId, nJobTitleID)

So in this case, the tables should look like:

Employee
1 John Smith
2 Alex Button

Office
1 501
2 601
3 701
4 454

JobTitle
1 Engineer
2 Senior Engineer
3 Manager
4 Senior Assistant

Employee2Office
1 1
1 2
1 3
2 2
2 4

Employee2JobTitle
1 1
1 2
1 3
2 4
2 3

Here's the MySQL DDL to create the database and tables:

create database MyOffice2;

use MyOffice2;

CREATE TABLE Employee (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  name CHAR(50) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE Office (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  office_number INT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE JobTitle (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  title CHAR(30) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE Employee2JobTitle (
  employee_id MEDIUMINT NOT NULL,
  job_title_id MEDIUMINT NOT NULL,
  FOREIGN KEY (employee_id) REFERENCES Employee(id),
  FOREIGN KEY (job_title_id) REFERENCES JobTitle(id),
  PRIMARY KEY (employee_id, job_title_id)
) ENGINE=InnoDB;

CREATE TABLE Employee2Office (
  employee_id MEDIUMINT NOT NULL,
  office_id MEDIUMINT NOT NULL,
  FOREIGN KEY (employee_id) REFERENCES Employee(id),
  FOREIGN KEY (office_id) REFERENCES Office(id),
  PRIMARY KEY (employee_id, office_id)
) ENGINE=InnoDB;
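For reference, the lookup-then-insert loading logic that a Kettle transformation performs over this schema can be sketched as a plain script. This is an illustrative sketch only, using SQLite in place of MySQL and simplified DDL; the table and column names mirror the schema above:

```python
import csv
import io
import sqlite3

# Inline copy of the sample tab-separated data from the question
data = """EmployeeName\tOfficeHistory\tJobLevelHistory
John Smith\t501\tEngineer
John Smith\t601\tSenior Engineer
John Smith\t701\tManager
Alex Button\t601\tSenior Assistant
Alex Button\t454\tManager
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE Office (id INTEGER PRIMARY KEY, office_number INTEGER UNIQUE NOT NULL);
CREATE TABLE JobTitle (id INTEGER PRIMARY KEY, title TEXT UNIQUE NOT NULL);
CREATE TABLE Employee2Office (
  employee_id INTEGER REFERENCES Employee(id),
  office_id INTEGER REFERENCES Office(id),
  PRIMARY KEY (employee_id, office_id));
CREATE TABLE Employee2JobTitle (
  employee_id INTEGER REFERENCES Employee(id),
  job_title_id INTEGER REFERENCES JobTitle(id),
  PRIMARY KEY (employee_id, job_title_id));
""")

def lookup_or_insert(table, column, value):
    """Return the surrogate key for value, inserting a new row if it is unseen
    (the same idea as Kettle's Combination lookup/update step)."""
    row = conn.execute(f"SELECT id FROM {table} WHERE {column} = ?", (value,)).fetchone()
    if row:
        return row[0]
    return conn.execute(f"INSERT INTO {table} ({column}) VALUES (?)", (value,)).lastrowid

with conn:  # one transaction: everything rolls back if any insert fails
    for name, office, title in csv.reader(io.StringIO(data), delimiter="\t"):
        if name == "EmployeeName":
            continue  # skip the header row
        emp_id = lookup_or_insert("Employee", "name", name)
        off_id = lookup_or_insert("Office", "office_number", int(office))
        title_id = lookup_or_insert("JobTitle", "title", title)
        # OR IGNORE drops duplicate pairs instead of aborting the load
        conn.execute("INSERT OR IGNORE INTO Employee2Office VALUES (?, ?)", (emp_id, off_id))
        conn.execute("INSERT OR IGNORE INTO Employee2JobTitle VALUES (?, ?)", (emp_id, title_id))

print(conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0])  # 2 employees
```

Running this yields the expected contents shown above: 2 employees, 4 offices, 4 job titles, and 5 rows in each join table.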



My Notes in Response to Selected Answer:

PREP:

  1. Using the sample data, create a CSV by changing <TAB> to comma delimited.
  2. Install MySQL and create the sample database using the MySQL DDL above.
  3. Install Kettle (it's Java-based and will run on anything that runs Java).
  4. Download the KTR file.

Dataflow by Step: (My Notes)

  1. Open the KTR file in Kettle, double-click the "CSV file input" step, and browse to the CSV file you created. The delimiter should already be set to comma. Then click OK.
  2. Double-click "Insert Employees", select the DB connector, then follow these directions on Creating a New Database Connection.

Answer

rwilliams · Nov 14, 2010

I put together a sample transformation (right-click and choose Save Link) based on what you provided. The only steps I feel a bit uncertain about are the last table inserts. I'm basically writing the join data to the table and letting it fail if a specific relationship already exists.

note:

This solution doesn't really meet the "All approaches should include some form of validation and a rollback strategy should an insert fail, or fail to maintain referential integrity" criterion, though it probably won't fail. If you really want to set up something more complex we can, but this should definitely get you going with these transformations.


Dataflow by Step

1. We start by reading in your file. In my case I converted it to CSV, but tab-separated is fine too.

2. Now we insert the employee names into the Employee table using a Combination lookup/update step. After the insert we append the employee id to our datastream as id and remove EmployeeName from the stream.


3. Here we're just using a Select Values step to rename the id field to employee_id.

4. Insert job titles just as we did employees, appending the title id to our datastream and deleting JobLevelHistory from it.


5. Simple rename of the title id to title_id (see step 3).

6. Insert offices, get id's, remove OfficeHistory from the stream.


7. Simple rename of the office id to office_id (see step 3).


8. Copy the data from the last step into two streams, carrying the fields employee_id,office_id and employee_id,title_id respectively.


9. Use a table insert to write the join data. I've set it to ignore insert errors, since there could be duplicates and the PK constraints will make some rows fail.
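The "ignore insert errors" option acts like MySQL's INSERT IGNORE: duplicate-key rows are silently dropped instead of aborting the load. A minimal illustration of the idea, using SQLite's equivalent INSERT OR IGNORE (a sketch, not the Kettle step itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee2Office (
    employee_id INTEGER NOT NULL,
    office_id INTEGER NOT NULL,
    PRIMARY KEY (employee_id, office_id))""")

pairs = [(1, 1), (1, 2), (1, 1)]  # the third pair duplicates the first
for p in pairs:
    # OR IGNORE mirrors "ignore insert errors": the duplicate is skipped, not fatal
    conn.execute("INSERT OR IGNORE INTO Employee2Office VALUES (?, ?)", p)

rows = conn.execute("SELECT * FROM Employee2Office ORDER BY office_id").fetchall()
print(rows)  # [(1, 1), (1, 2)]
```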

Output Tables

(Screenshots of the populated Employee, Office, JobTitle, Employee2Office, and Employee2JobTitle tables.)