Featured Post

Applying Email Validation to a JavaFX TextField Using Binding

This example uses the same controller as in a previous post but adds a use case to support email validation.  A Commons Validator object is ...

Wednesday, February 2, 2011

Staging Data for ETL Processing with Talend Open Studio

For loading a set of files into a staging table with Talend Open Studio, use two subjobs: one subjob for clearing the tables for the overall job and one subjob for iterating over the files and loading each one.

In Talend Open Studio, you can apply processing directly to a spreadsheet in preparation for loading into a data mart.  However, it's often better to do this type of work in the database itself.  This is because of the flexibility and performance of running queries directly in the database process.  For example, rather than gathering surrogate keys outside of the database using tMap and lookups from a file processing loop, load the data into the database and use a query containing high-performance joins for additional transformations.

File Processing
 
Start by defining the main processing loop based off of tFileList.  This will feed into a tMap component.  The tMap component can provide some basic NULL-checking and data type validation, spooling any rejects off to reject table.

This is a Talend Open Studio job containing 3 subjobs that load a directory of spreadsheets into a MySQL table, leaving the results in place for additional processing.
Talend Open Studio - A Typical ETL Staging Job

While the tMysqlOutput components have an option to Clear or Truncate tables, we want to leave the records from each iteration in order to gain an economies of scale.  That is, load every data file and then process the entire set in one large batch rather than flushing the contents out with every iteration.  I use a single iteration of a tForeach (added a single element "once" to the list) to drive the two delete statements.

Rejecting Records
 
The tMap component contains logic for rejecting records.  An alternate implementation would put this logic into a component like tFilterRow so that the specific failure condition could be tracked.

Talend Open Studio - Map Only Valid Fields
The expression a the top of the tMap contains several Commons Lang calls to check the presence and data type of the fields.  If the criteria is not met, the record is not written out.  A second schema -- directed to the reject table -- contains the opposite expression.

Talend Open Studio - Mapping Reject
I used the Java not equal operator '!' to keep the two expressions, the main criteria and the reject criteria, in sync.  I took the REV expression and cut-and-pasted it into the REV_REJECT expression and wrapped the code with a !() block.  This is to make sure that there aren't any logical gaps.

tRules, available on Talend Exchange will also handle rejects using a Javascript syntax defining the filtering conditions.

Stage as String

The schemas for components like tFileInputExcel can use any simple data type available in Java: String, Integer, Date, Double.  I find it useful to stage everything using very loose schemas where each field is a String.  This has the following advantages

  1. Bring data into the system where it can be handled, and
  2. Allow for runtime processing related to Locales or conditional formatting.
For example, a field may be defined as a double and eventually stored as a double, but on input, the data could be rendered as "1,50" where the comma is a European-style decimal indicator.  If you rely on the automated type handling of a schema, this would produce incorrect results.  The loose schema will allow the field to be brought into the processing where it can be interrogated, filtered, parsed, and reported along with other validations.

Staging is used to load a database table where it is more convenient to analyze and process a record than a set of flat files.  Once in the database, performant queries can be run without suffering the network latency of having to move data in and out of the database process.  Talend Open Studio components provide the file processing and the data loading steps in this arrangement of subjobs.

2 comments:

  1. Hi All,
    I have two table with same column. When First column is updated with some Invoice No. I want to compare with second table and in the second table delete the record and update the second table with first table data based on the Invoice number.

    ReplyDelete
  2. I don't know how to deal with loading a set of files into a staging table with Talend Open Studio thank you for explaining a lot.

    ReplyDelete