Featured Post

Applying Email Validation to a JavaFX TextField Using Binding

This example uses the same controller as in a previous post but adds a use case to support email validation.  A Commons Validator object is ...

Friday, February 18, 2011

ETL Filter Patterns with Talend Open Studio

When you're writing ETL code in Talend Open Studio, there are a few places you can filter in relation to transformations: before, after, or during.

Most ETL programs require some type of filtering.  Filtering is the removal of records or data from an input stream.  In Talend Open Studio, you can filter in a tMap by placing an expression on the output schema.  Or, you can use the dedicated tFilterRow component.  Here are some guidelines for deciding what -- and where -- to use.

This post describes ETL filter placement using Patterns, a common design technique in the Java world.  A Pattern is an arrangement of code -- components, classes, etc. -- with a descriptive name known to all team members.  When a team member hears this descriptive name, he or she can easily visualize that segment of the design and how best to implement it.

For ETL filters, these Patterns are created in reference to a transformation, the tMap in Talend Open Studio.  Here is a listing.

  • Guard.  Filtering occurs before the transformation
  • Business Rule.  Filtering occurs after the transformation
  • Alternate Path.  Filter occurs as part of the transformation
The Patterns are not exclusive and may be used more than once.  For example, a job may have several Guard conditions, one for each input, followed up by several Business Rules, applied after the data is loaded but before another transformation.

Guard

The Guard Pattern removes records prior to transformation.  It's used when you want to prevent malformed data from entering the ETL system.  Examples include removing records missing required fields, having fields in the incorrect format, or fields that exceed the expected size.

In Talend Open Studio, the Guard Pattern is implemented using a tFilterRow before a tMap.

Talend Open Studio Job Implementing Guard Pattern
 Prior to a transformation (tMap), a tFilterRow is added after the input data source.  The tFilterRow contains one or more expressions based on the contents of the input.  Some possible expressions include a check for a required field, a bad numeric conversion, or a length that exceeds its expected value.

The tFilterRow spools the rejected records off to a file for analysis and possible re-transmit by the source system.

Business Rule

The Business Rule Pattern rejects records after a transformation.  Because the filter is placed after the transformation, it's expressions may be more complex.  The post-transformation filter has more data available to it and can validate across schemas.

An example of a Business Rule might be to reject a record that's not on the Approved Vendor List.  This rejection considers the input records (as in the Guard Pattern), but also considers another source.

 In Talend Open Studio, this is implemented by placing a tFilterRow after the tMap.

Talend Open Studio Job Implementing Business Rule
 The tMap component has two input sources.  The result of the tMap is intercepted by the tFilterRow prior to output.  The tMap component applies an expression that considers the main input value and the lookup input value, such as verifying that the lookup record is flagged as an Approved Vendor.

Because the data is already in the system, you may skip spooling the records to a reject file.  In this case, the filter isn't fending off bad data, but selectively processing the data.


Alternate Path

The Alternate Path Pattern (or "Alt Path") applies a filter during a transformation.  As the transformation prepares to output a record, it is checked against a condition and rejected.  Like the Business Rule, Alt Path has access to more than one source schema.

This is a common technique in integration tools like Pervasive Data Integrator that aren't pipeline-driven.

In Talend Open Studio, this is implemented using a single tMap which maps more than one output source.

Talend Open Studio Job Implementing Alt Path Pattern
It's important in the Alt Path Pattern to make sure that the expressions are exclusive.  In the Talend job displayed above, the expression for the ENR  flow must be the complement (not) of the ENR_Reject flow. This can be difficult if the expressions are long because the text box associated with the schemas in tMap are small.

These ETL filtering Patterns -- Guard, Business Rule, Alt Path -- aren't inviolable.  (You can always check for blank records after a transformation if needed.)  But Patterns help quickly share concepts among team members: "Create a job with a Guard for each source and a Business Rule for approved vendors."

3 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. The post outlines ETL filter strategies using Patterns in Talend Open Studio. Three Patterns are described: Guard, filtering before transformation to eliminate invalid data; Business Rule, filtering after transformation but before further processing; and Alternate Path, applying filters within the transformation. These Patterns can be used together and help maintain a clear and organized ETL design. In Talend, the Guard Pattern is executed using a tFilterRow preceding a tMap to remove undesirable records based on specific criteria, preventing flawed data from entering the system. Read blog to learn about the data analytics: " The Ultimate Data Analytics Training Course: Hands-On Learning and Real-World Applications.

    ReplyDelete