Most ETL programs require some type of filtering. Filtering is the removal of records or data from an input stream. In Talend Open Studio, you can filter in a tMap by placing an expression on the output schema. Or, you can use the dedicated tFilterRow component. Here are some guidelines for deciding what -- and where -- to use.
This post describes ETL filter placement using Patterns, a common design technique in the Java world. A Pattern is an arrangement of code -- components, classes, etc. -- with a descriptive name known to all team members. When a team member hears this descriptive name, he or she can easily visualize that segment of the design and how best to implement it.
For ETL filters, these Patterns are created in reference to a transformation, the tMap in Talend Open Studio. Here is a listing.
- Guard. Filtering occurs before the transformation
- Business Rule. Filtering occurs after the transformation
- Alternate Path. Filter occurs as part of the transformation
The Guard Pattern removes records prior to transformation. It's used when you want to prevent malformed data from entering the ETL system. Examples include removing records missing required fields, having fields in the incorrect format, or fields that exceed the expected size.
In Talend Open Studio, the Guard Pattern is implemented using a tFilterRow before a tMap.
|Talend Open Studio Job Implementing Guard Pattern|
The tFilterRow spools the rejected records off to a file for analysis and possible re-transmit by the source system.
The Business Rule Pattern rejects records after a transformation. Because the filter is placed after the transformation, it's expressions may be more complex. The post-transformation filter has more data available to it and can validate across schemas.
An example of a Business Rule might be to reject a record that's not on the Approved Vendor List. This rejection considers the input records (as in the Guard Pattern), but also considers another source.
In Talend Open Studio, this is implemented by placing a tFilterRow after the tMap.
|Talend Open Studio Job Implementing Business Rule|
Because the data is already in the system, you may skip spooling the records to a reject file. In this case, the filter isn't fending off bad data, but selectively processing the data.
The Alternate Path Pattern (or "Alt Path") applies a filter during a transformation. As the transformation prepares to output a record, it is checked against a condition and rejected. Like the Business Rule, Alt Path has access to more than one source schema.
This is a common technique in integration tools like Pervasive Data Integrator that aren't pipeline-driven.
In Talend Open Studio, this is implemented using a single tMap which maps more than one output source.
|Talend Open Studio Job Implementing Alt Path Pattern|
These ETL filtering Patterns -- Guard, Business Rule, Alt Path -- aren't inviolable. (You can always check for blank records after a transformation if needed.) But Patterns help quickly share concepts among team members: "Create a job with a Guard for each source and a Business Rule for approved vendors."