Saturday, January 17, 2015

Using Talend tGroovy to Verify a File Set

Generally speaking, prefer Talend components to code when creating jobs.  Talend components can be configured in the Talend Open Studio Component Tab whereas configuring or parameterizing code must be done indirectly through Context Variables or the globalMap.

However, there's a case for preferring code when the number of components for a simple task starts to become unwieldy and affects readability.  Verifying a single file with a tFileExist is clear in its intent and functionality.  Verifying more than one file with multiple tFileExists presents readability problems.  Significant Talend canvas real estate becomes cluttered with many components that taken together, perform a simple task.

In this case, I like to use code to reduce the clutter.  This blog post shows how to do that with tGroovy, a Talend component that runs Groovy, a Java-like scripting language that is more forgiving for beginners (and experts).

Using Components

This job expects 3 files to be present before processing.  Otherwise, the job will immediately die.  I want to do this check up front rather than start processing and find out in the middle of the program that I'm missing a critical file.  For each required file, I've added a tFileExist / tDie pair.  If the file exists, then a Run If allows the program to proceed vertically, ending with the "Output a Message" tJava.  If a file is not found, the Run If for the missing file evaluates to "true" and the corresponding tDie is called.  The tDie aborts the program.

One tFileExist Per Required file
This job could make use of a tForeach.  That way, I don't have to continually add tFileExist components for each file that needs to be verified.  The tForeach contains a list of file names.  CURRENT_VALUE is fed to the tFileExist which may lead to a tDie.  If no tDies were called, then the final tJava is called.

Loop to Reduce Number of Components


If you're a Java developer, you can write a Talend Routine that checks an input list of file names and returns a "missingFile" value.  If the return value is empty, then all the files exist.  Otherwise, the non-empty return value is the name of the missing file.  The Talend Routine is ideal because it can be reused across jobs.
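The routine described above might look something like the following sketch.  The class name FileSetRoutines and the method name findMissingFile are hypothetical; the routine itself isn't shown in this post.  Inside Talend, a routine is typically a public class in the routines package, which is left out here so the example compiles on its own.

```java
import java.io.File;
import java.util.List;

// Hypothetical routine class, written as a plain Java class for illustration.
final class FileSetRoutines {

    private FileSetRoutines() {
        // static helpers only
    }

    /**
     * Returns the first file name that does not exist on disk,
     * or an empty string when every file is present.
     */
    static String findMissingFile(List<String> fileNames) {
        for (String fileName : fileNames) {
            if (!new File(fileName).exists()) {
                return fileName;
            }
        }
        return "";
    }
}
```

From a tJava, the call could be FileSetRoutines.findMissingFile(java.util.Arrays.asList(context.file1, context.file2, context.file3)), where the context variable names are assumptions.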

But if you're not a Java developer, you may be looking for an easier-to-program solution.  Groovy is much more forgiving in terms of its syntax.  The Groovy version of the job reduces the number of components to 3 (down from 4 in the Loop version).

A Script Reduces the Number of Components
This job is easy to read.  A single component is labeled "Check All Files" so future maintainers will be clear on its intent.  In my experience, the file check isn't likely to be a maintenance problem, so get the reader past it as soon as possible.

Script Configuration

The configuration of the tGroovy component starts with the passing of parameters.  From the calling job, I pass in the globalMap object and map it to a Groovy variable "gm".  globalMap provides the mechanism by which the job determines whether or not a file is missing.  The other parameters pass in the file names.

The Groovy Program and its Parameters
The script is presented below.  A list "files" is formed from the 3 input variables: file1, file2, file3.  A loop is set up on this list.  A File object is created for each item.  If the file does not exist, a map item "missingFile" is added to the globalMap (where it can be used by callers) and the script stops on the "break".

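def files = [ file1, file2, file3 ]
for( fn in files ) {
 // wrap each file name so its existence can be tested
 java.io.File f = new java.io.File(fn);

 if( !f.exists() ) {
  // record the first missing file for the calling job, then stop looking
  gm.put( "missingFile", fn );
  break;
 }
}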

The Run If leading into the tDie checks to see if the "missingFile" item is now stored in the globalMap, that is, whether it is not null.  If so, the tDie is called.

"missingFile" is in globalMap
The Run If that continues processing evaluates to true if "missingFile" was not found.

"missingFile" not in globalMap

The tDie component uses the "missingFile" item, not as a flag, but as a way to pull out the error message for the user from the tGroovy component.

tDie Prints Results from tGroovy
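A Run If condition in Talend is a Java boolean expression, so the two triggers and the tDie message can be sketched in plain Java.  The class and method names below are hypothetical, a HashMap stands in for globalMap, and the message wording is only illustrative; the exact expressions in the job aren't reproduced in this post.

```java
import java.util.Map;

// Stand-in for the expressions a job could hold in its Run If triggers and tDie.
final class MissingFileConditions {

    private MissingFileConditions() {
        // static helpers only
    }

    // Run If leading to the tDie: true when the script recorded a missing file.
    static boolean shouldDie(Map<String, Object> globalMap) {
        return globalMap.get("missingFile") != null;
    }

    // Run If that continues processing: true when nothing was recorded.
    static boolean shouldContinue(Map<String, Object> globalMap) {
        return globalMap.get("missingFile") == null;
    }

    // Text a tDie could print, built from the value the script stored.
    static String dieMessage(Map<String, Object> globalMap) {
        return "Required file not found: " + (String) globalMap.get("missingFile");
    }
}
```

In the job itself, only the expression bodies would be entered, for example globalMap.get("missingFile") != null in the Run If that leads to the tDie.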


Consider writing small scripts to replace Talend components for readability.  Don't abuse this; Talend components are still preferred.  However, if the job becomes heavy with components that aren't essential to understanding it, it may be worth replacing them with a script.

While loops and logic are important for your development and testing, I consider the really important parts of the job to be the data flows.  Left unchecked, jobs can become 75% about framing the input or error handling instead of a key RDBMS to Web Service flow.  The next guy maintaining your Talend job is more likely to need to find and fix a tMysqlInput query than file handling code.
