Monday, April 11, 2011

Three Error Handling Strategies in Talend Open Studio

You can recover from some errors.  Others, like system or network failures are fatal.  But even in the fatal case, your Talend Open Studio job should die gracefully, notifying the operations team and leaving the data in a good state.  This post presents three error handling strategies for your Talend jobs.

Some Talend Open Studio job errors are alternate paths that, though infrequent, occur often enough to justify special programming. This programming may come in the form of guard conditions, special logic applied to route the special case to another subjob.  For an example of these type of errors, see this blog post on ETL Filter Patterns.

Other errors are related to system and network activity or are bugs.  There are a few ways to handle this class of error in Talend Open Studio.

Do Nothing

For simple jobs, say an automated administrative task, you can rely on the exception throwing of Talend Open Studio.  An example is a simple input to output job where a database failure in writing the output results in a system error.  This is expressed in the Run View as a red stack trace.

Simple Job with No Extra Error Handling Configured 
Subjob or Component Error Triggers

Each subjob and component has a return code that can drive additional processing.  The Subjob Ok/Error and Component Ok/Error can be used to steer the error toward an error handling routine like the tSendMail component.  This example looks for a connection error (the database is off) or a file processing error (the database is on, but the table name is wrong).

Both an individual subjob and a finer-grain component can be tested.  The screenshot shows two tSendMail routines being called from an OnSubjobError trigger.

Error Handling Tailored to the Subjob (or Component)
While testing the individual subjobs and components has the advantage of providing error handling tied to the specific case, there are disadvantages in maintenance and testing.  Maintenance suffers because the job  becomes cluttered with extra components which can confuse the normal processing, less frequent processing, and the error handling.  Testing is harder because there are more test cases.

Sometimes, there is a need for this level of detail.  You may want to send a file that represents an intermediate stage of processing via email.  This file isn't available throughout the job, and not every failure can handle this.

tAssertCatcher

A more general strategy is to define an error handling subjob to be performed when an error -- any error -- occurs.  This has the important advantage of consolidating the error handling, dramatically reducing testing.  It puts the burden of testing for error conditions on Talend (where it belongs).

To implement the general strategy, use the tAssertCatcher component which will be invoked whenever any component throws an error.

A Shared Error Handler with tAssertCatcher

If there's a failure in the XSL component (tXSLT) or other component resulting in a Java exception, the job will continue with the error handler (in this case a tLogRow) attached to the tAssertCatcher. tAssertCatcher can route an error message to other handlers like a tSendmail.

tAssertCatcher Config
Components like tXSLT don't need any additional configuration to use tAssertCatcher.  The tFileInputXML has a "Die on error" checkbox that needs to be set.

In the following screenshot, the database component tMSSqlOutput_1 has "Die on error" set.  If the flag is not set, then the tMSSqlOutput will print a message and the tAssertCatcher will not be called.  This particular example caught errors from the connection component (bad login) and the tMSSqlOutput component (DB-generated unique constraint violation and invalid insert of identity column).

An Example with Database Components


Let Talend Work

Handling system errors is different than alternate paths and conditions that arise during coding a Talend job.  Sometimes, you'll have a specific error routine for a specific system error condition.  But where possible, let Talend throw the system errors and catch them with a tAssertCatcher. 

55 comments:

  1. Excellent tutorial.

    I've tried applying this concept using a tOracleInput component with a OnSubJobOk trigger to a tMap component. The tMap component won't seem to accept a trigger conditional, forcing me to divide my logic to 2 steps (checking first for a connection using tOracleConnection, then running my query using the tOracleInput).

    Could you explain why you can't run a trigger directly to a tMap? Is there another way to consolidate this error checking on a query without having to have an explicit connection component?

    Thanks!

    ReplyDelete
  2. Hi,

    tMap needs an input component like tOracleInput to provide it with operands. This is the flow of records from a query.

    A tOracleInput can include connection information. However, most of the jobs that I write use a separate tOracleConnection. This is so that the connection can be applied to all of the components in the job: tOracleOutput, other tOracleInputs. The configuration, number of connections, and properties like auto-commit are controlled centrally. It's rare that I have only a single DB input in a job.

    To handle a tOracleConnection error, connect your error handling code using an OnSubjobError trigger. Alternatively, a tLogCatcher can be configured to accept Java Exceptions which will be thrown if the tOracleConnection fails.

    Good luck.

    ReplyDelete
  3. Thanks, this has been a huge help!

    ReplyDelete
  4. Hi,

    Can you tell us if it's possible to resume from an exception ?
    Let's say I have an infinite loop that calls a tFTPPut every 30 minutes, if there is a connection problem the job will stop by throwing an exception, I'm trying to make it goo back to the loop and try again until the FTP server is available.

    Thanks

    ReplyDelete
  5. Have you tried checking the exit status of the tFTPPut component and building a loop from that? Prior to the tFTPPut component, set a flag "NumRetries". Direct the OnComponentError or OnSubjobError connector to a tJava component that will increment this variable.

    To avoid a long-running problem, define a context variable 'MaxRetries'. Before calling the tFTPPut, check this variable with a Run If connection or other component.

    If the exception is still being thrown, make sure there are not catchers registered for "Java Exception".

    ReplyDelete
  6. I developed my own sendmail component, but I cannot connect other component to it in "on component ok". Could you please help me?

    ReplyDelete
    Replies
    1. Hi,

      Can you verify that you have these CONNECTOR elements in your XML config?

      Delete
  7. I used the third method (A more general strategy is to define an error handling subjob to be performed when an error -- any error -- occurs ), even the job runs successfully with out any error on any components the tdie is executing, how do I avoid it?
    thanks in advance

    ReplyDelete
    Replies
    1. Hi,

      tDie is the wrong component. Try tAssertCatcher. I'm correcting the post with an updated screenshot.

      Delete
    2. thanks for your reply Carl, will try and let you know.

      Delete
    3. No problem. The section was updated with an example that uses tAssertCatcher. I verified that the job doesn't invoke the error handler if there are no errors and that the error handler is called if there is an error in the tXSLT and the tFileInputXML components.

      See the note about setting "Die on error" if available in a component.

      Delete
    4. thanks Carl, I saw the update, however I use lot of database related components such as toralceinput, toracleout, toracleSCD, I tested with tAssertCatcher and it doesnt pick up error like "cannot insert NULL" or anything that happens in that database components, but however it picks up errors like invalid objects names etc.. How do I handle some of the error that occurs using the database components. thanks in advance.

      Delete
    5. Are the "Die on error" settings on the tOracleOutput and tOracleSCD (Basic settings) checked?

      Delete
    6. Carl, I havent checked it, will try and see. thanks.

      Delete
    7. Carl, I didnt get a chance to test it yet, i was using the tLogCatcher before and today I came to know about tAssertCatcher, so what is the difference between them. thanks.

      Delete
    8. The difference between tLogCatcher and tAssertCatcher is the schema used to communicate the errors in the project.

      tLogCatcher uses moment, pid, root_pid, father_pid, project, job, context, priority, type, orgin, message, and code.

      tLogCatcher will also capture tWarn messages.

      tAssertCatcher uses moment, pid, project, log, language, origin, status, substatus, and description.

      Both handle tDie messages although tAssertCatcher calls these tAssert messages.

      Delete
    9. thanks Carl.

      Delete
    10. Carl, I tried checking the "Die on Error" and it sends the error through tAssertCatcher and I have a sendmail component, but I know the "Die on Error" stops the process, but how do I continue the process just getting the error messages. thanks.

      Delete
    11. Use a Run If connector routed to a tWarn to throw an error to a tLogcatcher. For more details on this, search the blog for a post called "Handling Database Warnings" I just put up (May 7, 2012). There's a video too. Be sure to Like and +1 anything you found useful.

      Delete
    12. thanks Carl, will check it out.

      Delete
    13. Carl, I see in the RunIF you check for number of lines inserted == 0, what do I need to do, if it partially inserts rows and warns later or for example as I asked before what if it tries to insert null values to a not null column etc., what global variables do I need to use in the RUN IF? thanks.

      Delete
    14. If you're able to turn off batching, you can open up a rejects stream which will give you the error that resulted in a failed insert (not null, unique, etc).

      In that case, you'd use _REJECTED > 0. However, if you need to have batching, say for performance reasons, continue with _INSERT.

      You can always do the null checks ahead of time with a tFilterRow or similar component.

      Check the "Handling Database Warnings" post for an update (section "Getting More Information").

      Delete
    15. ok thanks Carl, will check there

      Delete
    16. Hi Carl, not sure if you have a post for handling using the same table for multiple lookups without using the same table multiple times, for example, take a Date Dimension table, if I need to reslove keys for the Birthdate, Hiredate, TerminationDate from the Date Dimension table, currently I using the same table 3 times for resoving keys for the 3 fields, how do I overcome this by looking up the table one time and reslove all the 3 keys, please let me know. I am using 4.2.3r.....version of Talend. thank you.

      Delete
    17. Hi,

      Take a look at a post from May 14, 2012 called "Efficient Lookups with Talend Open Studio's Hash Components" on this blog.

      Delete
  8. thanks Carl, was busy, will take a look now.

    ReplyDelete
  9. Hi Carl, I posted a question on "Efficient Lookups with Talend Open Studio's Hash Components" about how to clear the thashinput/output from the memory, please take a look and let me know. thanks.

    ReplyDelete
  10. Hi Experts,

    My talend job reads some host names from a table and then I am using the tSSH component to get the df -K command outputs for those hosts. In a case where talend is not able to connect to a host I want to log a warning and resume the command execution with the next record. How do I achieve that?

    Cheers!

    Chanaka

    ReplyDelete
    Replies
    1. Hi,

      Take a look at the May 22, 2012 post "Your Own Exception Handler for Standard Talend Components" on this blog.

      Delete
  11. Hi Carl,
    thanks for excellent blog!

    I have a problem related to exception handling on subjobs. My job has many subjobs (up to 4 levels down) and i'm trying to establish a single exception handler in the main job. I've tried different strategies but something fails - maybe you can suggest some general principles to avoid typical mistakes?

    Regards,
    Martins

    ReplyDelete
    Replies
    1. I thought not subjobs but child jobs.

      My goal is to have one place for all warning and error handling - to write them to database and more dangerous ones to report through e-mail.

      One of problem is that when I put a tWarn in a child job (and it executes) and a tLogCatcher - all works fine. But if try to catch that tWarn message in parent job - it doesn't work. The same with tDie.

      The main question is - is it possible at all to catch child job errors and messages in parent job?

      Delete
    2. Hi Martins,

      Everything in Talend becomes a Java class, so I tried a few combinations of child jobs, tAssertCatcher, and re-throws ("throw new Exception()"). I didn't find a solution worth posting.

      There's definitely a need to break jobs into child jobs (.class files size, reusability). However, I'd try to eliminate one of the levels on the scheduler side by finding a job scheduler that supports processing chains (sometimes called "job streams").

      Maybe there's a refactoring of your jobs that you can do to eliminate another layer? Something like grouping all of the lower-level child jobs that behave in a similar manner together if they're separated by different use cases today.

      Delete
    3. Thanks, Carl,
      I chose a simple workaround - using copy-paste I put in each childjob a tLogCather component connected to tRunjob which calls another child job that does the necessary work.

      Delete
  12. Hi Carl, I have a problem with the tOracleoutput component, where I set the action on table to be drop and create, first time it runs fine but later on the action on table is not there and see only the action on data option in the basis settings, wondering why it is changing, not sure what I am missing do you have any blog on this I can take a look. thanks in advance.

    John

    ReplyDelete
    Replies
    1. Hi John,

      I haven't run into your particular problem with tOracleOutput. The Oracle work I did with Talend used tables that were already created. In general, I create the tables with scripts rather than the Talend "Action on table" settings.

      This post may be of interest as it describes working with the DDL generated from the "Action on table" settings. Different DB, though. http://bekwam.blogspot.com/2011/10/controlling-ddl-behind-talend-open.html

      Check the JIRA on Talend for your particular version.

      Good luck.

      Delete
    2. thanks Carl, will take a look.

      John

      Delete
  13. Hi Carl, I just posted a question in the following link

    http://bekwam.blogspot.com/2011/05/procedural-update-in-talend-open-studio.html?showComment=1341891554898#c3966067130247382279

    just sending this as a backup, could you please take a look and let me know there is way to handle that requirement in Talend.

    thanks
    John

    ReplyDelete
    Replies
    1. There is...Take a look at http://bekwam.blogspot.com/2012/07/theta-join-with-talend-open-studio.html.

      Delete
  14. Hello Carl!
    I have a service installed in windows which fires commands every minute on a user made terminal to fetch data from TPF window into database. In case this service stops, database is no longer updated.
    How can I use Talend to notify authority via email that the service has stopped working?

    thanks in advance!

    ReplyDelete
    Replies
    1. Hi Hitesh,

      Job scheduling software like Control-M will keep track of your jobs, and if one isn't run, an alert can be generated.

      If you're not using this type of software with Talend, you'll need to provide your own mechanism. I see a lot of batch jobs -- Talend included -- using a polling process to check to see whether something has run. The Talend job writes out or updates a record with a process status. This process status is updated at the end of the job.

      A polling job is run that looks at this record, examining the process status, last run date, and other information recorded by Talend.

      Delete
  15. Hi Carl
    I am really impressed with your answers. Much appreciated.

    I am new to talend and I am working on an enhancement to fail the entire job when there is first error (whatever be the error). Cos the downstream system is heavily dependentant on previous job's data.

    So my requirement is --> I have a workflow in which 10 jobs are attached most are sequential load and only few are parallel. After going through all these threads, I checked on Die on error for all targets and added a joblet which has tlogcatcher and sendmail. Is this sufficient? or do i need to include tDie at every job?

    Say out of 10 sequential job, 2nd one has failed, then entire job should just STOP at 2nd one and do nothing than sending an email from failure_job_alert joblet. Please help me with this.

    ReplyDelete
  16. I don't use Enterprise (joblets), but for TOS, I catch the errors using tLogCatcher and route the data to an error handler (email, etc) in the calling job.

    ReplyDelete
    Replies
    1. so when we use tlogcatcher does it fail the entire job?

      Delete
    2. tDie at the end of the tLogCatcher subjob.

      Delete
  17. Hi Carl, this question is related to CLOB datatype in oracle, my source and destination is oracle and one of the comment field is CLOB and using Talend to move the data, I used the tConvertType component which changed to the type to Object, but I am getting error, Please let me know if you have any experience with this datatype.

    thanks
    John

    ReplyDelete
    Replies
    1. Hi John,

      Have you seen the CLOB-related posts on the blog? At the top of the page type 'clob' in the search.

      Here's one on working with XMLType, a CLOB-like Oracle data type.

      http://bekwam.blogspot.com/2011/12/exporting-oracle-xml-to-file-with.html

      And another one: http://bekwam.blogspot.com/2011/05/large-chunks-of-text-in-talend-open.html

      There are a few other hits.

      Delete
  18. thanks Carl, I didnt notice it, will take a look.

    John

    ReplyDelete
  19. Thanks curl your error handling systems are really useful to me..

    ReplyDelete
  20. Hi, I am new to Talend Studio. I designed a job to transform data from phpmysql to phpMysql itself. The following error occurs:

    java.net.ConnectException: connect: Address is invalid on local machine, or port is not valid on remote machine
    ......
    Exception in component tMysqlOutput_1
    java.lang.NullPointerException
    at routines.system.RunStat.sendMessages(RunStat.java:286)
    [statistics] connecting to socket on port 4041

    However the port 4041 in the console error is shown differently at different turns of run. The actual port is 3306. What could be the reason for this connectivity problem?Is this because of installation error?Please help.Thanks.

    ReplyDelete
    Replies
    1. Can you compare your Talend MySQL job with the job shown in the video? I apologize for the sound quality; my later videos are better. http://www.youtube.com/watch?v=5L1AtlPqsVg

      Delete
  21. frnds i want to send Email alert to my Team mates when my MDMconnections goes down...in ETL talend..how can i do this ..??? im using oracle DB
    pls help me thanks

    ReplyDelete
    Replies
    1. Have you tried a tSendMail component as demonstrated in this post? http://bekwam.blogspot.com/2012/05/handling-database-warnings-with-talend.html

      Delete
  22. Is there a way to identify the log files that are generated by Talend Administrator ?
    I would like to get those logs and email. Usually, I download the logs from the administrator from job execution monitoring. I would like to automate and get these logs in email.
    Thanks in advance!!

    ReplyDelete
    Replies
    1. Have you tried writing a job or joblet that uses the tSendMail component to send the files as an attachment?

      Delete
  23. Hello,

    I have an issue with a rather complex job, and no idea on how to catch the error. I would be thankful if you had a suggestion.
    I have roughly 200 remote database I'm trying to call and retrieve data from.
    However I sometimes get a
    "Exception in component tMSSqlInput
    java.sql.SQLException: Network error IOException: Connection timed out (...)"

    This kills the job ( taking the other threads and the next with him ) resulting in much less data retrieved than actually available.

    I would like to catch that error and kill only the one thread it concerns, or any other idea that would help the job to fallback gracefully.

    Any hint?

    Regards.

    ReplyDelete