Featured Post

Applying Email Validation to a JavaFX TextField Using Binding

This example uses the same controller as in a previous post but adds a use case to support email validation.  A Commons Validator object is ...

Thursday, May 26, 2011

Procedural Update in Talend Open Studio

For optimal performance, have the database do most of the work when updating.  Sometimes, this won't be feasible and you'll need to create loops and parameterize queries in a Talend Open Studio job.

SQL Update or Delete

To execute a single SQL UPDATE or DELETE statement, use the t*Row components in the Databases family: tOracleRow, tPostgresqlRow, tMysqlRow.  In the Query box in "Basic settings", enter a SQL string joined with variables taken from the globalMap of the job.

The t*Row components can be called as standalone subjobs, in an iteration, or as the target of a flow.

Procedural Coding

When it isn't practical to render a data transformation in SQL, use a procedure to more cleanly express the algorithm.  In Talend Open Studio, this is implemented using components like tLoop and tFlowToIterate.  These component break a data flow up -- say from an Access datasource -- into distinct iterations.  Each iteration can drive additional processing like keeping a counter of the records processed in a current iteration.

In this post, a table "UpdateTestSourceTable" is loaded with 5 records.  Each record contains a RECORD_ID which is a business record key that is not unique.  An auto-increment DB_RECORD_ID is used for uniqueness.  SEC_ROW is a column that defines ordering within a RECORD_ID.  Initially, this is empty.  The job in this post will set the value as in the screenshot.

Data Used in Example

This job implements the following procedural algorithm.  RECORD_ID is a business record key with meaning, but is not unique in the table.  DB_RECORD_ID is a surrogate key that guarantees uniqueness.

Get a list of distinct RECORD_IDS from a table
For each RECORD_ID 
    Initialize a COUNTER variable 
    Find a list of records matching RECORD_IDS (Get "DB_RECORD_ID")
    Map the fields
    Increment the COUNTER variable
    Update the table using COUNTER constrained by DB_RECORD_ID

The job uses a tAccessInput "Get Rec Ids" to start the processing.  tAccessInput goes into a tFlowToIterate.  For each iteration, the next five components are called.
Job Loading Records by Iteration

 Lookup Rec is a second tAccessInput that runs a query returning the distinct DB_RECORD_IDs.  It uses row1.record_id as a parameter.  Here is the SQL with the parameter.

SQL Parameterized with a tFlowToIterate's Variable
 The Update component is a tAccessOutput which has "Action on data" set to Update.  The schema is a subset of UpdateTestSourceTable: only db_record_id (key) and seq_row.

For learning about how to set up a counter in a Talend job, read Running Count in Talend Open Studio. "Set Sec Num" is set to 1 at the stat of each RECORD_ID iteration; this sets the variable "counter".  The code behind the tJavaRow increments the counter and maps each input field to the output.

output_row.db_record_id = input_row.db_record_id;
output_row.sec_row = input_row.sec_row;

int counter = ((Integer)globalMap.get("counter"));
globalMap.put("counter", ++counter);


For a database like Oracle RAC, nothing beats a set-based operation like MERGE.  However, it may be too difficult if not impossible to render a particular requirement in SQL.  In these cases, use the procedural facilities (tLoop, etc.) in Talend Open Studio to better interpret the algorithm.  Watch for performance!  If you do too much work in a single iteration, your job might not scale.

Additional Screen Shots

The following screen shots are for (in order) the configuration of "Get Rec Ids", "Lookup Rec", and "Update".

tAccessInput "Get Rec Ids" Config

tAccessInput "Lookup Rec" Config
 See the above screenshot "SQL Parameterized" for the embedded SQL statement in "Lookup Rec".

tAccessOutput "Update" Config
  The following is the tMap used in the job.

tMap Config
Excel Input (not Database) Walkthrough

In most cases, it's best to get the data into an RDBMS as quickly as possible and leave the flat files -- including Excel files -- saved for historical records.  The RDBMS gives the ultimate in flexibility and performance.  However, it is possible to do complex processing using Talend components like tAggregateRow or tFilterRow in place of SQL.

For a walkthrough of a job that uses only Excel (not a relational database) for an input source, watch the following video. 


Excel Output

This screenshow shows a job using a tExcelFileOutput.  It was created using 4.2.1.  The older format (excel2007 unchecked) was also tested.  The Component View is showing the configuration for the tExcelFileOutput component.

Job Using a tExcelFileOutput

15 comments:

  1. Hi,
    Thanks for the post.Kindly consider my feed back
    1.Please do not rename any of the component as it will be difficult for a newcomer to undertsand the transformation from the figure alone.please add some numbers if you need to use it in description.
    2.Also please list out screen shots of the content/changes of each components.Else it is difficult to follow unless some one is fully aware of all the features.
    Sanil

    ReplyDelete
  2. Thanks for the feedback.

    I added the component names to the main job diagram and included additional screen shots from the Component View window.

    I'm working on some new video tutorials, so be sure to check out the Bekwam YouTube Channel. The video tutorials are better for those starting out since they aren't as fragmented as the blog posts.

    ReplyDelete
  3. Hi,
    The code works .But Output is not as expected.
    Scenario here is my input and output is excel not DB. Second Excel is look up one say cotain only 2 distict recods.Itration happnes for 5 inputs but in outout I am geeting as
    121 1
    122 2
    This mean reset count and mutilpe count for each repetition is not happening

    ReplyDelete
  4. Hi Sanil,

    I published a video that walks through an Excel-only job. Look at the end of the original post for the "Excel Input (not Database) Walkthrough" section.

    ReplyDelete
  5. Hi Carl,
    Your presentation is awesome.Youtube is so powerful medium.As I am very new to Talend it is really helpful.Also I am planning to go through all your series.
    I tried to impliment this.It works. But excel is not geeting updated.However I can see output in the logs.To update excel automatically do we need to do any seetinsg changes in update excel file transformation?
    Sanil

    ReplyDelete
  6. I don't see a Talend option to update with tFileOutputExcel similar to that used with tAccessOutput. If possible, try to use a database like MS Access or MySQL.

    If not, then create a second Excel file using tFileOutputExcel. Follow up the processing with a sequence of file operations (tFileCopy) that replace the input Excel file with the output.

    Good luck. Thanks for the feedback. I'm going to use the video walkthrough in future posts.

    ReplyDelete
  7. Hi Carl,
    I tried with excel. Yes log is generating. But only last set of iterating group is getting written to excel. I tied with append option in the sheet also. Still not working. Strange thing is when I am selecting the debug run with traces it shows all the set of records are approaching to excel in the output.
    Again I tried with access. Input and look up I used a table. Same table I used in output as update option. Nothing is getting updated. Again here also traces showing all records coming perfectly till output access. I have checked update option only. Still the problem persists
    Sanil

    ReplyDelete
  8. Hi,

    Which version of TOS and Excel are you using?

    I added a screenshot in the post's body from a job I wrote in 4.2.1 with Excel 2010 Professional and was able to record the output in Excel across multiple iterations.

    If you're having difficulties after reviewing the screenshot, see if you can get a simple tLoop -> tRowGenerator -> tFileOutputExcel job working.

    1. Drag a tLoop, tRowGenerator, and tFileOutputExcel on the canvas.
    2. Connect the tLoop iterate to the tRowGenerator.
    3. Configure the tRowGenerator. Add a field using the default values for name and function (getAsciiRandom). Reduce the '100' rows generated to '1'.
    4. Connect the tRowGenerator's main to the tFileOutputExcel.
    5. Configure the tFileOutputExcel with a filename, both appends checked, and the desired version (2007).

    The result should be a 10-row spreadsheet where each iteration -> a new record.

    ReplyDelete
  9. Hi Carl, Please let me know if the below requirement can be handled in Talend, this is totally a database stuff and I am using oracle.

    below is the record set I am getting from the source

    emp_id job_code position_code account_code event_date
    1 j1 p1 a1 07/01/12
    1 null p1 null 07/09/12

    and I want to store in the destination table as shown below


    emp_id job_code position_code account_code event_date
    1 j1 p1 a1 07/01/12
    1 j1 p1 a1 07/09/12

    if any of the code columns is null I want to go and pick the previous value based upon the emp_id and the event_date, I am just giving a simple example, but there are thousands of records from the source with different emp_id's will be having this kind os scenario, I think this can be hanled using some stored proc / script using a cursor, but checking with you to see if this can be handled in Talend.

    thanks in adavnce.

    John

    ReplyDelete
    Replies
    1. Hi Carl, I thought the format will be good, sending again with comma separated format


      Please let me know if the below requirement can be handled in Talend, this is totally a database stuff and I am using oracle.

      below is the record set I am getting from the source

      emp_id, job_code, pos_code, acct_code,event_date
      1, j1, p1, a1, 07/01/12
      1, null, p1, null, 07/09/12

      and I want to store in the destination table as shown below


      emp_id, job_code, pos_code, acct_code,event_date
      1, j1, p1, a1, 07/01/12
      1, j1, p1, a1, 07/09/12

      if any of the code columns is null I want to go and pick the previous value based upon the emp_id and the event_date, I am just giving a simple example, but there are thousands of records from the source with different emp_id's will be having this kind os scenario, I think this can be hanled using some stored proc / script using a cursor, but checking with you to see if this can be handled in Talend.

      thanks in adavnce.

      John

      Delete
    2. Hi John,

      You can do this in Oracle. Sort the input in the cursor. In a loop, keep track of the "last job_code" in a LOOP. You'll also need to keep track of the "last emp_id" to make sure you're not associating the job_code with the wrong emp.

      Delete
  10. Hi Carl, Checking to see if you had a chance to look at the question?
    thanks
    John

    ReplyDelete
    Replies
    1. Yep. Load the text file in Oracle, and run a stored procedure. The date processing ("previous event_date") sounds like it won't play to Talend's strengths. However, you can use Talend for the file loading instead of sqlldr. Then call the stored proc from Talend. (See this blog for "stored proc" examples.)

      Delete
    2. thanks for the reply Carl, will work on this using the stored proc.

      thanks again
      John

      Delete
    3. Hi John,

      If you're still looking for a Talend-only solution to your problem, check out this blog post called "A Theta Join with Talend Open Studio": http://bekwam.blogspot.com/2012/07/theta-join-with-talend-open-studio.html.

      Delete