JavaFX Tutorials

Saturday, October 15, 2011

Updating a Table with the Talend Open Studio ELT Components

ELT stands for Extract-Load-Transform and is an inversion of the data processing pattern "ETL".  In ELT, data is manipulated within a database which is a huge performance improvement.  Talend Open Studio comes with components like tELTMySqlMap to make the writing of such manipulations easier.

When updating data in a database, sometimes its easier to load a file in a staging table and then manipulate the data within the database.  It may even be the only option based on performance requirements.  Using a staging table means loading relatively unclean data from a source like a spreadsheet or CSV file into a schema based heavily on the input.  Subsequent transformations and loads within the database will address the quality and move the data into a more usable format.

The Talend Open Studio (TOS) ELT components (for example, tELTMySqlMap and tELTOracle) will generate SQL INSERT OR UPDATE statements based on Talend metadata.  While these statements can be formed by hand, forming the SQL with Talend will cut down on typing and errors.  It's especially useful on a consulting job where large sets of unfamiliar columns need to be mapped.

 An ELT Job

This TOS job will update a table CONTACT.  CONTACT has a column BUSINESS_ID which will be updated based on a mapping established in the BUSINESS_CONTACT table.  The BUSINESS_CONTACT table uses a business key "BUSINESS_NAME" to retrieve the value BUSINESS_ID.

ELT Job
 BUSINESS and BUSINESS CONTACT are input tables to the tELTMySqlMap component.  The output table is CONTACT.  The schema defined for the CONTACT table in the tMySqlOutput component is a subset of the fields because this job will only be updating a single field.  For inserts, the whole set of fields would be used.

ELT Map

In the map component, the input tables are joined on a business key "BUSINESS_NAME".  From the BUSINESS table, BUSINESS_ID is mapped to the output.  If this were an insert operation, the job is complete.  Because this is an update, there is an additional step.

In an update, the input needs to be correlated with the output. This is accomplished using an alias 'C' on the target table.  The C alias is imposed as a condition on the join table using an expression entered by hand.  Notice that the target table is not added as a third input.  That's because the UPDATE statement generated will add input tables to the FROM cause, not what we want with an "UPDATE CONTACT" statement where the table name follows the UPDATE.

To set up the C alias, change the table name to include an alias in the tELTMySqlOutput component.

Adding an Alias to tELTMySqlOutput
 The result of a run generates the following SQL.

An Update Statement Considering the Output Table
Troubleshooting

If there is an error in the SQL generated by the Talend component, check the list of columns in the ETL Map.  Although this example only updates a single column, I have used an ELT component -- tMSSqlMap specifically -- to update more than one column.  What you're looking for in the list of columns are syntactic differences between a plain column (like 'dateCol') and a function or vendor-specific construct (like 'GETDATE()').  Try removing out anything that isn't a plain column when debugging.


 Sometimes the only way to process a large amount of data is to first stage the data in a loosely-defined and unconstrained table and to manipulate it once the data is in the database.  This is ELT (Extract-Load-Transform), an inversion of the more widely known "ETL".  Using Talend Open Studio's ELT helps generate the SQL for this type of transformation.  These SQL statements can certainly be created by hand, but if you're a consultant or dealing with an unfamiliar schema, the productivity boost of mapping unfamiliar tools with a GUI can be substantial.

8 comments:

  1. Your post is helpful with me, thanks for your guide.

    ReplyDelete
  2. Hi,

    For above example do all the tables in the same schema?

    ReplyDelete
    Replies
    1. Hi,

      This is a scenario where a large flat data file is being loaded into two normalized tables, BUSINESS and CONTACT. This process removes a repeating group "BUSINESS" that is repeated for some of the CONTACTS.

      The loading is done in two stages (you're seeing the second in this post). The first stage quickly loads the text file into two tables using a bulk loading component. However, the linkage between the two tables (BUSINESS and CONTACT) is broken.

      So, a lookup table matching two business keys in the source data: BUSINESS_NAME and PRIMARY_CONTACT_EMAIL. The second stage of processing -- shown in this post -- is an update statement working within the database for efficiency.

      You can avoid the multi-stage processing on smaller data sets, opting for something like a getLastId component chained together with the output component. There are examples of this on the blog. But this may not be the most efficient processing since it can't be done with a bulk load since batching must be turned off.

      Delete
    2. Thanks Carl,

      I have tried to use tELTOracle components and getting errors as all my source and targets are oracle tables only and having huge data. That's reason i plan to use ELT components. and it's straight forward, for dimensions just update and insert(type1) some look ups and aggregations for the fact tables.

      The error is that target schema is not same as the input schema.

      could you please advise me on this, like for performance wise which component would better. i thought of using ELT Components but not able to resolve the error.

      Delete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Thank you, Carl. This post is very useful, nice work you're doing with this blog!

    Unfortunately I wasn't able to follow these steps with SQL Server, because the ELT components are generating a command like this:

    UPDATE mytable alias_different_table_name
    SET attr1 = (SELECT attr1 FROM source_table ... INNER JOIN ...)
    attr2 = (SELECT attr2 FROM source_table ... INNER JOIN ...)
    WHERE ...

    And the expected syntax would be:

    UPDATE mytable
    SET attr1 = A.attr1
    attr2 = A.attr2
    FROM source_table A
    INNER JOIN ...
    WHERE ...

    I don't know if the syntax above is the only one correct, but I tried both methods manually and only the second option had success (I think that is more optimized, also). The first option only worked without the alias defined in the output component and before the 'SET' word.

    Now I'm inserting all the data in a temporary table and then I use a tSQLTemplateMerge component to update/insert the target table.

    Best regards,
    Anselmo

    ReplyDelete
    Replies
    1. It's been a few years and I don't recall seeing these components producing SQL as you expect. The screenshot "An Update Statement..." indicates that I got the same generated SQL as you did. My example deals with a single field and doesn't have an out where clause, but you can see that the single field contains a subquery.

      Delete