Featured Post

Applying Email Validation to a JavaFX TextField Using Binding

This example uses the same controller as in a previous post but adds a use case to support email validation.  A Commons Validator object is ...

Thursday, March 10, 2011

Combining Excel Spreadsheets with tUnite in Talend Open Studio

When you need to combine several Excel workbooks or sheets inside a workbook, use Talend Open Studio's tUnite component.

The tUnite component takes one or more inputs and outputs a data flow in the style of a SQL UNION..  If using Excel sources, and configured accordingly, the tUnite component will also output each sheet in the workbook in the data flow.

These Excel workbooks contain business contacts spread across two regions: BusinessContacts_West.xlsx and BusinessContacts_East.xlsx.  Additionally, there are sheets for each fiscal year in the workbook.

An Excel Workbook with Several Sheets
Configuring a component like tFileInputExcel to process each sheet will send all of a workbook's records in each sheet to the output data flow.  For example, the following component configuration is for a tFileInputExcel that references metadata from the repository.

Configuration for tFileInputExcel
This configuration specifies the file to process, BusinessContacts_West.xlsx,.  The configuration references previously-loaded metadata from the repository, marks all sheets for processing ("All sheets"), sets up a header ("Header 1"), and applies that header to each sheet ("Affect each sheet(header&footer)".  The Excel 2007 is also set.  (If you get an OLE error, be sure to verify the version.)

"Header 1" is important so that the headers repeated on each sheet aren't also repeated in the output data flow and mixed with the data.

The two workbooks are configured similarly.  Each workbook's main data flow is sent to a tUnite component which sends its output to a tLogRow (for demonstration purposes).
Sample tUnite Job
The tUnite configuration is straightforward.  It's using the same business contacts schema as the tFileInputExcels which is taken from the repository.
tUnite Configuration
And a run produces

Running the tUnite Example
tUnite provides a nice alternative to writing code that loops through workbooks and sheets.  You may need to write a loop if you need access to the iteration controls like the workbook name or a value encoded in the sheet name.  If this isn't the case, tUnite is preferred because configuration over code means less testing.

3 comments:

  1. How do I set up the metadata repository that you mention in this tutorial?

    ReplyDelete
    Replies
    1. In TOS for Data Integration, expand the Metadata item in the Repository view. Right-click on "File excel" and select "Create file Excel". Follow the wizard to select the input file and to parse out a schema.

      Delete
  2. How to break the Talend job if wrong input excel file taken

    I have excel file, in first row of the file, it contains heading like "Department-Food" or "Department-Clothes", I need to read file only if it is "Department-Food" in tMap.
    if it is not "Department-Food" than the flow should break.

    tflowList >> tFileInputExcel >> tMap >> tFlowMeter >> tpostgres

    My requirement is, in tMap if the file is wrong file than nothing should be recorded in tFlowMeter not even "0", the flow should break at tMap itself.

    ReplyDelete