Featured Post

Applying Email Validation to a JavaFX TextField Using Binding

This example uses the same controller as in a previous post but adds a use case to support email validation.  A Commons Validator object is ...

Thursday, June 23, 2011

Processing Files with Headers and Footers with Talend Open Studio

Talend Open Studio uses wizards to define schemas for text and Excel files.  However, the wizards aren't sufficient for processing a complex file containing a header or a footer.  Use Java to build a state table that will route the different sections of the file to the correct target.

Wizards like the Delimited File Wizard or the Excel File Wizard define a schema based on a single row/column structure.  Components like tFileInputExcel provide additional control over the file processing, allowing header rows to be skipped.  But if the input header (or footer) needs to be processed, add some Java components to help.

Algorithm

For processing the complex file, use a state table which identifies the sections of the file and describes the section boundaries.  Take the following file as an example

Rows 1-3 contain the header.  Rows 5-9 contain the body.  Row 11 contains the footer.  An empty row separates the header from the body.  An empty row separates the body from the footer.  For processing, this can be written as a state table where working with a section in the file defines a new state.

State    Transition    New State
------------------------------------
Header   Empty row     Body
Body     Empty row     Footer 


The state table drives a process where the header is processed until an empty row is found.  Then, the body is processed until an empty row is found.  Finally, the footer is processed.  In the following Talend Open Studio job, "processed" means routed to the correct target (the tLogRows to the right of the image).

Job Routing Header, Body, and Trailer to Different Targets
 Input

 In the example job, a state table is defined in a tJava (see section "State Table").  The input flows from a loosely-defined tFileInputExcel.  The tFileInputExcel contains a schema with a field for each of the possible input fields, all of which are defined as string.  Usually, the number of fields will be number of fields as the body, but if the header contained more columns, then include that many columns in the generic schema.

Schema Used in tFileInputExcel
Identifying Sections

The tFileInputExcel is connected to a tJavaRow.  The tJavaRow maps each of the input fields to output fields.  The tJavaRow also defines an extra "recordType" field.  This holds the value of the current state and is used in the later tMap to route the record to the correct schema and target.  The tJavaRow also applies the state table to the input, determining if a new section is being processed (see section "Applying the State Table").

tJavaRow schema with recordType Field
Routing

Finally, a tMap is used to route the input to its correct destination.  The tMap defines three output schemas for each of the three sections of the file.  These are NOT loosely defined.  They contain the proper number of columns, column headers, and data types.  Each output schema in the tMap is supported by an expression which will examine the recordType value to determine which schema should be applied.

tMap Defining 3 Output Schemas with recordType Expressions

Output

Each output from the tMap is directed to a different component.  This example uses tLogRows that print out the component name with the fields for debugging.  The output could be three individual tables in an RDBMS in which case a unifying "load number" should be acquired from the DB.  This load number would link together the header, body, and footer records for later re-assembly.

Two tFilterRow components are used to filter out the blank lines that demarcate the file sections but which are included in the processing flow.

State Table

The state table in this job is implemented using a Java Map.  The Map is set up in the tJava at the start of the job.

Here is the code behind the tJava_1 component.  From "Basic settings"

Map<Map<FileState, String>, FileState> stateTable =
     new HashMap<Map<FileState, String>, FileState>();

Map<FileState, String> transition = 
     new HashMap<FileState, String>();  
transition.put(FileState.HEADER, null);
stateTable.put( transition, FileState.BODY );

transition = new HashMap<FileState, String>();
transition.put(FileState.BODY, null);
stateTable.put( transition, FileState.TRAILER );

FileState currentState = FileState.HEADER;

globalMap.put("stateTable", stateTable);
globalMap.put("currentState", currentState);

"Advanced settings" has the usual imports.  There is also a cool (my word) Java language trick to apply an enumerated value.  I slipped in an enum definition for use throughout the job.

import java.util.Map;
import java.util.HashMap;

enum FileState { HEADER, BODY, TRAILER };


Applying the State Table

The state table and its transitions are applied in the tJavaRow component between the tMap and the tFileInputExcel.  The code relies on a nested Map which requires a special two-part key.  This is the code from "Basic settings" in the tJavaRow.

FileState currentState = (FileState)globalMap.get("currentState");

output_row.field1 = input_row.field1;
output_row.field2 = input_row.field2;
output_row.field3 = input_row.field3;
output_row.recordType = String.valueOf(currentState);

Map<Map<FileState, String>, FileState> stateTable =
  (Map<Map<FileState, String>, FileState>)globalMap.get("stateTable");

Map<FileState, String> key = new HashMap<FileState, String>();
key.put( currentState, input_row.field1 );
 
FileState nextState = stateTable.get( key );

if( nextState != null ) {
     currentState = nextState;
     globalMap.put("currentState", currentState);
} 

While there is a lot of Java code in this job to process the complex file, the job design is simple and scalable.  New file sections can be added in a straight-forward manner and the demarcations (currently empty rows) can be more sophisticated. 

9 comments:

  1. I am new to the Talend but would not be possible to use tFileInputMSPositional instead of the above?

    ReplyDelete
  2. Hi,

    I guess somebody stole your blog. Check this link out, it's copied word by word:
    http://talendhunter.blogspot.ca/2013/06/processing-files-with-headers-and.html

    ReplyDelete
    Replies
    1. Thanks for the heads up. It's better to use this blog because you can post comments and get a response.

      Delete
  3. Hi,

    Is there a way to map a text file with a header being at line 1 and footer at last line of file with no spearator and tranform this file into say an XML file with a main element containing a header, body, footer subelements?

    I can't seem to find a simple way to do this in Talend.

    Thanks for your suggestions

    ReplyDelete
    Replies
    1. If you can identify the header and footer with a record type marking (for example, "HDR" and "FTR") you can use this approach.

      http://bekwam.blogspot.com/2012/04/rendering-xml-from-multi-schema-text.html

      If the header and footer are purely positional, you can take multiple passes over your input: first line for header, up to last line for data, last line for footer. Then, direct each flow into a tAdvancedFileOutputXML in append mode.

      http://bekwam.blogspot.com/2011/09/xml-output-from-multiple-data-sources.html

      Delete
    2. Yes, that's exactly what I did. Having multiple tMap passes, one for HDR, one for PRI (detail lines) and one for SE (string end, the footer) each with an output to an tAdvancedFileOutputXML. The HDR xml write pass is in normal overwrite mode and the following passes are in append mode.

      Here's a post I send on Talendforge to see if there are better was to doing this though (obviously I have not received any answer)
      http://www.talendforge.org/forum/viewtopic.php?id=34543

      Thanks for the reply Carl.

      Baha

      Delete
    3. Another way to handle it is shown at
      http://www.howtotalend.com/multischema-file-read-header-detail-trailer-records/

      Delete
  4. good solution for eliminate the header!

    ReplyDelete
  5. HI I have a requirement to validate header footer
    header and footer must follow some regex.
    which should be configurable on the basis of validation i need to move file on diffrent folder say form in ---> valid , for valid and in ----> invalid for file data is not matched

    and this task i need to perform on hdfs file

    ReplyDelete