Wednesday, July 20, 2011

XPath Loops in Talend Open Studio

When analyzing an XML document for processing, one tends to think top-down.  For example, "in a library, give me all the books" implies a <library><books><book> structure.  However, it may be easier to think from the innermost elements, outward: "give me all the books and their library".

Talend Open Studio uses the tFileInputXML component to read XML documents into a job.  tFileInputXML uses a Loop XPath query to define a repeating structure in the document, against which a series of mapping XPath queries is run.  There is one mapping XPath query for each schema field to be set during processing.

Bottom-up Processing

When working with a hierarchical structure like a filesystem, one starts at the top and drills down to lower-level elements.  However, in XPath processing with Talend Open Studio, it's important to start with the lowest-level grain that will define a record.  For example, the following XML document has ID elements in an IDs element contained within a Locations element.

<Locations>
   <IDs>
     <ID sequenceName="Name">ABCDE</ID>
     <ID sequenceName="Site"/>
     <ID sequenceName="Bin">XYZ</ID>
   </IDs>
</Locations>


The first step in processing this XML document is to determine whether each ID is a record (in which case there will be three rows produced by tFileInputXML) or if the IDs element defines a record (only one row).

Starting at the lowest level possible, this Talend job produces three name/value pairs, one for each ID element.  The loop is set to Locations/IDs/ID.  The mapping @sequenceName returns the value of the sequenceName attribute.  The period (".") returns the text in the ID element; it stands for the current element, which is the ID defined in the loop.
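To make the loop-and-mapping idea concrete outside of Talend, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is only an illustration of the same XPath logic, not what tFileInputXML generates; the field names "name" and "value" are made up for the example, and treating an empty ID element as an empty string is an assumption.

```python
import xml.etree.ElementTree as ET

XML = """<Locations>
   <IDs>
     <ID sequenceName="Name">ABCDE</ID>
     <ID sequenceName="Site"/>
     <ID sequenceName="Bin">XYZ</ID>
   </IDs>
</Locations>"""

root = ET.fromstring(XML)  # root is the <Locations> element

# Loop XPath "Locations/IDs/ID": relative to the root element, that is "IDs/ID".
rows = []
for id_elem in root.findall("IDs/ID"):
    rows.append({
        "name": id_elem.get("sequenceName"),  # mapping "@sequenceName"
        "value": id_elem.text or "",          # mapping "." (assumed "" when empty)
    })

for row in rows:
    print(row)
```

Each ID element matched by the loop becomes one row, so the sample document yields three rows.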

[Figure: Each ID Defines a Record]

An alternative way of processing the Locations document is to specify the loop element as Locations/IDs.  In this example, a single record will be produced.  Attribute predicates such as ID[@sequenceName="Name"] map each ID element to a different field.
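The single-record variant can be sketched the same way in Python. Again, this only mirrors the XPath logic and is not Talend's generated code; the helper id_text and the output field names "name", "site", and "bin" are hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<Locations>
   <IDs>
     <ID sequenceName="Name">ABCDE</ID>
     <ID sequenceName="Site"/>
     <ID sequenceName="Bin">XYZ</ID>
   </IDs>
</Locations>"""


def id_text(ids_elem, seq_name):
    """Text of the ID child whose sequenceName matches, or "" if absent or empty."""
    elem = ids_elem.find(f"ID[@sequenceName='{seq_name}']")
    return (elem.text or "") if elem is not None else ""


root = ET.fromstring(XML)

# Loop XPath "Locations/IDs": one row per IDs element.
rows = []
for ids_elem in root.findall("IDs"):
    rows.append({
        "name": id_text(ids_elem, "Name"),  # mapping ID[@sequenceName="Name"]
        "site": id_text(ids_elem, "Site"),  # mapping ID[@sequenceName="Site"]
        "bin": id_text(ids_elem, "Bin"),    # mapping ID[@sequenceName="Bin"]
    })

print(rows)
```

Moving the loop up one level trades three narrow rows for one wide row; the predicates do the work of picking each ID out of its siblings.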

[Figure: Containing IDs Element Defines a Record]

In other cases, the child may need extra information from its parent, such as identifying or contextual data.  Suppose "Locations" allowed additional "IDs" elements.  To associate an ID record with its IDs parent, provide a relative reference to the parent (for example, "../@Name" if the IDs element carries a Name attribute).  The parent's value will then repeat for each ID record.
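Here is a rough Python sketch of the parent reference. The Name attribute on IDs and its values ("Warehouse A", "Warehouse B") are invented for this example, since the sample document above has no such attribute. ElementTree elements do not keep a pointer to their parent, so the sketch builds a child-to-parent map to stand in for "..".

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two IDs elements, each with an assumed Name attribute.
XML = """<Locations>
   <IDs Name="Warehouse A">
     <ID sequenceName="Name">ABCDE</ID>
     <ID sequenceName="Bin">XYZ</ID>
   </IDs>
   <IDs Name="Warehouse B">
     <ID sequenceName="Name">FGHIJ</ID>
   </IDs>
</Locations>"""

root = ET.fromstring(XML)

# Child-to-parent lookup, standing in for the ".." step.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Loop XPath "Locations/IDs/ID": still one row per ID element.
rows = []
for id_elem in root.findall("IDs/ID"):
    rows.append({
        "ids_name": parent_of[id_elem].get("Name"),  # mapping "../@Name"
        "name": id_elem.get("sequenceName"),         # mapping "@sequenceName"
        "value": id_elem.text or "",                 # mapping "."
    })

for row in rows:
    print(row)
```

The loop stays at the ID level, and the parent's attribute is simply repeated on every row that belongs to it.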

It's natural to think top-down when looking at a hierarchy.  However, for XML processors it may help to think bottom-up to identify the correct looping structure.  Parents -- and other ancestors -- aren't ignored in bottom-up processing.  Access parent elements and attributes using relative (../) paths.

Namespaces Update

If your input XML uses namespaces and they can be ignored, then set the "Ignore namespaces" option on the tFileInputXML's Advanced settings tab.  This will produce a temp file of the XML data with all namespace definitions and prefixes stripped out.
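To show the effect of stripping namespaces, here is a small Python sketch. It is not how Talend implements the option; the namespace URI, the "loc" prefix, and the strip_namespaces helper are all made up for illustration. After stripping, the plain loop path from the earlier examples matches again.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced version of the sample document.
NS_XML = """<loc:Locations xmlns:loc="http://example.com/loc">
   <loc:IDs>
     <loc:ID sequenceName="Name">ABCDE</loc:ID>
   </loc:IDs>
</loc:Locations>"""


def strip_namespaces(root):
    """Remove the "{uri}" qualifier that ElementTree puts on tags and attributes."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
        for key in list(elem.attrib):
            if "}" in key:
                elem.attrib[key.split("}", 1)[1]] = elem.attrib.pop(key)
    return root


root = strip_namespaces(ET.fromstring(NS_XML))

# With namespaces gone, the unqualified loop path works as before.
values = [e.text for e in root.findall("IDs/ID")]
print(root.tag, values)
```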
 

4 comments:

  1. Do you know the basics of creating an XML file from an XLS file? Thanks in advance for your response.

    1. At a minimum, you need a tFileInputExcel, a tMap, and an XML output component like tAdvancedFileOutputXML. For examples working with the various XML components in Talend, follow the "Bekwam Wiki / Talend" link and read the posts with XML in the title.

  2. How will you get the elements if you have a structure like this?



    ABCDE

    XYZ



    and you want to get the attribute "num" ?

    1. I'm not sure where the attribute num is in the input. The XML tags didn't come through in Blogger and when I look in the HTML, I only see "sequenceNumber". To get sequenceNumber, you'll add an @sequenceNumber to the mapping.
