Talend Open Studio uses the tFileInputXML component to read XML documents into a job. tFileInputXML uses a Loop XPath query to define a repeating structure in the document, against which a series of mapping XPath queries is run. There is one mapping XPath query for each schema field to be set during processing.
When working with a hierarchical structure like a filesystem, one starts at the top and drills down to lower-level elements. However, in XPath processing with Talend Open Studio, it's important to start with the lowest-level grain that will define a record. For example, the following XML document has ID elements contained within an IDs element, which in turn sits under a Locations element.
The first step in processing this XML document is to determine whether each ID is a record (in which case tFileInputXML will produce three rows) or whether the IDs element defines a record (in which case it will produce only one row).
Starting with the lowest level possible, this Talend job produces three name/value pairs, one for each ID element. The loop is set to Locations/IDs/ID. The mapping query @sequenceName returns the value of the sequenceName attribute. The period (".") returns the text in the ID element. The period stands for the current element, which is the ID selected by the loop.
|Each ID Defines a Record|
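The loop-plus-mapping idea can be sketched outside of Talend. The following is a minimal Python illustration, not Talend's generated code. The sample document is a hypothetical stand-in: the element names and the sequenceName attribute come from the text above, but the values (seq1, 100, and so on) are invented. Note that ElementTree evaluates paths relative to the root element, so "IDs/ID" plays the role of the Talend loop Locations/IDs/ID.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: names follow the article, values are made up.
SAMPLE_XML = """
<Locations>
  <IDs>
    <ID sequenceName="seq1">100</ID>
    <ID sequenceName="seq2">200</ID>
    <ID sequenceName="seq3">300</ID>
  </IDs>
</Locations>
"""

def rows_per_id(xml_text):
    """Emit one (name, value) row per ID element."""
    root = ET.fromstring(xml_text)  # root is the Locations element
    rows = []
    # Loop query: one iteration (one record) per ID element.
    for id_elem in root.findall("IDs/ID"):
        name = id_elem.get("sequenceName")     # mapping query: @sequenceName
        value = (id_elem.text or "").strip()   # mapping query: .
        rows.append((name, value))
    return rows

id_rows = rows_per_id(SAMPLE_XML)
for row in id_rows:
    print(row)
```

Each pass through the loop corresponds to one output row, and each mapping query is evaluated relative to the element the loop is currently on.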
An alternative way of processing the Locations document is to specify the loop element as Locations/IDs. In this example, a single record will be produced. Each mapping query uses an attribute predicate ([@sequenceName=""]) to select a specific ID element, so that each ID maps to a different field.
|Containing IDs Element Defines a Record|
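A comparable sketch for the coarser grain is shown below, again in Python rather than Talend. The field names (first_id, second_id, third_id) and the sequenceName values are hypothetical placeholders, since the predicates in the text are left blank. The loop runs once per IDs element, and each field is filled by a predicate that picks out one ID child.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: names follow the article, values are made up.
SAMPLE_XML = """
<Locations>
  <IDs>
    <ID sequenceName="seq1">100</ID>
    <ID sequenceName="seq2">200</ID>
    <ID sequenceName="seq3">300</ID>
  </IDs>
</Locations>
"""

# Hypothetical schema: output field name -> sequenceName value to select.
FIELD_TO_SEQUENCE = {"first_id": "seq1", "second_id": "seq2", "third_id": "seq3"}

def rows_per_ids(xml_text):
    """Emit one dict row per IDs element, one field per selected ID."""
    root = ET.fromstring(xml_text)
    rows = []
    # Loop query: one record per IDs element (Locations/IDs in Talend terms).
    for ids_elem in root.findall("IDs"):
        row = {}
        for field, seq in FIELD_TO_SEQUENCE.items():
            # Mapping query with an attribute predicate, relative to the loop element.
            match = ids_elem.find(f"ID[@sequenceName='{seq}']")
            has_text = match is not None and match.text
            row[field] = match.text.strip() if has_text else None
        rows.append(row)
    return rows

ids_rows = rows_per_ids(SAMPLE_XML)
print(ids_rows)
```

A field whose predicate matches nothing is left as None in this sketch. How a real job should treat a missing ID is a design decision the text does not cover.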
It's natural to think top down when looking at a hierarchy. However, for XML processors it may help to think bottom-up to identify the correct looping structure. Parents -- and other ancestors -- aren't ignored in the bottom-up processing. Access parent elements and attributes using relative (../) paths.
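To illustrate reaching upward from a low-level loop, here is a small Python sketch. It assumes a hypothetical region attribute on the IDs element, which is not part of the original document. In Talend, the mapping query would be a relative path such as ../@region. Python's ElementTree cannot walk above the element a search starts from, so the sketch builds a child-to-parent map as a stand-in for "..".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the "region" attribute is invented for illustration.
SAMPLE_XML = """
<Locations>
  <IDs region="east">
    <ID sequenceName="seq1">100</ID>
    <ID sequenceName="seq2">200</ID>
  </IDs>
  <IDs region="west">
    <ID sequenceName="seq1">300</ID>
  </IDs>
</Locations>
"""

def rows_with_parent(xml_text):
    """Emit one row per ID, including an attribute taken from its parent."""
    root = ET.fromstring(xml_text)
    # ElementTree elements hold no parent pointer, so map each child to its parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    rows = []
    for id_elem in root.findall("IDs/ID"):        # loop stays at the lowest grain
        parent = parent_of[id_elem]               # stands in for ".."
        rows.append((
            parent.get("region"),                 # Talend-style query: ../@region
            id_elem.get("sequenceName"),          # @sequenceName
            (id_elem.text or "").strip(),         # .
        ))
    return rows

parent_rows = rows_with_parent(SAMPLE_XML)
for row in parent_rows:
    print(row)
```

The record grain remains the ID element, and the parent's data is repeated on every row that belongs to it.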
If your input XML uses namespaces and they can be ignored, then set the "Ignore namespaces" option on the tFileInputXML's Advanced settings tab. This will produce a temp file of the XML data with all namespace definitions and prefixes stripped out.
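The effect of dropping namespaces can be approximated in a few lines of Python. This is only an in-memory analogy under assumed inputs, not how Talend implements the option (which, as noted above, writes a temp file). The namespace URI and the loc prefix are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced sample: the URI and prefix are invented.
NAMESPACED_XML = """
<loc:Locations xmlns:loc="http://example.com/locations">
  <loc:IDs>
    <loc:ID loc:sequenceName="seq1">100</loc:ID>
    <loc:ID sequenceName="seq2">200</loc:ID>
  </loc:IDs>
</loc:Locations>
"""

def strip_namespaces(root):
    """Remove the '{uri}' qualifier ElementTree puts on tags and attribute names."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
        for key in list(elem.attrib):             # copy keys: attrib is modified below
            if "}" in key:
                elem.attrib[key.split("}", 1)[1]] = elem.attrib.pop(key)
    return root

root = strip_namespaces(ET.fromstring(NAMESPACED_XML))

# After stripping, the plain unqualified paths work again.
ns_rows = [(e.get("sequenceName"), (e.text or "").strip())
           for e in root.findall("IDs/ID")]
print(ns_rows)
```

Without the stripping step, the unqualified path "IDs/ID" would match nothing, because every tag would carry its namespace URI.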