Using the standard XML components for inputting XML means that your job can be supported by the Talend Open Studio community. (See this blog post for an example.) Yet there are cases where adding Java code can make your job more robust, typically when the XML is complex or can be written in many variations. This Java code is best paired with a data binding tool that generates Java classes from a sample XML document or an XSD. The following is a list of data binding tools that I've used.
- Liquid Technologies' XML Data Binder (commercial)
- XML Beans
- Castor
The XML I'm working with in this post is based on the Recordare MusicXML standard for transmitting a musical score. This is a section of the document.
<score-partwise xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.0">
    <part-list>
        <score-part id="A1">
            <part-name>Music</part-name>
        </score-part>
    </part-list>
    <part id="A1">
        <measure number="1">
The score is divided into parts which are listed in both the header (part-list) and in individual sections (part). A part is made up of measures. The job in this post will handle single-part scores and will output the part-name along with a list of the measures.
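For comparison, the same two values can be pulled out with the XPath API that ships with the JDK, with no data binding involved. This is only a sketch: the `ScoreXPathSketch` class and its `SAMPLE` document are hypothetical, and the sample is a completed, minimal version of the fragment above rather than a real MusicXML file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ScoreXPathSketch {
    // Hypothetical single-part document modeled on the fragment above.
    static final String SAMPLE =
          "<score-partwise version=\"1.0\">"
        + "<part-list><score-part id=\"A1\">"
        + "<part-name>Music</part-name>"
        + "</score-part></part-list>"
        + "<part id=\"A1\"><measure number=\"1\"/><measure number=\"2\"/></part>"
        + "</score-partwise>";

    // Returns the text of the first part-name, or "" when there is none.
    static String partName(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(
                "/score-partwise/part-list/score-part/part-name",
                new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not read part-name", e);
        }
    }

    // Returns the number attribute of every measure, in document order.
    static List<String> measureNumbers(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                "/score-partwise/part/measure/@number",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> numbers = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                numbers.add(nodes.item(i).getNodeValue());
            }
            return numbers;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not read measures", e);
        }
    }
}
```

String-based paths like these are quick to write, but nothing checks them against the schema at compile time, which is one reason generated classes can be attractive for complex documents.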
Custom Java Code
In addition to the Java classes generated by the XML binding tool, I've written a class using the Adapter design pattern. An Adapter class changes the interface of a class into something more handy. In this case, the Adapter class digs into the generated Java classes and extracts nested members and collections. There is also some null protection for elements that the XSD does not require.
The first method extracts a part-name. If you're used to XPaths, the general flow of code like this is to make one method call for each path element: part-list/score-part/part-name -> getPart_list()/getScore_part()/getPart_name(). Because score_part is optional, it is null-checked before its members are accessed, which prevents a NullPointerException.
public String getPartName() {
    String retVal = null;
    Part_list part_list = score_p.getPart_list(); // mandatory
    Score_part score_part = part_list.getScore_part(); // optional
    if( score_part != null ) {
        Part_name part_name = score_part.getPart_name(); // mandatory
        retVal = part_name.getPrimitiveValue();
    }
    return retVal;
}
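When more than one level of the path is optional, nested if statements pile up quickly. One alternative is to chain the getters through java.util.Optional. The sketch below uses small hypothetical stand-in classes (`PartList`, `ScorePart`, `PartName`), not the classes a binding tool would actually generate, so treat the names and getters as assumptions.

```java
import java.util.Optional;

final class NullSafeAdapterSketch {
    // Hypothetical stand-ins for binder-generated classes; real ones will differ.
    static final class PartName {
        private final String value;
        PartName(String value) { this.value = value; }
        String getValue() { return value; }
    }

    static final class ScorePart {
        private final PartName partName;
        ScorePart(PartName partName) { this.partName = partName; }
        PartName getPartName() { return partName; }
    }

    static final class PartList {
        private final ScorePart scorePart; // optional element, may be null
        PartList(ScorePart scorePart) { this.scorePart = scorePart; }
        ScorePart getScorePart() { return scorePart; }
    }

    // Walks part-list/score-part/part-name; any null along the way yields null.
    static String partNameOrNull(PartList partList) {
        return Optional.ofNullable(partList)
            .map(PartList::getScorePart)
            .map(ScorePart::getPartName)
            .map(PartName::getValue)
            .orElse(null);
    }
}
```

Each `map` call short-circuits to an empty Optional when the previous getter returned null, so the chain reads like the XPath it mirrors.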
The second method extracts a type-safe list of measure numbers.
@SuppressWarnings("unchecked")
public List<String> getMeasureNumbers() {
    List<String> retVal = new ArrayList<String>();
    PartCol parts = score_p.getPart(); // mandatory
    Iterator<Part> iterator = parts.getIterator();
    while( iterator.hasNext() ) {
        Part p = iterator.next(); // mandatory
        MeasureACol measures = p.getMeasure();
        Iterator<MeasureA> iterator2 = measures.getIterator();
        while( iterator2.hasNext() ) {
            MeasureA m = iterator2.next();
            try {
                retVal.add( m.getNumber() );
            } catch(LtException ignore) {
                // skip a measure whose number cannot be read
            }
        }
    }
    return retVal;
}
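The nested iteration above is a flattening operation: a collection of parts, each holding a collection of measures, becomes one list of numbers. If the generated collections can be exposed as java.util.List (an assumption; the generated collection classes here only show a getIterator() method), the same idea can be sketched with streams. `Part` and `Measure` below are hypothetical stand-ins.

```java
import java.util.List;
import java.util.stream.Collectors;

final class MeasureFlattenSketch {
    // Hypothetical stand-ins for the generated part and measure classes.
    static final class Measure {
        private final String number;
        Measure(String number) { this.number = number; }
        String getNumber() { return number; }
    }

    static final class Part {
        private final List<Measure> measures;
        Part(List<Measure> measures) { this.measures = measures; }
        List<Measure> getMeasures() { return measures; }
    }

    // Flattens parts -> measures -> number into a single list, keeping order.
    static List<String> measureNumbers(List<Part> parts) {
        return parts.stream()
            .flatMap(part -> part.getMeasures().stream())
            .map(Measure::getNumber)
            .collect(Collectors.toList());
    }
}
```

The explicit while loops remain the better fit when a getter can throw a checked exception, as getNumber() does in the adapter, since lambdas do not handle checked exceptions gracefully.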
To create an adapter, call the empty constructor, then initialize it with a string of XML.
public ScorePartwiseAdapter() {}
public void init(String xml) throws LtException, IOException {
    this.xml = xml;
    score_p = new Score_partwise();
    score_p.fromXml(xml);
}
The full code listing is here.
Job Design
After loading the libraries and putting a 'ScorePartwiseAdapter' object on the globalMap, a Clob (Memo) of XML is read in from an Access table. The ScorePartwiseAdapter is initialized on this XML text and the convenient getPartName() and getMeasureNumbers() method calls are made. The result is a flow that fills up a two-element schema: partName and measureNumbers. measureNumbers is a Java List.
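The pattern of creating the adapter once, storing it on the globalMap, and reusing it for every row can be sketched in plain Java. In Talend the globalMap is a Map of String keys to Object values; everything else here is an assumption made for illustration: the key name `scoreAdapter`, the `processRow` helper, and a `StubAdapter` that returns canned values in place of the real ScorePartwiseAdapter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class GlobalMapSketch {
    // Hypothetical key under which the adapter is stored.
    static final String ADAPTER_KEY = "scoreAdapter";

    // Stand-in for ScorePartwiseAdapter; it returns canned values only.
    static final class StubAdapter {
        private String xml;
        void init(String xml) { this.xml = xml; }
        String getPartName() { return "Music"; }
        List<String> getMeasureNumbers() { return Arrays.asList("1", "2"); }
    }

    // Roughly what a row-level Java component would do for each XML document:
    // fetch the shared adapter, re-initialize it, and fill the two output fields.
    static Object[] processRow(Map<String, Object> globalMap, String xml) {
        StubAdapter adapter = (StubAdapter) globalMap.get(ADAPTER_KEY);
        adapter.init(xml);
        return new Object[] { adapter.getPartName(), adapter.getMeasureNumbers() };
    }
}
```

Because the adapter is re-initialized per row, one instance on the globalMap is enough for the whole job run.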
To process the Java List 'measureNumbers', a tLoop is used that will iterate over the elements. The tLoop is preceded by a tFlowToIterate and followed by a tIterateToFlow. I find it easiest to deal with flows in Talend Open Studio, but the iteration is needed because the Java List is not a flow.
Job Reading XML
Calling ScorePartwiseAdapter Methods
Output
The result of this test job is a log message to System.out. However, any flow-based output component can be used. To do this, a loop is executed for each XML document. This loop is based on the measureNumbers list and tLoop_1 is configured as follows.
Iterating Over Java Collection
The loop gets the part name and the list of measure numbers from a tFlowToIterate component, which converts the two fields of the tJavaRow_1 schema into a pair of global variables. The tLoop_1 component feeds into a tIterateToFlow component so that any flow-based component can be used downstream.
tIterateToFlow Component
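The three components can be pictured as plain Java working against a shared map. This is a simulation under assumptions, not generated Talend code: the `row2.` key prefix, the `tLoop_1_CURRENT_VALUE` key, and the semicolon-joined output are illustrative choices, and the actual names in a job depend on how its connections and components are labeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LoopOverListSketch {
    // Simulates tFlowToIterate -> tLoop -> tIterateToFlow for one input row.
    static List<String> run(String partName, List<String> measureNumbers) {
        Map<String, Object> globalMap = new HashMap<>();

        // tFlowToIterate: each field of the row becomes a global variable.
        globalMap.put("row2.partName", partName);
        globalMap.put("row2.measureNumbers", measureNumbers);

        @SuppressWarnings("unchecked")
        List<String> numbers = (List<String>) globalMap.get("row2.measureNumbers");

        List<String> flow = new ArrayList<>();
        // tLoop: a for loop from 0 to size - 1 that publishes its current index.
        for (int i = 0; i < numbers.size(); i++) {
            globalMap.put("tLoop_1_CURRENT_VALUE", i);

            // tIterateToFlow: build one output row from the global variables.
            int index = (Integer) globalMap.get("tLoop_1_CURRENT_VALUE");
            flow.add(globalMap.get("row2.partName") + ";" + numbers.get(index));
        }
        return flow;
    }
}
```

The result is one output row per measure, each repeating the part name, which is the shape a flow-based output component expects.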
The input in this job can be any flow-based input. This example uses a Memo (Clob) field in Access.
Writing jobs using standard components gives your jobs the widest possible support across the Talend community. This applies to working with XML. However, there are cases where the off-the-shelf XML processing components aren't sufficient, such as when the XML is complex, varied, and/or tied up with extract or load processing. In these cases, an XML data binding tool, commercial or open source, combined with the established Adapter pattern can make creating the TOS job easier.
The Castor project isn't at the URL you gave anymore, which now looks like a cesspool. It moved to Codehaus:
http://castor.codehaus.org/xml-framework.html
Thanks! I updated the link in the post.