JavaFX Tutorials

Tuesday, November 22, 2011

Dynamically Filtering Columns for XML with Talend Open Studio

To filter columns when producing an XML document with Talend Open Studio, use the tFileOutputMSXML component with the "Create empty element if needed" checkbox unset.

Talend Open Studio's tFileOutputMSXML component creates an XML document from an input schema.  This component uses a map dialog to link the input fields to XML elements and attributes.  On the Advanced settings, there is an option that prevents XML elements from being produced if there is no value.  This setting can be used to filter columns.

Remove Empty Elements from Null Input
Using an Excel input and an XML output, define a schema that will include all the columns.  For example, this Excel file has four columns: brand, type, price, and shelf.  Note that "shelf" does not have a value for "sunkist".
Excel Spreadsheet
Create a job that uses a tFileInputExcel component run into a tJavaRow which is run into a tFileOutputMSXML.  Base the tFileInputExcel component's schema on all available columns.  Synchronize all schemas using the "Sync columns" button on the tJavaRow.

Job with Context Variable for Allowed Columns
Create a context variable "OUTPUT_COLUMNS" that is a comma-separated list of column names used in the schema.  These names must match the column names used throughout the job.

Create the mappings for the XML document.  Define a root container tag (sodas) and loop tags (soda).  Map each available column as its own subelement.  Each subelement must retain its input source column name.

Mapping for tFileOutputMSXML
Finally, add in the Java code for the tJavaRow.  This code loops over the list of column names provided by the context variable.  Java Reflection is used to dynamically map the input and output fields.  This avoids having to hardcoded the names using "input_row.brand".  On Advanced settings, java.lang.reflect.Field must be imported.


Class<?> c_in = input_row.getClass();
Class<?> c_out = output_row.getClass();

if( context.OUTPUT_COLUMNS != null ) {

  String[] fieldNames =
     context.OUTPUT_COLUMNS.split(",");

  for( String fn : fieldNames ) {

    // input / output schemas must match

    Field f_in = c_in.getDeclaredField(fn);
    Field f_out = c_out.getDeclaredField(fn);

    if( f_in.get(input_row) == null ) {
      f_out.set(output_row, "");
    } else {
      f_out.set(output_row, f_in.get(input_row));
    }
  }
}



The check for null is handling the case where you want an empty element to show up in the XML.  "shelf" is not filled in for "sunkist", but it should show up if "shelf" is an included column.

With Java Reflection, you can build a filtering mechanism that will produce an XML document tailored to your particular requirement.  Build comprehensive source and target schemas and use a tJavaRow and a setting on the tFileOutputMSXML component to apply the filtering logic.

4 comments:

  1. Hi Carl, I am trying to dynamically filter out some columns and save as JSON file. I think I can use the same way as you mentioned here. But the problem is tFileOutputJSON without options "Create empty element if needed" checkbox. So all the columns I would like to filter out still output into the json file with null value. Do you have any idea on how to do this? Thanks!

    ReplyDelete
    Replies
    1. Have you tried adding a tMap that will replace the null with a ''? A tMap can also exclude a column with nulls, but that might filter out fields with data too.

      Delete
    2. I don't see tMap could exclude a column with nulls. May I know where it is? Thanks!

      Delete
    3. Two other things you could try are a post-processing step or to build up multiple documents if the number of nullable columns is small.

      In the post-processing step, run a pair of regular expression replacements that remove strings of the form "_____" : null, and "____": null (no comma).

      For the multiple documents approach, produce multiple JSON files. Say there are fields firstName, lastName, and companyName. Either firstName and lastName will be set or companyName will be set. Create a JSON file for the firstName / lastName case the omits companyName in the tFileOuputJSON. Next, create a JSON file for the companyName case that omits the firstName and lastName. At the end, merge the two files.

      Good luck

      Delete