Wednesday, June 22, 2011

Dynamic Schemas in Talend Open Studio

When working with input data from a source like a spreadsheet, which is subject to editing by users, you can't always count on a strict format.  For instance, the columns may be reordered.  But if the spreadsheet contains column headers that are still accurate, you can map it to a schema using Talend Open Studio.

In Talend Open Studio (TOS), you define schemas to describe the purpose and data types for a source or target.  With flat files or spreadsheets, TOS is strict about associating the column position with a particular field definition.  If the first field is designated as "First Name" in the File Excel wizard, the input is expected to contain "First Name" in its first field despite what the column header says.

To build a schema on-the-fly, take two passes on the input file.  First, record the header information in a data structure to be used in later processing.  For each input record, consult the data structure.

  The Talend Exchange has a good example of this in "Example job for the dynamic schema routine" at (rbaldwin).  The job uses Java Reflection wrapped up in a Talend Routine.

The following job is an alternative example that is identical in its algorithm, but varies in the Java and the set of Talend Components.

A Talend Job Reading a Header Prior to Processing


The file is processed in three stages

  1. Initialize a data structure
  2. Read in and record the header
  3. For each input record, map the fields to the correct schema using the header data structure
The header data structure is a java.util.Map.  This is created and associated with the globalMap in a tSetGlobalVar.  There are also Java imports in the Advanced settings for java.util.* and java.lang.reflect.*.

The tSetGlobalVar creates the java.util.Map.  In the "Basic settings" of the tSetGlobalVar, add an entry with key "fieldMap".  This will reference the following object.

  new HashMap<String, String>()

"fieldMap" should be quoted.  Don't quote the new command.

Record the Header

The header is read in using a tFileInputDelimited which is configured to stop after the first record (Limit=1).  The tFileInputDelimited is based on a loose schema where each field is a type and there is one field for each possible input column.  The names of the fields are left generic so that they aren't mislabed if a different column order is used.

A tJavaRow calls the put() method on the header Map.  The key of the map is the true field name, taken from the first line of the input file.  The value of the map is the corresponding column name.  For example, if the first column header is "Field_B" and the first field of the loose tFileInputDelimited schema is "Column0", then Field_B/Column0 is recorded.

Here is the Java Reflection code used in the tJavaRow_3 component.  For each field in the input row, record the first line's text (it's only reading in one line) and the column name of the loose schema.

Map<String, String> fieldMap = (Map<String, String>)globalMap.get("fieldMap");

for( Field f : row1.getClass().getFields() ) {
   fieldMap.put((String)f.get(row1), f.getName());

Input Processing

The file is processed in a second pass.  The same tFileDelimited component is used (same file, same schema).  However, the second pass will not use Limit=1, but will use Header=1.  This skips the header and reads in the rest of the file.

This job uses a tMap to establish the target schema.  The tMap controls the eventual ordering of the columns.
Map Using Dynamic Column Values

 Vars are used to reduce the amount of Java code in the target mapping.  Each Var reads in the mapped values of the fieldMap using the target field name as a key.  The mapped values are used in the OUT schema.


This is a video walkthrough of the dynamic schema job.

From this example, and rbaldwin's on Talend Exchange, you can see that it's feasible to accept input files where the column order is rearranged.  However, you will need to work with Java (or Perl) to extend the standard File Delimited or File Excel wizards.

Update for 5.x

I think there may have been a change in the imports available to Java embedded in Talend components since I wrote this post.  I don't have a 4.x around to verify.  But to fix this, you have two options.

  1. Fully qualify the Java classes.  Replace each "HashMap" with "java.util.HashMap" and each "Map" with "java.util.Map".
  2. Use a tJava to add the imports.  See the screenshot below for an job that starts with a tJava leading into the tSetGlobalVar.
Adding Imports to the Job

A few comments have mentioned an NPE on the searchFields() method.  If you get an error like this

Exception in thread "main" java.lang.NullPointerException
at java.lang.Class.searchFields(
at java.lang.Class.getField0(
at java.lang.Class.getField(

at com.bekwam.resignator.ReflTest.main(

It's from a null being passed into the getField() call.  For example,


Another potential problem is "fieldMap".  If that String is different than the Map added in the preceding step, then the code that I have in the Var section will throw an NPE.


  1. Thank you very much for this post, it really helped me a lot. I have subscribed to your RSS feed, interesting stuff...

  2. in tjavaRow component
    for(Field f: row1.getClass().getFields())
    what is the code of Field class here, can you please type full code as i am not a java guy so its difficult for me to write it
    mail me at

    1. This is the full Java code for the example. Make sure that you follow the import step described in "Algorithm".

      1. On the tJavaRow, select the Advanced settings tab.
      2. Add the following lines to the Import box

      import java.util.*;
      import java.lang.reflect.*;

  3. thanks carl for ur help..some errors are removed..
    now these errors are upfronting...
    Exception in component tJavaRow_1
    java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
    at anchit.anchit_0_1.anchit.tFileInputDelimited_1Process(
    at anchit.anchit_0_1.anchit.tSetGlobalVar_1Process(
    at anchit.anchit_0_1.anchit.runJobInTOS(
    at anchit.anchit_0_1.anchit.main(
    JOB NAME IS "anchit"

    and can u send me your e-mail id so we can be in contact...i have mentioned mine above....

    1. Hi,

      Take a look at the screenshot "Map Using Dynamic Column Values" and make sure that the entries in the Var panel match what you're using. Pay particular attention to the parenthesis.

      My email is bekwam@

  4. Error is not in tmap
    it's in tjavarow component...and these are run time error...No compile time errors are there.

    1. Double-check the code in the tSetGlobalVar. See the video at :28 seconds in.

      The problem may be with this line


      which is storing a String under "fieldMap" rather than a Map.

      I'm adding the tSetGlobalVar code in the video to the post under the "Algorithm" section.

  5. ya thanks carl, that job is done.


  6. Hi Carl,

    Can Dynamic Schema be applied on addition of columns also in TOS, as if the source file is updated by adding a new column then the same job should take that additional column by itself.

    Thanks in advance

    1. Hi Anchit,

      I haven't seen this in Open Studio. Enterprise has a feature that may be what you're looking for.

  7. Hi Carl,

    It's ok about additional columns.
    How can we edit automap funtionality of tmap component in TOS and make that automapping as direct mapping without matching column names.

    Thanks in advance

  8. Hi,

    I used a "Match By Position" setting in Pervasive Data Integrator as an alternative to "Match By Name", but haven't found a similar feature in Talend. Not only are matching names important in Talend, but order is also. I found that two schemas -- say from a tMap into a tOracleOutput -- will produce an error if the order isn't matched manually or Sync Columns is pressed.

  9. Hi Carl,

    Thanks for the information, but anyhow can we edit tmap in talend to use "automap" option as match by position.
    By writing some java code or removing any code from tmap.


  10. Hi Carl,

    I have a similar query as above. Can we map the columns dynamically?
    I mean in tMap we have to map columns manually. But lets say we have a file with input and out column names which needs to be mapped. Can we somehow read this file and then map the columns accordingly? Any workaround?


    1. Hi,

      It sounds like you don't want to use tMap at all and rely on the positions of your source/target mapping file to serve as the map.

      If this is the case, you can expand on this blog post by including two other data structures. The blog post defines a fieldMap which is a mapping of the input fields to the column-oriented position (for example, LastName/Column0, FirstName/Column1). You need the same thing for the output fields, but in reverse (for example, Column0/LN, Column1/FN). Finally, you'll need a map for the source / target mapping (LastName/LN, FirstName/FN).

      Use a tJavaRow to process the input coming from a component like a tMySqlInput. The tMySqlInput will be based on a general schema with sufficient colums for the mappings and with basic datatypes. Such as schema could be Column0:String, Column1: String, Column2: String. In the tJavaRow, lookup the input field name based on position. Take this field name, and lookup the output field name in the source / target mapping. Write the output field by looking up the position (Column0) from the output field mapping based on what you've gathered from the source / target mapping.

      The basic structure of the job will include 3 flows that take 3 passes on the source / target mapping field to produce fieldmap, outputfieldmap, and sourcetargetmap. An input component -- say a tMySqlInput -- is based on a loosely defined schema with columns for each of the fields left with their positional names (Column1, Column2). This components is run into a tJavaRow that applies the maps. Finally, write the output with an output component with a positional-based schema (Column1, Column2).

      Good luck

    2. Correction: The output field map doesn't need to be in reverse (Column0/LN) but rather structured so that the field name from the mapping file is the key (LN / Column0).

  11. Exception in component tMap_1
    at java.lang.Class.searchFields(Unknown Source)
    at java.lang.Class.getField0(Unknown Source)
    at java.lang.Class.getField(Unknown Source)
    at projectname.internet_scenario_0_1.INTERNET_SCENARIO.tFileInputDelimited_2Process(
    at projectname.internet_scenario_0_1.INTERNET_SCENARIO.tFileInputDelimited_1Process(
    at projectname.internet_scenario_0_1.INTERNET_SCENARIO.tSetGlobalVar_1Process(
    at projectname.internet_scenario_0_1.INTERNET_SCENARIO.runJobInTOS(
    at projectname.internet_scenario_0_1.INTERNET_SCENARIO.main(

    i am getting this error.can you please help me ..

    1. Is this related to Dynamic Schemas or some other error?

  12. Hi Carl,

    I tried this exercise, but got an error that HashMap can not be resolved as type.

    I am using TOS 5.4 version.

    Could you please help me.


    1. Hi. I put a screenshot up in the post that should help. It's an old post and I think there were some implied imports in previous versions of Talend.

  13. here in this scenario if the incoming file column names change,then how to fetch them dynamically in talend open studio ?

    1. You can look at the first line of the file (the header) and parse the results into a Java structure called a Map. As you step through the tokens, you can build up this data structure using a LinkedHashMap which will retain the positions for later mapping using this technique. For example, chopping up the string "first_name,last_name" would result in two map records first_name/0 and last_name/1.

      Take a look at the File Positional definition in the metadata section. Maybe that can implement your requirements off-the-shelf.

    2. Thanks Carl, any additional tuto?

  14. Hello,
    I used your solution to load an excel file with dynamic schema. It was working until the total number of columns exceed 53 columns.
    I want to extract 53 columns in my excel file but when an extra column is added between theses 53 columns and not after the last one, I get this error :
    Exception in component tMap_4
    at java.lang.Class.searchFields(Unknown Source)
    at java.lang.Class.getField0(Unknown Source)
    at java.lang.Class.getField(Unknown Source)
    I tried to fix the initial capacity of hashmap in the tsetGlobalVar without success (new HashMap(100000) ).
    Do you know why? Thank you.

    1. It seems like something is missing rather than a problem with the capacity. Try adding pair of System.out sbefore the get() call in the for loop.

      > System.out.println("f=" + f);
      > System.out.println("f.row1=" + f.get(row1));
      fieldMap.put((String)f.get(row1), f.getName());

  15. Dear Carl,
    Can you please share your actual job so that we can just import as item and can run in our talend. May be upload in your Google drive and share link.

    1. Hi,

      Unfortunately, I lost some of the examples in this blog to a hard drive crash back in 2011.

  16. Hi Carl,

    I have tried to follow the tutorial but I am getting the below error in the t_Map component:

    Exception in component tMap_1
    at firstproject.dynamicschema_0_1.DynamicSchema.tFileInputDelimited_2Process(
    at firstproject.dynamicschema_0_1.DynamicSchema.tFileInputDelimited_1Process(
    at firstproject.dynamicschema_0_1.DynamicSchema.tSetGlobalVar_1Process(
    at firstproject.dynamicschema_0_1.DynamicSchema.tJava_1Process(
    at firstproject.dynamicschema_0_1.DynamicSchema.runJobInTOS(
    at firstproject.dynamicschema_0_1.DynamicSchema.main(

    1. Hi,

      Have you followed my troubleshooting suggestion from January 29th? It's a few messages up.

    2. Hi Carl,

      Thanks for your reply. I have tried your suggestion, and the output is:

      f=public java.lang.String firstproject.dynamicschema_0_1.DynamicSchema$row1Struct.Column0
      f=public java.lang.String firstproject.dynamicschema_0_1.DynamicSchema$row1Struct.Column1
      f=public java.lang.String firstproject.dynamicschema_0_1.DynamicSchema$row1Struct.Column2

      And I am still getting the same error.

    3. Can you verify the settings in the tMap screenshot with the tMap? Specifically, make sure that there is an object in globalMap called "fieldMap" and that each of the field lookup values is of the property case.

    4. Hi Carl,

      Thanks for your reply again. I have identified the problem. A novice mistake (which I am).

      The problem is, the HashMap I have created is named "fieldMap", but while using it in GlobalMap I was referring to it as "FieldMap". It was a problem of one uppercase letter.

      Really sorry for wasting your time, I am feeling really dumb now.

  17. No problem. You're anonymous. We'll keep this exchange up for the next person.

  18. Hello Carl,

    I followed your approcach but it is giving nullpointerException whenever there is NULL in the source column. How can we handle this? Although all the columns are selected as NULLable in the mapping.

    1. I posted an update that might help solve your NPE.

  19. Nice tutorial, I was searching for something like this for quite a while. Thanks a lot

  20. I'm not sure if this post is still active. But, I'm getting two errors on the tJavaRow component, "row1 cannot be resolved", "row1 cannot be resolved to a variable".

    I added tJava component as you recommended, because I'm using 5.5.1.