When mapping in Talend Open Studio, you may want to use regular expressions to match a lookup table. You can apply a regular expression in a tMap, but often there is a need to apply several expressions. This can lead to a cumbersome structure that isn't easy to reuse in a Job.
For example, the following code shows an ever-increasing tangle of logic that becomes embedded in a component like tMap.
// accruing more and more logic / subject to change
if( input_row.SYSTEM_FUND_SOURCE_NM.matches("Fred.*") ) {
output_row.VALUE = "LOCAL_GOVERNMENT";
} if( input_row.SYSTEM_FUND_SOURCE_NM.matches("Washington") ) {
output_row.VALUE = "LOCAL_GOVERNMENT";
} else if (input_row.SYSTEM_FUND_SOURCE_NM.matches("[Ff][Ee][Dd].*") ) {
output_row.VALUE = "FEDERAL_GOVERNMENT";
}
// ... etc.
It's easier to maintain this type of mapping in a database table or spreadsheet like this.
Lookup Table of Regular Expression Mappings |
The typical tMap or tJoin component works well if you want to use a single value with the equality operator. However, there's no way to compare a regular expression. You can't switch the "=" out for a "matches()".
Using custom Java code, you can set up a Java data structure called a Map (implemented as a HashMap). The key of the Map will be a regular expression and the value will be the looked-up value. The Map will be loaded with the contents of the database table or spreadsheet. As the source data is read in, more custom Java code will be used to examine the Map and find the appropriate value.
Regex Lookup Job |
Subjob 1 - Initialize
The following code is added to the Basic and Advanced Settings tab of a tJava component. The first block creates a Java data structure called a Map using Java generics. The Advanced Settings include the required imports.
// Basic settings
Map<String, String> regexMap =
new HashMap<String, String>();
globalMap.put("regexMap", regexMap);
// Advanced settings
import java.util.Map;
import java.util.HashMap;
Subjob 2 Load Lookup Table
Subjob 2 uses a tFileInputExcel and a tJavaRow to read the Excel spreadsheet into the Java Map. The tFileInputExcel is the lookup spreadsheet shown at the start of this post and its schema is identical to what is shown.
The tJavaRow component will put a FUND_SOURCE_NM in the Map for each of the regular expressions. The regular expression will serve as a key.
Map<String, String> regexMap =
(Map<String, String>)globalMap.get("regexMap");
if( row1.REGEXP != null ) {
regexMap.put( row1.REGEXP, row1.FUND_SOURCE_NM );
}
Subjob 3 Load the Data
The last subjob will take the data and look up the data using the Map created in the first two subjobs. The subjob starts with a tFileInputExcel, which is the source data. That feeds into a tJavaRow which will map all of its fields to the output using the Generate Code feature. An additional field is defined that will contain the value pulled from the Map.
This is the schema behind the tJavaRow component in the third subjob.
tJavaRow Schema with Extra Value Column |
output_row.SYSTEM_FUND_SOURCE_NM = input_row.SYSTEM_FUND_SOURCE_NM;
output_row.FUND_NM = input_row.FUND_NM;
output_row.RESTRICTED = input_row.RESTRICTED;
output_row.REVENUES = input_row.REVENUES;
output_row.VALUE = "";
Map<String, String> regexMap = (Map<String, String>)globalMap.get("regexMap");
for( String key : regexMap.keySet() ) {
if( input_row.SYSTEM_FUND_SOURCE_NM != null ) {
if( input_row.SYSTEM_FUND_SOURCE_NM.matches(key) ) {
output_row.VALUE = regexMap.get(key);
break;
}
}
}
Here's the output from running the job.
Run of Regex Lookup Table Job |
Congratulations Carl for the article. It was very useful to me
ReplyDeleteThanks.
Evando.
Hi Evando,
ReplyDeleteI'm glad you found it useful.
-Carl
This is really awesome...Thanks a lot for posting this. I'm going to give this a try....I have a fairly complicated data flow and I'll be using this technique to do a conditional replace. I wish replacelist had conditional elements in it.
ReplyDeleteThe logic in the for loop from the code in "Subjob 3 Load the Data" can be as complex as needed. Here it's doing a straightforward regex matches() on a single column.
ReplyDeleteIf you need a lot more flexibility, consider storing a dynamic expression in the spreadsheet. For example, if you need to establish a match between varying numbers of columns.
This post uses Jexl (Javascript) in Talend to form such expressions.
http://bekwam.blogspot.com/2011/04/embedding-expressions-with-jexl-in.html
Good luck!
Great! Thanks.
ReplyDelete