Monday, March 25, 2013

Parsing a String using Talend Open Studio's tExtractRegexFields Component

A reader asked how to extract the bond type "CD" from the following input string: "CD Corporation du 20/12/2010 4.5% à 26 semaines".  Although it's easy to grab the first two characters in a tMap using a substring function, there is an off-the-shelf component tExtractRegexFields that can handle varying lengths.

This job uses a tFixedFlowInput to provide data.  The tFixedFlowInput feeds a tExtractRegexFields, which breaks the input into two strings: investment type and remainder.  The tExtractRegexFields is connected to a tMap, which filters the columns.  The result is output to a tLogRow.

Input Data for Regex-parsing Job
The test data consists of four records representing four French investment types: CD, OAT, BTF, BTAN.  The types vary in length: two, three, or four characters.  While a simple substring() in a tMap is a quick way to pluck the first two characters off of a string, that solution won't work when the type's length varies.  Regular expressions can handle the variety.  A regular expression keys off of the repeating structure of the input data rather than fixed positions.
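To make the limitation concrete, here is a minimal Java sketch of the fixed-position approach.  Only the first record comes from the reader's question; the other three records, the class name FixedWidthDemo, and the method name firstTwo are made up for illustration.

```java
public class FixedWidthDemo {
    // Fixed-position extraction: always returns the first two characters.
    static String firstTwo(String record) {
        return record.substring(0, 2);
    }

    public static void main(String[] args) {
        String[] records = {
            "CD Corporation du 20/12/2010 4.5% \u00e0 26 semaines", // from the post (\u00e0 is 'à')
            "OAT 3.25% 25/10/2021",  // hypothetical record
            "BTF 13 semaines",       // hypothetical record
            "BTAN 2.5% 12/01/2014"   // hypothetical record
        };
        for (String record : records) {
            // Prints [CD], [OA], [BT], [BT]
            System.out.println("[" + firstTwo(record) + "]");
        }
    }
}
```

Only the CD record comes out right.  OAT is truncated, and BTF and BTAN become indistinguishable.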

tExtractRegexFields
Regular expression syntax can be intimidating, and is best learned by breaking down examples.  Fortunately, nearly every toolkit from JavaScript to Perl has an easy way to work with regular expressions.  In the regex supporting this example, the investment type (CD, OAT, etc.) is identified as a sequence of word characters (A-Z, a-z, 0-9, and the underscore).  The regular expression will match characters until it encounters one not in the set (in this case, the space character).  The ".*" that follows grabs everything that remains.

The parentheses denote the match groups, one per output column.  Everything in the first set of parens, the investment type \w+, is the first group.  The remainder is the second group.  Here is the outbound schema used in the tExtractRegexFields component.
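The same two-group idea can be tried outside of Talend in plain Java.  This is a sketch under assumptions: the pattern ^(\w+)(.*)$ is reconstructed from the description above rather than copied from the component settings, and the class name InvestmentTypeParser and method parse are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvestmentTypeParser {
    // Group 1: one or more word characters (the investment type).
    // Group 2: everything after it (the remainder).
    // Assumed pattern, reconstructed from the description in the post.
    private static final Pattern RECORD = Pattern.compile("^(\\w+)(.*)$");

    // Returns {type, remainder}, or null if the input does not match.
    static String[] parse(String input) {
        Matcher m = RECORD.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // The reader's input string (\u00e0 is 'à').
        String input = "CD Corporation du 20/12/2010 4.5% \u00e0 26 semaines";
        String[] columns = parse(input);
        System.out.println("type      = " + columns[0]);
        System.out.println("remainder = " + columns[1].trim());
    }
}
```

Because \w+ stops at the first non-word character, the same pattern yields OAT, BTF, or BTAN for the longer types.  Note that the remainder group begins with the space that ended the first group, which is why the sketch trims it.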

Schema Used by tExtractRegexFields
The tMap isn't essential.  I'm using it to filter the remainder column.

A tMap Filtering a Column
Finally, here is the output generated when the job is run.  The tLogRow does the writing.

Input String Parsed to Define an Investment Type
Regular expressions are enormously powerful.  Thankfully, they're available in most languages.  The best way to learn regular expressions is to break down examples using an interpreter like Perl or a JavaScript program.  Although many regular expression implementations behave the same in different environments, there is sometimes a syntactic difference between a Perl and a Java regular expression.  The most notable is the need to escape the all-important backslash ('\') character with a second backslash to form a valid Java string.
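A short Java sketch of the backslash rule follows.  The class name BackslashEscaping and the method isWord are hypothetical; Pattern.matches is the standard java.util.regex call.

```java
import java.util.regex.Pattern;

public class BackslashEscaping {
    // The regex \w+ must be typed as "\\w+" in Java source,
    // because the compiler turns each "\\" into a single backslash.
    static boolean isWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    public static void main(String[] args) {
        // The literal "\\w+" holds three characters: \ w +
        System.out.println("\\w+".length());   // prints 3

        System.out.println(isWord("OAT"));     // prints true
        System.out.println(isWord("CD Corp")); // prints false (contains a space)

        // Matching one literal backslash takes four in the source:
        // the string holds \\ and the regex engine reads that as one \.
        System.out.println(Pattern.matches("\\\\", "\\")); // prints true
    }
}
```

In Perl the same pattern would be written with single backslashes; the doubling is a property of Java string literals, not of the regular expression itself.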




2 comments:

  1. Hi! Do you know, if there is a way to extract specific part of a string inside tMap? I'm trying to do something like
    Var.path.matches("^.*\\\\(.*)$") ? Var.path.split("^.*\\\\(.*)$")[1] : Var.path
    in order to get the last part of the path, but that doesn't seem to work.

  2. Hello,
    Is there a way to use fixedflowinput from database input ?
    I'm getting data like 32 RUE JACQUES IBERT
    CS 50036 and I need to break it on the new line \n after IBERT into 2 outputs rue1 = 32 RUE JACQUES IBERT and rue 2 = CS 50036. How can I do this please?
