
Saturday, July 16, 2011

Scanning an Input String in Talend Open Studio

Use Java's regular expressions packaged in a Talend Open Studio Routine to scan an input string and perform an advanced manipulation.

Java's regular expressions are powerful and can be used to handle string manipulation that exceeds the capabilities of split() or replace().  A pattern to follow for this type of manipulation is to build a new string based on applying a regular expression to the input.

The regular expression should be packaged as a Talend Routine which is available in components like tMap.

For example, consider the following input as expressed in the variables n1, n2, and n3.  This Java could be embedded in a tJava component.

String n1 = "Sec1&lib1$$Sec2&lib2";
String n2 = "Sec2&lib2$$Sec4&lib4$$Sec6&lib6";
String n3 = "Sec1&lib1$$Sec2&lib2$$Sec3&lib3$$Sec5&lib$$Sec8&lib8$$Sec9&lib9";
 

System.out.println("n1=" + FilterUtils.filterNonSec(n1));
System.out.println("n2=" + FilterUtils.filterNonSec(n2));
System.out.println("n3=" + FilterUtils.filterNonSec(n3));

filterNonSec needs to remove the non-Sec parameters.  In this input those are the "lib" parameters, but the regular expression solution will drop any other parameter as well.  First define a Sec parameter as a regular expression of the form Sec[0-9]+ where "Sec" is followed by one or more digits.  Any non-digit character following the digits serves as the boundary.
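To see what the pattern does and does not match, here is a minimal sketch in plain Java.  The class name SecPatternDemo and the helper findSecTokens are hypothetical names used only for this illustration; they are not part of the Routine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SecPatternDemo {

    // Same expression as in the post: "Sec" followed by one or more digits.
    private static final Pattern SEC = Pattern.compile("Sec[0-9]+");

    // Hypothetical helper: collects every substring that matches the pattern.
    public static List<String> findSecTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SEC.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The match stops at the first non-digit, here the ampersand.
        System.out.println(findSecTokens("Sec1&lib1$$Sec2&lib2")); // [Sec1, Sec2]
        // Multi-digit numbers are matched whole.
        System.out.println(findSecTokens("Sec12&lib7"));           // [Sec12]
        // "Sec" with no digit after it is not a Sec parameter.
        System.out.println(findSecTokens("Sec&lib1"));             // []
    }
}
```

The last case shows why the + quantifier matters: a bare "Sec" with no number is ignored.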

Expected Output

The expected output drops the non-Sec tokens and the $$ delimiters, and joins the remaining Sec tokens with an ampersand.

n1=Sec1&Sec2
n2=Sec2&Sec4&Sec6
n3=Sec1&Sec2&Sec3&Sec5&Sec8&Sec9

 
Code
 
Call the Matcher method find() repeatedly to build up the output string.  Each successful find() locates the next match of Sec[0-9]+, and group() returns the particular Sec token that was matched (including the number).  For each match, the token is appended to a StringBuffer, preceded by a separator where needed.

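Note the technique for appending the separator.  The separator is appended before each token rather than after it.  On the first pass the separator is skipped and the firstPass flag is cleared, so every later token is preceded by an ampersand and the output never ends with a trailing separator.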

Here is a Talend Routine that can be packaged as "FilterUtils".  Create a Routine "FilterUtils", then swap out the sample static method for this code.


public static String filterNonSec(String _input) {

   // Nothing to scan: return an empty string rather than null
   if( _input == null || _input.length()==0 ) {
     return "";
   }

   StringBuffer output_sb = new StringBuffer();

   // "Sec" followed by one or more digits
   java.util.regex.Pattern p =
       java.util.regex.Pattern.compile("Sec[0-9]+");

   java.util.regex.Matcher m = p.matcher(_input);

   boolean firstPass = true;
   while( m.find() ) {

     // Separator goes before every token except the first
     if( !firstPass ) {
       output_sb.append("&");
     }
     else {
       firstPass = false;
     }

     // group() is the text of the current match, e.g. "Sec2"
     output_sb.append(m.group());
   }

   return output_sb.toString();
}
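As an alternative to the firstPass flag, the matches can be collected in a list and joined in one step.  This is only a sketch, and it assumes the job runs on Java 8 or later, since String.join is not available in older versions.  The class name FilterUtilsJoinSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilterUtilsJoinSketch {

    // Same expression as the Routine above.
    private static final Pattern SEC = Pattern.compile("Sec[0-9]+");

    // Assumes Java 8+ for String.join; behaves like filterNonSec above.
    public static String filterNonSec(String input) {
        if (input == null || input.isEmpty()) {
            return "";
        }
        List<String> tokens = new ArrayList<>();
        Matcher m = SEC.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        // join() places "&" between tokens only, so no flag is needed.
        return String.join("&", tokens);
    }

    public static void main(String[] args) {
        System.out.println(filterNonSec("Sec2&lib2$$Sec4&lib4$$Sec6&lib6")); // Sec2&Sec4&Sec6
    }
}
```

The flag version keeps a single buffer and works on any Java version; the join version trades a temporary list for slightly simpler control flow.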


Regular expressions are a powerful way to handle a complex input string.  In Talend, use a Routine, rather than chunks of Java code embedded in components, to package the expressions.  If you're writing a lot of regular expression code, consider a test-driven methodology so that all of the variations of input (empty strings, nulls, etc.) can be covered.
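The input variations mentioned above can be checked with a small stand-alone class before the code goes into a Routine.  This is a rough sketch, not a full test framework: the class name FilterUtilsCheck and the check helper are hypothetical, and the method under test is copied in so the example runs on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilterUtilsCheck {

    // Copy of the Routine method so this sketch is self-contained.
    public static String filterNonSec(String input) {
        if (input == null || input.length() == 0) {
            return "";
        }
        StringBuffer sb = new StringBuffer();
        Matcher m = Pattern.compile("Sec[0-9]+").matcher(input);
        boolean firstPass = true;
        while (m.find()) {
            if (!firstPass) {
                sb.append("&");
            } else {
                firstPass = false;
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    // Hypothetical helper: fails loudly when actual and expected differ.
    public static void check(String input, String expected) {
        String actual = filterNonSec(input);
        if (!expected.equals(actual)) {
            throw new AssertionError(
                "input=" + input + " expected=" + expected + " actual=" + actual);
        }
    }

    public static void main(String[] args) {
        check(null, "");                        // null input
        check("", "");                          // empty string
        check("lib1&lib2", "");                 // no Sec parameters at all
        check("Sec1&lib1$$Sec2&lib2", "Sec1&Sec2");
        check("Sec1&lib1$$Sec2&lib2$$Sec3&lib3$$Sec5&lib$$Sec8&lib8$$Sec9&lib9",
              "Sec1&Sec2&Sec3&Sec5&Sec8&Sec9");
        System.out.println("All checks passed");
    }
}
```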

Test-Driven Development of Talend Open Studio Routines

5 comments:

  1. Hi,

    I've got a URL, such as:

    /google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11&ved=0CC4QFjAAOAo&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners

    I want to split the string into separate tokens and filter each of them to find which characters a given token contains.

    Any idea how to do that?

  2. Replies
    1. Hi Ilyas,

      Take a look at this blog post: http://bekwam.blogspot.com/2012/07/parsing-url-with-talend-open-studio.html.

      It breaks the URL down. It will give you a stream of name/value pairs. This example doesn't, but you can carry along other items like the host/path.

      Good luck

  3. Hello Carl,

    I have an Excel spreadsheet in which data concerning corporate bonds are stored. There is one column "Libelle" whose values look like "CD Corporation du 20/12/2010 4.5% à 26 semaines".
    What I want in the output is a new column that takes only the type of the bond. In this case "CD".
    Would you give me a hand with that?
    Many thanks

    Replies
    1. Hi,

      Take a look at this post

      http://bekwam.blogspot.com/2013/03/parsing-string-using-talend-open.html
