Tuesday, July 10, 2012

Parsing a URL with Talend Open Studio

If you need to process URLs with Talend Open Studio, a few well-placed components can break apart the URL parameters to be stored, converted, or filtered.

Given the following URL:

/google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11&ved=0CC4QFjAAOAo&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners

You can break the string apart with three Talend Open Studio components, producing a stream of name/value pairs.  This screenshot shows such a job running.

Name / Value Pairs Extracted from a URL

Depending on requirements, additional columns can be carried through the processing to provide a business key (such as host or path).  This screenshot shows the job.

Job Parsing a URL - Two Extra Delimited Fields and a tNormalize

Three components hack off various pieces of the URL.  First, tExtractDelimitedFields_1 separates the host/path from the QUERY_STRING using the "?" delimiter.  Next, a tNormalize splits the QUERY_STRING on the "&" delimiter, turning each name/value pair into its own row.  Finally, tExtractDelimitedFields_2 separates the name from the value, based on "=".

The tFilterColumns component is used for presentation purposes only; it removes the pre-processed "path" column from the output.

Here are the component configurations, starting with the tFixedFlowInput component that provides the test data.

A tFixedFlowInput with a URL


tExtractDelimitedFields_1
tNormalize 
tExtractDelimitedFields_2

While some custom Java can be thrown into a tJavaRow, this blog post presents a cleaner alternative.  It's cleaner because it's based on the schema, rather than on some Java code that could suffer a syntax error.
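For comparison, here is a rough sketch of what the custom Java route might look like.  It is plain Java rather than actual tJavaRow code, and the class and method names (UrlParamSplitter, split) are made up for illustration.  The three splits mirror the components described above: "?" for the path and query string, "&" for one row per pair, and "=" for name and value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Talend: mirrors the three-component job in plain Java.
public class UrlParamSplitter {

    /** Returns one {path, name, value} array per query-string parameter. */
    public static List<String[]> split(String url) {
        List<String[]> rows = new ArrayList<>();

        // Step 1 (like tExtractDelimitedFields_1): host/path vs. query string on "?"
        String[] parts = url.split("\\?", 2);
        String path = parts[0];
        if (parts.length < 2 || parts[1].isEmpty()) {
            return rows; // no query string, so no name/value rows
        }

        // Step 2 (like tNormalize): one row per pair, split on "&"
        for (String pair : parts[1].split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty segments such as a trailing "&"
            }
            // Step 3 (like tExtractDelimitedFields_2): name vs. value on the first "="
            String[] nameValue = pair.split("=", 2);
            String value = nameValue.length > 1 ? nameValue[1] : "";
            rows.add(new String[] { path, nameValue[0], value });
        }
        return rows;
    }

    public static void main(String[] args) {
        String url = "/google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11"
                + "&ved=0CC4QFjAAOAo&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners";
        for (String[] row : split(url)) {
            System.out.println(row[1] + " = " + row[2]);
        }
    }
}
```

For the example URL, this yields seven rows (sa, rct, q, source, cd, ved, and url).  The sketch leaves values percent-encoded; if decoded values were needed, java.net.URLDecoder could be applied to each value as an extra step.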

10 comments:

  2. FYI Google's Guava library has functions you can use for url processing such as domain extraction.

    http://code.google.com/p/guava-libraries/wiki/GuavaExplained

    Replies
    1. Thanks Yash. To work with a third-party library like Guava, take a look at this post: http://bekwam.blogspot.com/2012/04/right-padding-string-with-talend-open.html.
  4. Hi, I want to pass a variable in the URL. I am using context.message as the variable, and the values are captured in it. I am using the tHttpRequest component.

    "https://api.telegram.org/bot322480:AETfC4RyKIGcDTrsKua0daUKORg/sendmessage?chat_id=323109827&text=+context.message+"

    But I am getting the literal text context.message printed, not the value of context.message.
    Any help?