JavaFX Tutorials

Tuesday, July 10, 2012

Parsing a URL with Talend Open Studio

If you need to process URLs with Talend Open Studio, a few well-placed components can break apart the URL parameters to be stored, converted, or filtered.

Given the following URL:

/google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11&ved=0CC4QFjAAOAo&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners

You can break the string apart with three Talend Open Studio components, producing a stream of name/value pairs. This screenshot shows such a job running.

Name / Value Pairs Extracted from a URL
Depending on requirements, additional columns can be carried through the processing to provide a business key (such as host or path).  This screenshot shows the job.

Job Parsing a URL - Two tExtractDelimitedFields and a tNormalize
Three components hack off various pieces of the URL.  First, a tExtractDelimitedFields_1 separates the host/path from the QUERY_STRING using the "?" delimiter.  Next, a tNormalize takes each name/value pair, forming a distinct row based on the "&" delimiter.  Finally, the second tExtractDelimitedFields_2 separates the name from the value, based on "=".
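
For readers who want to see the same three splits outside of a job, here is a minimal plain-Java sketch of the logic. It is not the code Talend generates, and the class name `UrlParamSplitter` and the method `split` are made up for illustration. Each output row is a `String[]` holding the carried-through path, the name, and the value.

```java
import java.util.ArrayList;
import java.util.List;

public class UrlParamSplitter {

    // Hypothetical helper: returns one {path, name, value} row per URL parameter.
    public static List<String[]> split(String url) {
        List<String[]> rows = new ArrayList<>();

        // Step 1 (like tExtractDelimitedFields_1): path vs. query string on "?".
        String[] parts = url.split("\\?", 2);
        String path = parts[0];
        if (parts.length < 2 || parts[1].isEmpty()) {
            return rows; // no query string, so there are no name/value rows
        }

        // Step 2 (like tNormalize): one row per "&"-delimited pair.
        for (String pair : parts[1].split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty segments such as a trailing "&"
            }
            // Step 3 (like tExtractDelimitedFields_2): name vs. value on "=".
            String[] nameValue = pair.split("=", 2);
            String value = nameValue.length > 1 ? nameValue[1] : "";
            rows.add(new String[] { path, nameValue[0], value });
        }
        return rows;
    }

    public static void main(String[] args) {
        String url = "/google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11"
                + "&ved=0CC4QFjAAOAo"
                + "&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners";
        for (String[] row : split(url)) {
            System.out.println(row[1] + " / " + row[2]);
        }
    }
}
```

As in the job described above, the values stay percent-encoded (for example `insights%20konsult`); the sketch only splits on the delimiters and does not decode anything.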

The tFilterColumns component is used for presentation purposes only; it removes the pre-processed "path" column.

Here are the component configurations, starting with the tFixedFlowInput component that provides the test data.

A tFixedFlowInput with a URL


tExtractDelimitedFields_1
tNormalize 
tExtractDelimitedFields_2
While some custom Java can be thrown into a tJavaRow, this blog post presents a cleaner alternative.  It's cleaner because it's based on the schema, rather than some Java code that could suffer a syntax error.

10 comments:

  1. FYI, Google's Guava library has functions you can use for URL processing, such as domain extraction.

    http://code.google.com/p/guava-libraries/wiki/GuavaExplained

    Replies
    1. Thanks Yash. To work with a third-party library like Guava, take a look at this post: http://bekwam.blogspot.com/2012/04/right-padding-string-with-talend-open.html.

  2. This comment has been removed by the author.

  3. Hi, I want to pass a variable in the URL. I am using context.message as the variable, and the values are captured in it. I am using the tHttpRequest component.

    "https://api.telegram.org/bot322480:AETfC4RyKIGcDTrsKua0daUKORg/sendmessage?chat_id=323109827&text=+context.message+"

    But I am getting the literal text context.message printed, not the value of context.message.
    Any help?
