If you need to process URLs with Talend Open Studio, a few well-placed components can break a URL apart into parameters that can be stored, converted, or filtered.
Given the following URL:
/google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11&ved=0CC4QFjAAOAo&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners
You can break the string apart with three Talend Open Studio components, producing a stream of name/value pairs. This screenshot shows such a job running.
Screenshot: Name / Value Pairs Extracted from a URL
Depending on requirements, additional columns can be carried through the processing to provide a business key (such as host or path). This screenshot shows the job.
Screenshot: Job Parsing a URL - Two tExtractDelimitedFields and a tNormalize
Three components hack off various pieces of the URL. First, a tExtractDelimitedFields_1 separates the host/path from the QUERY_STRING using the "?" delimiter. Next, a tNormalize takes each name/value pair, forming a distinct row based on the "&" delimiter. Finally, the second tExtractDelimitedFields_2 separates the name from the value, based on "=".
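For readers who want to see the same three splits outside of the job designer, here is a minimal sketch in plain Java. It is not the code Talend generates; the class name `UrlParamSplit` and the method `split` are made up for illustration, and the sketch assumes the same delimiters as the job ("?", "&", "="). The path is carried through on every row, the way an extra column would be.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Talend: mirrors the three-component job.
class UrlParamSplit {

    // Returns one row per parameter: {path, name, value}.
    static List<String[]> split(String url) {
        List<String[]> rows = new ArrayList<>();

        // Step 1 (like tExtractDelimitedFields_1): host/path vs. query string on "?"
        String[] parts = url.split("\\?", 2);
        String path = parts[0];
        String query = parts.length > 1 ? parts[1] : "";

        // Step 2 (like tNormalize): one row per name/value pair on "&"
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // no query string, or a stray "&"
            }
            // Step 3 (like tExtractDelimitedFields_2): name vs. value on "="
            String[] nv = pair.split("=", 2);
            String value = nv.length > 1 ? nv[1] : "";
            rows.add(new String[] { path, nv[0], value });
        }
        return rows;
    }

    public static void main(String[] args) {
        String url = "/google.se/url?sa=t&rct=j&q=insights%20konsult&source=web&cd=11"
                + "&ved=0CC4QFjAAOAo"
                + "&url=http%3A%2F%2Fwww.inuseinsights.se%2Fom-inuse-insights%2Fpartners";
        for (String[] row : split(url)) {
            System.out.println(row[1] + " = " + row[2]);
        }
    }
}
```

The values are left percent-encoded, as they are in the job. Decoding them would be a separate step.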
The tFilterColumns component is used for presentation purposes only; it removes the pre-processed "path" column.
Here are the component configurations, starting with the tFixedFlowInput component providing the test data.
Screenshot: A tFixedFlowInput with a URL
Screenshot: tExtractDelimitedFields_1
Screenshot: tNormalize
Screenshot: tExtractDelimitedFields_2
While some custom Java can be thrown into a tJavaRow, this blog post presents a cleaner alternative. It's cleaner because it's based on the schema, rather than some Java code that could suffer a syntax error.
Thanks man. A nice post :)
FYI Google's Guava library has functions you can use for URL processing such as domain extraction.
http://code.google.com/p/guava-libraries/wiki/GuavaExplained
Thanks Yash. To work with a third-party library like Guava, take a look at this post: http://bekwam.blogspot.com/2012/04/right-padding-string-with-talend-open.html.
Hi, I want to pass a variable in the URL. I am using context.message as the variable, and the values are captured in it. I am using the tHttpRequest component.
"https://api.telegram.org/bot322480:AETfC4RyKIGcDTrsKua0daUKORg/sendmessage?chat_id=323109827&text=+context.message+"
But I am getting the literal text context.message printed, not the value of context.message.
Any help?