Talend Open Studio for Big Data contains Big Data components not found in the more familiar Talend Open Studio for Data Integration. TOS for Big Data has a set of components to interact with the NoSQL database MongoDB. The MongoDB components are:
- tMongoDBConnection,
- tMongoDBInput,
- tMongoDBOutput, and
- tMongoDBClose
Metadata Repository
I was surprised to find the Metadata Repository missing in the TOS Big Data download. TOS Big Data is licensed under the permissive Apache license (at least as far as object code goes), whereas Data Integration uses the reciprocal GNU GPL. There isn't a source code link for Big Data on Talendforge.
So, I'm using a Built-in schema for the tFileInputRegex.
The Job
The Talend job to load a log file into MongoDB looks like many others I've posted for Data Integration. A database connection supplies the target host and credentials (tMongoDBConnection). This connection is closed with a tMongoDBClose.
An input source -- tFileInputRegex -- drives a tMongoDBOutput.
A Talend Big Data Job
tFileInputRegex uses regular expression groups to map chunks of input text into Talend schema columns. The following list breaks the expression down.
- "^" : At the start of the string
- "([1-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})" : Look for a group of four numbers containing one to three digits; the numbers except for the last will be separated by periods ('.')
- " - - " : The string literal " - - "
- "\\[(.*)\\]" : Everything in between [ and ]; the date part
- " " : A single space
- "\"(.*)\"" : Everything in between " and "; the request part
- ".*" : Everything else
All of the parts of interest -- the IP address, the date part, the request part -- are surrounded by regular expression group parentheses. These will map to columns in the following schema. Note that 'remainder' will not receive a value unless the "Everything else" expression is also surrounded by parentheses. The schema of the log file is synced with the tMongoDBOutput.
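Since a Talend job compiles down to Java, here is a minimal stand-alone sketch of the same matching. The sample log line and column names are illustrative, not taken from the job itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // The same pattern the tFileInputRegex component is configured with.
        Pattern p = Pattern.compile(
            "^([1-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}) - - \\[(.*)\\] \"(.*)\".*");

        // A hypothetical Apache-style access log line.
        String line = "192.168.1.10 - - [21/Jul/2012:10:00:00 -0500] "
            + "\"GET /index.html HTTP/1.1\" 200 2326";

        Matcher m = p.matcher(line);
        if (m.find()) {
            System.out.println("ip:       " + m.group(1)); // -> schema column 'ip'
            System.out.println("datePart: " + m.group(2)); // -> schema column 'datePart'
            System.out.println("request:  " + m.group(3)); // -> schema column 'request'
        }
    }
}
```

The trailing ".*" consumes the status code and byte count without capturing them, which is why 'remainder' stays empty.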
Schema of the Input Log File
tMongoDBOutput Configuration
In RDBMS work, there is usually DDL that needs to be run to define the target table. In Talend, the table can be created on-the-fly based on the "Action on table" setting in the component, or by a separate script run outside of Talend. For a NoSQL database like MongoDB, there is no concept of columns. Rather, when the input record is written out, all of its key / value pairs are inserted as the object stored in MongoDB.
This means that if I change the job -- say, I parse the date out to its component parts -- and re-run it, I can expect a differently-shaped record to be added to MongoDB. If I switch the tFileInputRegex schema to have day, month, and year rather than datePart, the newly-loaded records will differ in the properties available (day versus datePart).
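A minimal sketch of what this looks like at the driver level, assuming hypothetical connection, database, and collection names. Both document shapes land in the same collection, and no schema change is required.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ShapeDemo {
    public static void main(String[] args) {
        // Hypothetical connection string, database, and collection names.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> logs =
                client.getDatabase("weblogs").getCollection("access");

            // Shape produced by the original job: a single 'datePart' key.
            logs.insertOne(new Document("ip", "192.168.1.10")
                .append("datePart", "21/Jul/2012:10:00:00 -0500")
                .append("request", "GET /index.html HTTP/1.1"));

            // Shape produced after reworking the schema: day / month / year keys.
            // MongoDB accepts both shapes in the same collection.
            logs.insertOne(new Document("ip", "192.168.1.10")
                .append("day", 21).append("month", "Jul").append("year", 2012)
                .append("request", "GET /index.html HTTP/1.1"));
        }
    }
}
```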
More License Notes
A dialog like this seems to pop up a lot in Big Data. For example, dragging a tMongoDBOutput onto the canvas displays the following dialog.
Bridging the GNU / Apache License Differences
This was also displayed with a tMysqlOutput component.
If you're a Talend Open Studio for Data Integration developer, working with Big Data will be easier because of the familiar structure of the Talend job. Jobs are configured, coded, and error-checked using the same techniques. The Metadata Repository seems like it will be missed. To save off your schemas, export them to XML and re-import them as Built-in properties of the Big Data components.
Thanks for the wonderful post. It is definitely useful.
I have a different scenario in my case. I need to connect various input sources, where I will be getting different types of data: it can be a txt file, JSON, XML, etc. The input file size will sometimes be about 10 GB. I want to load this data as-is into MongoDB as a file. I don't want to parse the input and load it into MongoDB; instead, I want to store it as a raw data file. How do I do this using Talend, any edition of Big Data? Is this possible at all using an ETL tool?
Please let me know. It will be very helpful for me.
Thanks,
Vijay
Hi,
I haven't worked with this scenario, but maybe this link on storing large objects in MongoDB referencing "GridFS" will help your Talend searches.
http://blog.mongodb.org/post/183689081/storing-large-objects-and-files-in-mongodb
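For reference, a minimal sketch of the GridFS approach using the MongoDB Java driver. The connection string, database name, and file path are hypothetical; how to wire this into a Talend job is left to your own searching.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import org.bson.types.ObjectId;

import java.io.FileInputStream;
import java.io.InputStream;

public class GridFsUpload {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string, database name, and file path.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("rawfiles");
            GridFSBucket bucket = GridFSBuckets.create(db); // default "fs" bucket

            try (InputStream in = new FileInputStream("/data/input/bigfile.txt")) {
                // GridFS chunks the stream, so a 10 GB file
                // never has to fit in memory.
                ObjectId id = bucket.uploadFromStream("bigfile.txt", in);
                System.out.println("Stored with _id " + id);
            }
        }
    }
}
```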