In Talend Open Studio, you can apply processing directly to a spreadsheet in preparation for loading into a data mart. However, it's often better to do this type of work in the database itself. This is because of the flexibility and performance of running queries directly in the database process. For example, rather than gathering surrogate keys outside of the database using tMap and lookups from a file processing loop, load the data into the database and use a query containing high-performance joins for additional transformations.
Start by defining the main processing loop based off of tFileList. This will feed into a tMap component. The tMap component can provide some basic NULL-checking and data type validation, spooling any rejects off to reject table.
This is a Talend Open Studio job containing 3 subjobs that load a directory of spreadsheets into a MySQL table, leaving the results in place for additional processing.
|Talend Open Studio - A Typical ETL Staging Job|
While the tMysqlOutput components have an option to Clear or Truncate tables, we want to leave the records from each iteration in order to gain an economies of scale. That is, load every data file and then process the entire set in one large batch rather than flushing the contents out with every iteration. I use a single iteration of a tForeach (added a single element "once" to the list) to drive the two delete statements.
The tMap component contains logic for rejecting records. An alternate implementation would put this logic into a component like tFilterRow so that the specific failure condition could be tracked.
|Talend Open Studio - Map Only Valid Fields|
|Talend Open Studio - Mapping Reject|
Stage as String
The schemas for components like tFileInputExcel can use any simple data type available in Java: String, Integer, Date, Double. I find it useful to stage everything using very loose schemas where each field is a String. This has the following advantages
- Bring data into the system where it can be handled, and
- Allow for runtime processing related to Locales or conditional formatting.
Staging is used to load a database table where it is more convenient to analyze and process a record than a set of flat files. Once in the database, performant queries can be run without suffering the network latency of having to move data in and out of the database process. Talend Open Studio components provide the file processing and the data loading steps in this arrangement of subjobs.