Input and Results
A multi-valued attribute is a collection of values stored in a column. In the RDBMS world, it's common to normalize this information, storing the collection in a join. Such a table would look like this input text file.
A List of Services Codes Associated with a Hotel |
The desired output for the multi-valued attribute is to gather the service codes for a hotel, and output a single row for a hotel with all the services codes in one column. The following results show the target output. HOTEL1 started with 3 rows: HOTEL1, SWI; HOTEL1, TEN; HOTEL1, DRY. The rows are converted to a single row with the service code list column "SWI,TEN,DRY". The results demonstrate that the solution works across a range of inputs from a single value "DRY" for "HOTEL2" to the 4-record set of HOTEL5.
Hotels with Lists of Service Code Values |
Job Design
The following job shows an input file being routed into a tMap. The results are passed out to a tAggregateRow. A tSortRow is used to improve the presentation. Finally, a tLogRow prints the output, though any component could be used.
Job Producing Multi-Valued Attribute |
This screenshot shows the configuration of the input text file. Two columns -- HotelId and Service Code -- are passed along to the tMap.
Input File Configuration |
Most of the functionality for the job is in the tMap. The tMap is responsible for forming a comma-separated list of the service codes. The result of the tMap is an intermediate work product that gathers the services codes into a running concatenation CurrServiceCodeList. This variable is reset with each change in HotelId which is tracked with a state variable CurrHotelId.
The input must be sorted for this algorithm to work.
tMap Tracking Changes in Hotel Id Values |
tMap Variables
CurrHotelId is a variable and the most important part of the job. It is referring to 'out1.HotelId' which is the output value. However, that value carries over from the previous iteration (flows are compiled into iterations). It's accessible in the computation of CurrServiceCodeList.
The screenshot does not show the variable expression for CurrServiceCodeList clearly, so it is printed below.
(!row1.HotelId.equals(Var.CurrHotelId))?row1.ServiceCode:Var.CurrServiceCodeList + "," + row1.ServiceCode
If the input is different than CurrHotelId (including the first record), then CurrServiceCodeList is the one-element list row1.ServiceCode. If the input HotelId is the same as CurrHotelId, then concatenation occurs. Remember that the tMap is an intermediate product, so for those hotels with more than one record, there will still be more than one record output. However, the last of these records will have a CurrServiceCodeList that has accrued all of the values.
tAggregateRow
The tAggregateRow is used to step down the number of rows. The tMap passes along intermediate values: HOTEL1: SWI, HOTEL1: SWI,TEN, HOTEL1: SWI,TEN,DRY. But we only care about the record with all of the concatenated values. The aggregation relies on a "max" computation on the strings to select one of the values over the others. The MAX is taking the record with the longest ServiceCodeList, assuming that that's the one with all the values.
Stepping Down the Number of Rows with a tAggregate |
This is a Talend-centric solution. While there is a complex Java expression in a tMap variable, the job itself is cordoned off from any complex coding. This job kept track of the state of processing using a tMap variable. An intermediate product was built and then adjusted later (tAggregateRow). Something like this would work well with applications that store more in a column than a single data value, like a list of properties.
UPDATE:
This is a version that does away with the Java in the tMap Variables. If you get a NullPointerException on the tDenormalizeSortedRow, make sure that a tSortRow precedes the component.
Multi Valued Attribute with tDenormalize |
that's the tDenormalize job no?
ReplyDeleteYou're right! I added a better version under 'UPDATE'. Thanks Yannick.
DeleteHI all ,
ReplyDeleteI think above scenario can be done by the below steps also ,
tfileinputDelimited -- > tAggregatorrow -> tlogRow.
The only configuration you need to do is in tAggregator , make groupby as HotelID and in operation part select service code only and take the list for the function . you will achieve the resultant.
Regards
Aby
Yes It is possible with t_aggregate row,tdenormalize,tdenormalizesortedrow, and taggrregatesorted row.But the way which you have explained is also awsome.
ReplyDeleteThanks Carl.