Difference between aggregator with sorted input and without sorted input

jawsnnn picture jawsnnn · Jul 17, 2015 · Viewed 7.6k times · Source

My question is primarily theoretical. When we use an aggregator in a Informatica workflow and do not select the sorted input property, I assume the aggregator transformation sorts the data based on the group by ports and then does the aggregation. To improve the performance of this operation, it is often suggested to turn the sorted input property on, and pass the output of a sorter transformation (with keys = group by ports) to the aggregator. My question is: From the perspective of the map, the number of operations is similar in both cases: Sort the data, then Aggregate it. Why is it then that the sorter + Aggregator (Sorted input) runs much faster than a single Aggregator (no sorted input)?

Answer

Samik picture Samik · Jul 17, 2015

It is not correct that aggregator transformation sorts the data before aggregation. It reads each row and stores the data for each group until it reads the entire source, because it cannot be sure where each group ends. Only after reading all the rows, it can perform the aggregation. Consequently, it uses more data cache.

When you use sorted input option, integration service assumes the data to be sorted by group, and performs aggregate calculations as soon as it reads all the rows for a group.

Performance improvement using sorted input is more significant when you have configured the session with multiple partitions.

Also, sorted input cannot be used, when you have nested aggregate function in the aggregator or the session uses incremental aggregation. Maybe, that is the reason aggregator does not automatically sort the data.