InfoSphere Information Server How To: Caching 'Time Saver' in DataStage Transformer

This blog is about simplifying even the most complicated data integration challenges. When we can do that and also make data processing more efficient, we get the best of both worlds. The new cache mechanism serves both of those goals.

The Transformer Cache is an in-memory storage mechanism that is available from within the Transformer stage and is used to help solve complex data integration scenarios. The cache is a first-in/first-out (i.e. FIFO) construct and is accessible to the developer via two new functions:

SaveInputRecord: stores an input row at the back of the cache

GetInputRecord: retrieves a saved row from the front of the cache
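
To make the FIFO behavior concrete, here is a minimal Python sketch of the idea. It is only a model of the cache, not DataStage code: the class name TransformerCacheModel and the method names are hypothetical stand-ins for the two functions above, and returning the row count from the save call is a convenience of this sketch.

```python
from collections import deque


class TransformerCacheModel:
    """Toy model of a FIFO row cache (hypothetical, not the DataStage API)."""

    def __init__(self):
        self._rows = deque()

    def save_input_record(self, row):
        # Store the row at the back of the cache.
        self._rows.append(row)
        # Sketch convenience: report how many rows are now cached.
        return len(self._rows)

    def get_input_record(self):
        # Retrieve (and remove) the row at the front of the cache.
        return self._rows.popleft()


cache = TransformerCacheModel()
cache.save_input_record({"id": 1})
cache.save_input_record({"id": 2})
first = cache.get_input_record()   # the row saved first comes out first
second = cache.get_input_record()
```

Rows come back in the order they were saved, which is what lets a group of records be replayed after it has been examined as a whole.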

In most cases, these functions should be called from the stage variable or loop variable sections of the Transformer. Developers will find the cache most useful when a set of records needs to be analyzed as a single unit and a result derived from that set then has to be appended to each record in the group.

Here are a few scenarios, discussed by Tony in detail, where using a cache will prove VERY helpful:

The input data set is sorted by fund id and valuation date in ascending order. We have an unknown number of records for each fund. The requirement is to output the five most recent valuations for each fund; if a fund has fewer than five, output none of its records.
There is a varying number of clients (N) related to each salesperson. The requirement is to label each such client detail record with a label that reads "1 of N".
An input file contains multiple bank accounts for each customer. The requirement is to show the percentage of the total balance for each individual account record.

Perhaps one or more of these sounds familiar to you. You may also refer to the Information Server InfoCenter for more detail on this solution.
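
As an illustration of the bank account scenario, the following Python sketch shows the general pattern: hold each customer's rows in a FIFO cache while accumulating the total, then release the rows with the percentage appended. This is a simulation of the logic only, not Transformer syntax. The function names, the tuple layout (customer, account, balance), and the assumption that the input is already grouped by customer are all choices made for this example.

```python
from collections import deque


def drain(cache, total):
    """Empty the cache front to back, appending each row's share of total."""
    out = []
    while cache:
        customer, account, balance = cache.popleft()
        share = 100.0 * balance / total if total else 0.0
        out.append((customer, account, balance, round(share, 1)))
    return out


def add_balance_share(rows):
    """rows: (customer, account, balance) tuples, assumed grouped by customer."""
    cache = deque()
    total = 0.0
    result = []
    for row in rows:
        customer, _account, balance = row
        # A new customer key means the cached group is complete.
        if cache and cache[0][0] != customer:
            result.extend(drain(cache, total))
            total = 0.0
        cache.append(row)      # save the row at the back of the cache
        total += balance       # accumulate the group total
    result.extend(drain(cache, total))  # release the final group
    return result


accounts = [
    ("C1", "A1", 100.0),
    ("C1", "A2", 300.0),
    ("C2", "B1", 50.0),
]
shares = add_balance_share(accounts)
# shares == [("C1", "A1", 100.0, 25.0), ("C1", "A2", 300.0, 75.0), ("C2", "B1", 50.0, 100.0)]
```

The same save-then-replay shape fits the "1 of N" scenario as well: count the rows while caching them, then stamp N on each row as it comes back out.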

-Ritesh

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.