Wednesday, May 25, 2011

DataStage Parallel lookup types

ETL tools are essentially about accessing various sources and transforming the data. During this process they have to perform various actions, including lookups. From a DataStage perspective, jobs can have many sources of reference data for lookups, including database tables, flat files and many native data sources. The aim is to identify the fastest solution. In DataStage Server Jobs it is quite simple: local hash files are the fastest method for a key-based lookup, provided the time taken to build the hash file does not wipe out the benefit of using it.
But for Parallel Jobs there is a very large number of stages that can be used as a lookup, a much wider variety than in Server Jobs. This includes most data sources and the parallel staging formats of datasets and lookup filesets. Even if we discount database lookups, since the overhead of database connectivity and any network passage makes them slower than most local storage, that still leaves several local options to compare.
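
To make the hash-file trade-off described above for Server Jobs concrete, here is a minimal sketch in Python. It is not DataStage code: the function names, the `cust_id` key and the generated rows are all hypothetical, and a Python dictionary merely stands in for a hash file. It only illustrates the idea that a key-based lookup has a one-off build cost followed by a cheap probe per input row.

```python
import time


def build_lookup(reference_rows, key):
    """Build an in-memory hash index on `key` (stand-in for building a hash file)."""
    return {row[key]: row for row in reference_rows}


def lookup_join(input_rows, index, key):
    """Probe the index once per input row; unmatched rows get None as the reference."""
    return [(row, index.get(row[key])) for row in input_rows]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical data: 100,000 reference rows and 10,000 input rows.
reference = [{"cust_id": i, "name": f"customer-{i}"} for i in range(100_000)]
inputs = [{"cust_id": i * 13, "amount": i} for i in range(10_000)]

index, build_secs = timed(build_lookup, reference, "cust_id")
joined, probe_secs = timed(lookup_join, inputs, index, "cust_id")
matched = sum(1 for _, ref in joined if ref is not None)

print(f"build: {build_secs:.4f}s  probe: {probe_secs:.4f}s  matched: {matched}")
```

If the build time dominates the probe time for your row volumes, the hash-style lookup may not pay for itself, which is the caveat raised above. Actual timings depend entirely on the machine and data, so none are quoted here.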

Here is a detailed comparison performed by Vincent McBurney, testing datasets, sequential files and lookup filesets against increasing row volumes to see how they responded. The test had three jobs, each with a sequential file input stage and a reference stage writing to a copy stage.
DataStage tip for beginners - parallel lookup types

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.
