Wednesday, May 25, 2011

DataStage server v enterprise: Performance stats

Here are a few performance tests by Vincent McBurney comparing DataStage server jobs against parallel jobs running on the same machine and processing the same data. The results are interesting; do refer to them in your free time.
DataStage server v enterprise: some performance stats

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.

DataStage Parallel lookup types

ETL tools are all about accessing various sources and transforming the data. During this process they have to perform various actions, including lookups. From a DataStage perspective, jobs can have many sources of reference data for lookups, including database tables, flat files and many native data sources. The aim is to identify the fastest solution. In DataStage server jobs it is quite simple: local hash files are the fastest method for a key-based lookup, provided the time taken to build the hash file does not wipe out the benefit of using it.
For parallel jobs, however, a very large number of stages can be used as a lookup, a much wider variety than in server jobs. This includes most data sources and the parallel staging formats of datasets and lookup filesets. Even if we discount database lookups, since the overhead of database connectivity and any network passage makes them slower than most local storage, that still leaves several local options to compare.
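
The hash-file trade-off described for server jobs can be illustrated with a small sketch in plain Python. This is not DataStage code and does not show how DataStage implements hash files; the reference data, key names and helper functions are all made up for illustration. It only shows the general idea: a keyed structure costs one pass to build, after which each lookup is a direct access, whereas a sequential scan needs no build step but walks the rows for every lookup.

```python
# Hypothetical reference data: (customer_id, customer_name) pairs.
reference_rows = [(i, f"customer_{i}") for i in range(1000)]

# Hypothetical input keys; 1234 has no match in the reference data.
input_keys = [5, 999, 42, 1234]


def scan_lookup(key, rows):
    """Sequential scan: check rows one by one until the key is found."""
    for row_key, value in rows:
        if row_key == key:
            return value
    return None


def build_hash(rows):
    """One-time build step: load all reference rows into a keyed structure."""
    return {row_key: value for row_key, value in rows}


# Build once, then reuse for every input row.
hash_table = build_hash(reference_rows)

scan_results = [scan_lookup(k, reference_rows) for k in input_keys]
hash_results = [hash_table.get(k) for k in input_keys]

print(hash_results)
```

Both approaches return the same answers; the difference is where the time goes. With only a handful of input rows the build pass may cost more than it saves, which is the caveat about build time noted above.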

Here is a detailed comparison by Vincent McBurney of datasets, sequential files and lookup filesets, run with increasing row volumes to see how each responded. The test had three jobs, each with a sequential file input stage and a reference stage writing to a copy stage.
DataStage tip for beginners - parallel lookup types


Tuesday, May 10, 2011

IBM InfoSphere DataStage Performance Tuning

ETL processes are always complex and consume both resources and time. This comes down to the complexity of the business logic, the amount of data processed, and the variety of data sets and data sources, all of which grow regularly. Even so, the SLA usually stays as it was decided initially, without considering the current rate of data growth. Because data and information need to be provided on time, performance is a key element in the success of any BI and DW project. Performance tuning and configuration of various parameters are key to meeting the agreed SLA and providing timely information, and they need appropriate attention during the DW and ETL development process. With InfoSphere DataStage this is not an easy task; no tuning is straightforward.

The link below discusses various aspects that need to be considered during performance tuning.

IBM InfoSphere DataStage Performance Tuning: Overview of Best Practices

-Ritesh