Thursday, June 30, 2011

Optimal Processing of Data ... Lots of Data with InfoSphere DataStage - Series 1

What makes an ETL tool valuable to an enterprise is its ability to process large volumes of data in a short period of time. That throughput depends on every aspect of the flow and the environment being optimized for maximum performance, and it is the flow design that needs to be considered first. Performance tuning and optimization are iterative processes that begin with job design and unit tests, proceed through integration and volume testing, and continue throughout the production life cycle of the application.
Although plenty of documentation and information on optimal design is available, here are a few basic performance pointers for IBM InfoSphere DataStage.
  • If intermediate results are shared only between parallel jobs, use persistent data sets (Data Set stages). Ensure that data partitioning and sort order are retained at every stage, and avoid format conversion or serial I/O, both of which hurt performance.
  • Data Sets can also serve as checkpoints, that is, restart points in the event that a job or sequence needs to be rerun. Keep in mind that Data Sets are platform- and configuration-specific.
  • Depending on available system resources, overall processing time can be reduced by executing jobs concurrently. At design time, pay attention to data-arrival and re-processing requirements when structuring the flow.
  • Multiple parallel configuration files can be maintained. The configuration file selected at run time determines the degree of parallelism and the resources used by parallel jobs. Different configuration files should be used in development, test, and production to match job characteristics to the available hardware resources and optimize overall throughput.
  • Careful configuration of scratch and resource disks, along with the underlying file system and physical hardware architecture, significantly affects overall job performance.
  • Within clustered ETL and database environments, resource-pool naming can be used to limit processing to specific nodes, including database nodes when appropriate. 
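As an illustration of the last two points, a minimal two-node parallel configuration file might look like the sketch below. The host name, paths, and pool name are hypothetical; substitute the values for your own environment. The `pools` entries are what enable the resource-pool naming mentioned above, and pointing a job's configuration-file parameter at a larger variant of this file (say, eight nodes in production) changes the degree of parallelism without redesigning the job.

```
{
  node "node1"
  {
    fastname "etlserver1"
    pools ""
    resource disk "/ds/resource1" {pools ""}
    resource scratchdisk "/ds/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etlserver1"
    pools "" "db"
    resource disk "/ds/resource2" {pools ""}
    resource scratchdisk "/ds/scratch2" {pools ""}
  }
}
```

Here "node2" additionally belongs to a hypothetical "db" pool, so stages constrained to that pool would execute only there, for example to keep database-bound processing on the node closest to the database.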
I will cover other aspects of job design in the next post in this series.
 -Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions
