Thursday, June 30, 2011

Optimal Processing of Data: Design Tips with InfoSphere DataStage - Series 3

Here are some basic tips for designing good performance into a DataStage job.

1.      Avoid Type Conversions
Any type conversion can lead to multiple conversions, so retrieve data in the desired format and keep it that way. Take care of data types and any required conversions at design time. During your development cycle, use the OSH_PRINT_SCHEMAS environment variable to verify that runtime schemas match the job design column definitions. If you use stage variables on a Transformer stage, ensure their data types match the expected result types.
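The schema check described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not a parser for the actual OSH_PRINT_SCHEMAS log output: the dictionaries, the column names, the type strings and the schema_mismatches helper are all hypothetical, and you would fill them in from your own job design and job log.

```python
def schema_mismatches(design, runtime):
    """Return (column, design_type, runtime_type) for every column whose
    type differs, or that is present in only one of the two schemas."""
    mismatches = []
    for column in sorted(set(design) | set(runtime)):
        design_type = design.get(column)    # None if the column is missing
        runtime_type = runtime.get(column)  # None if the column is missing
        if design_type != runtime_type:
            mismatches.append((column, design_type, runtime_type))
    return mismatches


# Hypothetical schemas, copied by hand from the job design and the job log.
design_schema = {"cust_id": "int32", "amount": "decimal[10,2]", "created": "date"}
runtime_schema = {"cust_id": "int32", "amount": "string[max=20]", "created": "date"}

for column, design_type, runtime_type in schema_mismatches(design_schema, runtime_schema):
    print(f"{column}: design={design_type} runtime={runtime_type}")
```

Any column reported here is a candidate for an implicit conversion at run time and is worth fixing in the design.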

2.      Wise Usage of Transformer Stages
Consider merging multiple stages if their functionality can be incorporated into a single stage, and where the requirement allows, use other stage types to perform simple transformation operations.

3.      Optimal use of "Sort"
Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types. 

4.      Keep Columns only if required
If a column is not required, remove it as soon as it has served its purpose. Every unused column requires additional buffer memory and data transfer, which makes each row transfer from one stage to the next more expensive and can hurt performance. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.
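To make the select-list idea concrete, here is a small sketch using Python's built-in sqlite3 module in place of a real source database. The orders table, its columns and the read_columns helper are made up for illustration; in a DataStage job the equivalent would be the SELECT statement you supply to the database stage.

```python
import sqlite3


def read_columns(conn, table, columns):
    """Read only the named columns instead of SELECT *.
    Table and column names are assumed to be trusted identifiers."""
    select_list = ", ".join(columns)
    return conn.execute(f"SELECT {select_list} FROM {table}").fetchall()


# Hypothetical table with a wide free-text column the job never uses.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, notes TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "acme", "long free text", 10.5), (2, "globex", "more free text", 20.0)],
)

narrow_rows = read_columns(conn, "orders", ["order_id", "amount"])
wide_rows = conn.execute("SELECT * FROM orders").fetchall()

print(narrow_rows)
print(len(wide_rows[0]), "columns per row versus", len(narrow_rows[0]))
```

The narrow read carries two values per row through every later stage instead of four, which is the saving the tip is after.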

5.      Avoid Same Partitioning if accessing Sequential File
Unless more than one source file is specified, using Same partitioning will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless the data is explicitly repartitioned.
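The effect can be simulated with a toy model in Python. This is not DataStage code: the same_partition and round_robin functions are simplified stand-ins for the partitioners, and the single-reader Sequential File is modelled as one upstream partition holding all the rows.

```python
def same_partition(upstream, num_partitions):
    """Same: every record stays in the partition it arrived in."""
    downstream = [[] for _ in range(num_partitions)]
    for index, partition in enumerate(upstream):
        downstream[index].extend(partition)
    return downstream


def round_robin(upstream, num_partitions):
    """Round robin: records are dealt out to the partitions in turn."""
    downstream = [[] for _ in range(num_partitions)]
    count = 0
    for partition in upstream:
        for record in partition:
            downstream[count % num_partitions].append(record)
            count += 1
    return downstream


rows = list(range(8))
sequential_read = [rows]  # one reader, so one upstream partition holds everything

same_result = same_partition(sequential_read, 4)
repartitioned = round_robin(sequential_read, 4)

print([len(p) for p in same_result])    # [8, 0, 0, 0]
print([len(p) for p in repartitioned])  # [2, 2, 2, 2]
```

With Same, three of the four partitions stay empty, so the downstream stages effectively run on one node. An explicit repartition spreads the rows and restores parallelism.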

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions
