Wednesday, November 30, 2011

DataStage Functions and Routines

DataStage BASIC functions:
A function performs mathematical or string manipulations on the arguments supplied to it and returns a value. Some functions take no arguments; most take one or more. Arguments are always enclosed in parentheses and separated by commas, as shown in this general syntax: FunctionName(argument, argument). These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used to get status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines.
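As an illustration, a minimal job control routine in DataStage BASIC might attach a downstream job, run it, and check how it finished. The job name "TargetJob" is a placeholder; the DSAttachJob, DSRunJob, DSWaitForJob, DSGetJobInfo, and DSDetachJob calls are part of the documented job control interface:

```
* Attach the job we want to control, aborting this routine on fatal errors
hJob = DSAttachJob("TargetJob", DSJ.ERRFATAL)

* Start the job with normal run semantics and wait for it to finish
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)

* Query the finishing status of the job we just ran
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
   Call DSLogWarn("TargetJob did not finish cleanly", "JobControl")
End

* Release the job handle
ErrCode = DSDetachJob(hJob)
```

A routine like this is typically pasted into the Job Control tab of a sequencing job's properties.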

DataStage Routines:
DataStage Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them. The following programming components are classified as routines: Transform functions, Before/After subroutines, Custom UniVerse functions, ActiveX (OLE) functions, and Web Service routines.
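For example, a transform function (the kind stored under the Routines branch and called from a Transformer stage) is simply a BASIC routine whose return value is assigned to the special variable Ans. The routine name and argument below are illustrative:

```
* TrimAndUpper(Arg1): illustrative transform function that removes
* leading/trailing spaces from the input and converts it to upper case.
Ans = UpCase(Trim(Arg1))
```

Once saved and compiled in the Repository, such a routine can be invoked from a Transformer stage derivation like any built-in function.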

Here are a few functions that provide various features to development and production teams for monitoring and executing DataStage jobs.
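For instance, the status-reporting functions can be called from within a running job itself. DSJ.ME refers to the current job; DSGetJobInfo and DSLogInfo are part of the documented API, while the routine name "StatusCheck" is just a placeholder:

```
* Get the name and current status code of the running job
JobName = DSGetJobInfo(DSJ.ME, DSJ.JOBNAME)
Status  = DSGetJobInfo(DSJ.ME, DSJ.JOBSTATUS)

* Write an informational message to the job log
Call DSLogInfo("Job " : JobName : " status code: " : Status, "StatusCheck")
```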



-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Sunday, November 13, 2011

Handling Large Sequential File Data in Parallel with InfoSphere DataStage

When handling huge volumes of data, sequential files (and, from the DataStage perspective, the Sequential File stage) can become a major bottleneck, as reading from and writing to this stage is slow. Certainly do not use sequential files for intermediate storage between jobs: this causes performance overhead, because the data must be converted before it is written to and after it is read from a file. Instead, use the Data Set stage for intermediate storage between jobs.
Datasets are key to good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order. No repartitioning or import/export conversions are needed.
In order to read faster from the Sequential File stage, the number of readers per node can be increased (the default value is one). This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node).
This is an optional property that applies only to files containing fixed-length records, but it provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node and written to separate partitions. This method can result in better I/O performance on an SMP (Symmetric Multi Processing) system.


You can also specify that a single file be read by multiple nodes. This too is an optional property that applies only to files containing fixed-length records. Set this option to "Yes" to allow individual files to be read by several nodes. This can improve performance on cluster systems.
IBM DataStage knows the number of nodes available and, using the fixed-length record size and the actual size of the file to be read, allocates to the reader on each node a separate region of the file to process. The regions are of roughly equal size.
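As a rough illustration of the arithmetic involved (an assumed sketch, not IBM's actual implementation): with a 4,000,000-byte file of 100-byte fixed-length records read by 4 nodes, each reader would be assigned a contiguous region of about 10,000 records. In DataStage BASIC terms, where NodeNumber (0 to 3) is a hypothetical identifier for the reading node:

```
FileSize   = 4000000   ;* total file size in bytes (assumed)
RecordSize = 100       ;* fixed-length record size in bytes (assumed)
Readers    = 4         ;* number of reading nodes (assumed)

* Records per reader, then the byte offset where this reader starts
RecsPerReader = Int((FileSize / RecordSize) / Readers)
StartOffset   = NodeNumber * RecsPerReader * RecordSize
```

This is why the option requires fixed-length records: the engine can only compute clean region boundaries when every record occupies the same number of bytes.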
