When handling huge volumes of data, sequential files, and from a DataStage perspective the Sequential File stage, can become a major bottleneck, because reading from and writing to this stage is slow. Do not use sequential files for intermediate storage between jobs: this causes performance overhead, since the data must be converted before it is written to and after it is read from the file. Use Data Set stages instead for intermediate storage between jobs.
Datasets are key to good performance in a set of linked jobs. They help achieve end-to-end parallelism by writing data in partitioned form and maintaining the sort order, so no repartitioning or import/export conversions are needed.
To read faster from the Sequential File stage, the number of readers per node can be increased (the default is one). This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to run sequentially on the conductor node).
This is an optional property that applies only to files containing fixed-length records, but it provides a way of partitioning the data contained in a single file. Each node reads a single file, and that file can be divided according to the number of readers per node and written to separate partitions. This method can result in better I/O performance on an SMP (Symmetric Multiprocessing) system.
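The per-reader behavior can be illustrated with a small sketch: each reader seeks to its assigned byte offset within the fixed-length-record file and consumes only its own slice of records. The `read_region` helper, the record size, and the region bounds here are illustrative assumptions, not DataStage internals:

```python
import io

RECORD_SIZE = 80  # bytes per fixed-length record (assumed for illustration)

def read_region(f, offset: int, length: int, record_size: int = RECORD_SIZE):
    """Read one reader's region of a fixed-length-record file,
    yielding individual records."""
    f.seek(offset)
    remaining = length
    while remaining > 0:
        record = f.read(record_size)
        if len(record) < record_size:
            break  # truncated file; stop cleanly
        yield record
        remaining -= record_size

# Example with an in-memory "file" of 4 records (AAAA.., BBBB.., CCCC.., DDDD..);
# this reader is assigned the second half of the file.
data = b"".join(bytes([65 + i]) * RECORD_SIZE for i in range(4))
f = io.BytesIO(data)
records = list(read_region(f, offset=2 * RECORD_SIZE, length=2 * RECORD_SIZE))
```

Because every record has the same length, a reader never needs to scan for record boundaries; it can jump straight to its offset, which is why this option is restricted to fixed-length records.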
You can also specify that a single file be read by multiple nodes. This too is an optional property that applies only to files containing fixed-length records. Set this option to "Yes" to allow individual files to be read by several nodes, which can improve performance on cluster systems.
IBM DataStage knows the number of nodes available and, using the fixed-length record size and the actual size of the file to be read, allocates the reader on each node a separate region within the file to process. The regions are of roughly equal size.
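The region allocation described above can be sketched as follows. This is not DataStage's actual implementation; the `allocate_regions` helper, record size, and reader count are assumptions used to show how a file of fixed-length records divides into record-aligned regions of roughly equal size:

```python
def allocate_regions(file_size: int, record_size: int, n_readers: int):
    """Split a fixed-length-record file into record-aligned regions,
    one per reader, of roughly equal size.
    Returns a list of (byte_offset, byte_length) pairs."""
    if file_size % record_size != 0:
        raise ValueError("file size must be a whole number of records")
    total_records = file_size // record_size
    base, extra = divmod(total_records, n_readers)
    regions = []
    offset = 0
    for i in range(n_readers):
        # The first `extra` readers take one extra record to absorb the remainder.
        count = base + (1 if i < extra else 0)
        regions.append((offset * record_size, count * record_size))
        offset += count
    return regions

# Example: 1000 records of 80 bytes split across 4 readers.
regions = allocate_regions(1000 * 80, 80, 4)
# -> [(0, 20000), (20000, 20000), (40000, 20000), (60000, 20000)]
```

Aligning region boundaries to the record size is what makes the fixed-length restriction necessary: with variable-length records, a byte offset computed this way could land in the middle of a record.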