InfoSphere DataStage has three stages that can be used to create files on the system: the File Set stage, the Data Set stage, and the Sequential File stage.
File Set Stage: InfoSphere® DataStage® can generate and name exported
files, write them to their destination, and list the files it has generated
in a file whose extension is, by convention, .fs. The data files and the file
that lists them are called a file set. This capability is useful because
some operating systems impose a 2 GB limit on the size of a file and you need
to distribute files among nodes to prevent overruns. The amount of data that can be stored in each destination
data file is limited by the characteristics of the file system and the amount
of free disk space available. The number of files created by a file set depends
on:
- The number of processing nodes in the default node pool
- The number of disks in the export or default disk pool connected
to each processing node in the default node pool
- The size of the partitions of the data set
File sets also have these characteristics:
- A file set preserves the partitioning scheme, so the data can be viewed in the order defined by that scheme.
- File sets carry formatting information that describes the format of the files to be read or written.
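The idea of a file set (several data files plus one file that lists them) can be illustrated with a minimal Python sketch. This is not the real DataStage .fs layout. The function names, the ".partNNNN" file naming, and the round-robin distribution are assumptions made only for illustration.

```python
import os
import tempfile


def write_file_set(rows, num_partitions, out_dir, name):
    """Write rows round-robin into one data file per partition, plus a listing file.

    Hypothetical layout: <name>.part0000, <name>.part0001, ... and <name>.fs.
    """
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)

    data_paths = []
    for p, part_rows in enumerate(partitions):
        path = os.path.join(out_dir, f"{name}.part{p:04d}")
        with open(path, "w", encoding="utf-8") as f:
            for row in part_rows:
                f.write(row + "\n")
        data_paths.append(path)

    # The listing file only records which data files belong to the set.
    descriptor = os.path.join(out_dir, f"{name}.fs")
    with open(descriptor, "w", encoding="utf-8") as f:
        for path in data_paths:
            f.write(path + "\n")
    return descriptor


def read_file_set(descriptor):
    """Return the rows of each partition, in the order the listing file names them."""
    with open(descriptor, encoding="utf-8") as f:
        data_paths = [line.rstrip("\n") for line in f if line.strip()]
    result = []
    for path in data_paths:
        with open(path, encoding="utf-8") as f:
            result.append([line.rstrip("\n") for line in f])
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fs = write_file_set(["a", "b", "c", "d", "e"], 2, tmp, "customers")
        print(read_file_set(fs))
```

Because the listing file records the data files in partition order, a reader gets the data back partition by partition without having to redistribute it.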
Sequential File Stage: It is a file stage which allows you to read data from or write data to
one or more flat files. The stage can have a single input link or a single
output link, and a single rejects link.
The stage executes in parallel mode if reading multiple files
but executes sequentially if it is only reading one file. By default a complete
file will be read by a single node (although each node might read more than
one file). For fixed-width files you can configure the stage to
behave differently:
- You can specify that single files can be read by multiple
nodes. This can improve performance on cluster systems.
- You can specify that a number of readers run on a single
node. This means that a single file can be partitioned as it
is read (even though the stage is constrained to running sequentially on the
conductor node).
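Why fixed-width files lend themselves to multiple readers can be shown with a small Python sketch: when every record has the same length, each reader's byte range can be computed arithmetically, without scanning for row delimiters. The helper names and the even split below are assumptions for illustration. How DataStage actually divides the file among readers may differ.

```python
import os
import tempfile


def reader_ranges(file_size, record_length, num_readers):
    """Split a fixed-width file into contiguous byte ranges, one per reader."""
    if file_size % record_length != 0:
        raise ValueError("file size is not a whole number of records")
    total_records = file_size // record_length
    base, extra = divmod(total_records, num_readers)

    ranges = []
    start_record = 0
    for r in range(num_readers):
        # The first `extra` readers take one additional record each.
        count = base + (1 if r < extra else 0)
        start = start_record * record_length
        end = (start_record + count) * record_length
        ranges.append((start, end))
        start_record += count
    return ranges


def read_range(path, start, end, record_length):
    """Read the records in [start, end) and return them as a list of bytes objects."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fixed.dat")
        with open(path, "wb") as f:
            for i in range(10):
                f.write(f"{i:04d}".encode("ascii"))  # ten 4-byte records
        size = os.path.getsize(path)
        for start, end in reader_ranges(size, 4, 3):
            print(start, end, read_range(path, start, end, 4))
```

Each range starts and ends on a record boundary, so the readers never split a record between them.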
The stage executes in parallel if writing to multiple files,
but executes sequentially if writing to a single file. Each node writes to
a single file, but a node can write more than one file.
When reading or writing a flat file, InfoSphere® DataStage® needs to know something
about the format of the file. The information required is how the file is
divided into rows and how rows are divided into columns.
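The two pieces of format information mentioned above (how the file splits into rows, and how rows split into columns) can be sketched in Python as a record delimiter and a field delimiter. The function name, the default delimiters, and the column names in the example are assumptions for illustration, not DataStage settings. Rows with the wrong number of fields are collected separately here as one possible way to handle records that do not match the format.

```python
def parse_flat_file(text, column_names, record_delimiter="\n", field_delimiter=","):
    """Split text into rows, then rows into named columns.

    Returns (rows, rejects): rows are dicts keyed by column name,
    rejects are the raw records whose field count did not match.
    """
    rows, rejects = [], []
    for record in text.split(record_delimiter):
        if record == "":
            continue  # skip the empty piece after a trailing record delimiter
        fields = record.split(field_delimiter)
        if len(fields) == len(column_names):
            rows.append(dict(zip(column_names, fields)))
        else:
            rejects.append(record)
    return rows, rejects


if __name__ == "__main__":
    sample = "1,Ann,Oslo\n2,Bob\n3,Cy,Rome\n"
    good, bad = parse_flat_file(sample, ("id", "name", "city"))
    print(good)
    print(bad)
```

Changing only the two delimiter arguments is enough to read, for example, a pipe-delimited file with the same column names.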
Data Set Stage: It is also a file stage which allows you to read data from or write data to a data set.
The stage can have a single input link or a single output link. It can be
configured to execute in parallel or sequential mode. The Data Set stage allows you to store data being operated on in a persistent
form, which can then be used by other InfoSphere® DataStage® jobs. Data sets are operating
system files, each referred to by a control file, which by convention has
the suffix .ds. Using data sets wisely can be key to good performance in a
set of linked jobs.
- It preserves partitioning and stores data on the nodes, which avoids repartitioning the data.
- It stores data in the internal format of DataStage and takes less time to read/write from ds to any other