Saturday, April 21, 2012

InfoSphere DataStage File Set, Data Set and Sequential File: Are They the Same?

InfoSphere DataStage has three different stages that can be used to create files on the system: the File Set stage, the Data Set stage, and the Sequential File stage.

File Set Stage: InfoSphere® DataStage® can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are together called a file set (a quick way to inspect the control file from the shell is sketched below). This capability is useful because some operating systems impose a 2 GB limit on the size of a single file, and because distributing files among nodes prevents such overruns. The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on:
  • The number of processing nodes in the default node pool
  • The number of disks in the export or default disk pool connected to each processing node in the default node pool
  • The size of the partitions of the data set
In addition, a file set preserves the partitioning scheme, so the data can be viewed in the order that scheme defines, and file sets carry formatting information that describes the format of the files to be read or written.
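As a quick illustration (the path and file name here are hypothetical), the .fs control file is ordinary text, so it can be inspected straight from the engine host's shell to see which data files were generated and where they live:

    # The file set's control file (extension .fs) is plain text; viewing it
    # shows the data files the export created on the processing nodes.
    cat /data/exports/customers.fs

The data files it lists are written in the format the stage describes, so they can generally be read by tools outside DataStage as well.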
Sequential File Stage: It is a file stage that allows you to read data from, or write data to, one or more flat files. The stage can have a single input link or a single output link, and a single rejects link.
The stage executes in parallel mode if it is reading multiple files, but executes sequentially if it is reading only one file. By default a complete file is read by a single node (although each node might read more than one file). For fixed-width files you can configure the stage to behave differently (the sketch after this section illustrates why fixed-width data makes this possible):
  • You can specify that single files can be read by multiple nodes. This can improve performance on cluster systems. 
  • You can specify the number of readers that run on a single node, which means a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node).
The stage executes in parallel if it is writing to multiple files, but executes sequentially if it is writing to a single file. Each node writes to a single file, but a node can write more than one file. When reading or writing a flat file, InfoSphere® DataStage® needs to know something about the format of the file: how the file is divided into rows and how rows are divided into columns.
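To see why those parallel-read options apply to fixed-width data, note that with a fixed record length any reader can compute a record boundary from a byte offset without scanning for delimiters. A minimal shell sketch, assuming a hypothetical input file with 100-byte records:

    # Fixed-width records make record boundaries computable from byte offsets,
    # which is what lets several readers (or nodes) share one file.
    BYTES=$(wc -c < /data/in/transactions.dat)   # hypothetical input file
    RECLEN=100                                   # assumed fixed record length
    echo "records: $((BYTES / RECLEN))"
    # Reader k of N can seek roughly to byte k * (BYTES / RECLEN / N) * RECLEN
    # and read its share of records with no delimiter scanning.

With delimited (variable-width) data this arithmetic is impossible, which is why a single delimited file is read by one node by default.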

Data Set Stage: It is also a file stage, which allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other InfoSphere® DataStage® jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds (a few commands for working with these files are sketched after the list below). Using data sets wisely can be key to good performance in a set of linked jobs.
  • It preserves partitioning: the data stays on the nodes where it was written, so downstream jobs can avoid repartitioning it.
  • It stores data in DataStage's internal format, so reading from and writing to a data set takes less time than converting to and from any other format.
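Because the segment files behind a .ds control file are in the engine's internal format and are spread across nodes, they are best examined and removed through the DataStage engine's orchadmin utility rather than with ordinary file commands. A minimal sketch, using a hypothetical data set name (exact options can vary between versions):

    # Manage the data set through its .ds control file (run on the engine host).
    orchadmin ll /data/ds/customers.ds      # list the segment files on each node
    orchadmin dump /data/ds/customers.ds    # print the records the data set holds
    orchadmin rm /data/ds/customers.ds      # remove the control file and all segments

Deleting a .ds file with plain rm leaves the segment files behind on the nodes, which is why the utility is preferred.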
-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

1 comment:

  1. Hi, I have a Transformer stage with two output links, one calculating a checksum and then writing to the table, and the other one calculating a checksum and then writing to a dataset. Here, I want the dataset to be written only after the table has been written successfully. Is this possible in the same job..?
