InfoSphere Information Server How To: What is InfoSphere DataStage DataSet

InfoSphere DataStage has its own proprietary file format. Files in this format are created on the system for processing by dependent jobs or by other ETL processing in DataStage. These files, stored in a native format, are known as Data Sets. Data Sets are processed in parallel and are also called Orchestrate files.
A Data Set is created by a file stage, also named Data Set, which lets you store the data being operated on in a persistent form that other DataStage jobs can then use for further processing.
The fundamental concept of the DataStage parallel framework (the Orchestrate framework) is that Data Sets serve as the inputs and outputs of Orchestrate operators.
Put simply, a Data Set is like a database table: it consists of a collection of identically defined rows. It is one of the mechanisms by which data is stored in transit for later parallel processing and shared across jobs. Each operator accepts input from one Data Set and sends its output to another Data Set.
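To make the "operator in, Data Set out" idea concrete, here is a minimal Python sketch. It is only an analogy and not DataStage code: the function names filter_operator and derive_operator, the column names, and the sample rows are all made up for illustration. Each "operator" takes one collection of identically defined rows and returns another.

```python
# Analogy only: a "data set" here is a list of rows that all share the same columns.
# filter_operator and derive_operator are hypothetical names, not DataStage operators.

def filter_operator(data_set, predicate):
    """Read one data set and emit a new one holding only the matching rows."""
    return [row for row in data_set if predicate(row)]

def derive_operator(data_set, column, fn):
    """Read one data set and emit a new one with an extra derived column."""
    return [{**row, column: fn(row)} for row in data_set]

# Sample input data set (invented values).
orders = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 35.5},
    {"order_id": 3, "amount": 310.0},
]

# The output of one operator becomes the input of the next.
large_orders = filter_operator(orders, lambda r: r["amount"] >= 100)
flagged = derive_operator(
    large_orders, "tier", lambda r: "gold" if r["amount"] >= 300 else "silver"
)

for row in flagged:
    print(row)
```

The intermediate result large_orders plays the role of the Data Set that links two operators.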
A Data Set exists on all the processing nodes defined, in the configuration file named by APT_CONFIG_FILE, for the job that is currently processing it. The subset of rows in a Data Set located on a single processing node is referred to as a "partition" of the Data Set. Technically, a partition is the subset of the rows in a Data Set earmarked for processing on the same processing node. A control file associated with each Data Set contains the record schema that defines the row structure.
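The partition idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not how DataStage partitions data internally: the node names, the cust_id key column, and the CRC32-based hash are all chosen here for the example.

```python
import zlib

# Hypothetical node names; in a real job they come from the configuration file.
NODES = ["node1", "node2", "node3"]

def partition_rows(rows, key, nodes=NODES):
    """Assign every row to exactly one node by hashing its key column."""
    partitions = {node: [] for node in nodes}
    for row in rows:
        # CRC32 is used only because it gives the same result on every run.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        node = nodes[digest % len(nodes)]
        partitions[node].append(row)
    return partitions

# Invented sample rows, all with the same structure (the "record schema").
rows = [{"cust_id": i, "amount": i * 10} for i in range(1, 10)]
partitions = partition_rows(rows, "cust_id")

for node, part in partitions.items():
    print(node, [r["cust_id"] for r in part])
```

Each list in partitions corresponds to one partition: the rows that would be processed on that node. Rows with the same key value always land in the same partition.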

As described above, a Data Set keeps its data on the processing nodes, so it preserves partitioning and avoids repartitioning the data. Also, because the data is stored in an internal binary format, it takes less time to read and write from within a DataStage job.
A Data Set consists of multiple files, listed below:

Descriptor file: contains the schema details and the addresses of the data.
Data files: contain the data in native format.
Control and header files: reside in the operating system and contain the schema and other relevant information.
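The relationship between the descriptor file and the data files can be pictured with a small Python model. This is a conceptual sketch only: the class names, field names, and file paths are hypothetical, and the real files are binary and should not be read this way.

```python
from dataclasses import dataclass, field

@dataclass
class DataFile:
    """Stands in for one native-format data file on one node."""
    node: str
    path: str  # hypothetical path, for illustration only
    rows: list = field(default_factory=list)

@dataclass
class Descriptor:
    """Stands in for the descriptor file: schema plus addresses of the data."""
    schema: dict
    data_files: list

def read_data_set(descriptor):
    """Follow the descriptor to each data file and yield (node, row) pairs."""
    for data_file in descriptor.data_files:
        for row in data_file.rows:
            yield data_file.node, row

# Invented example: one data set spread over two nodes.
descriptor = Descriptor(
    schema={"cust_id": "int32", "name": "string"},
    data_files=[
        DataFile("node1", "/data/node1/customers.part0", [{"cust_id": 1, "name": "Ann"}]),
        DataFile("node2", "/data/node2/customers.part1", [{"cust_id": 2, "name": "Bob"}]),
    ],
)

for node, row in read_data_set(descriptor):
    print(node, row)
```

The point of the model is that the descriptor holds no rows itself; it only records the schema and where each partition's data lives.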

You can manage Data Sets with the Data Set Management utility in the GUI; for command-line or automation purposes, use orchadmin (discussed in the next blog post).

-Ritesh

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions