Friday, September 30, 2011

InfoSphere DataStage Job: What Is It?

In the ETL world, the terms Flow, Job, and Process all refer to a combination of data processing modules working in an integrated form. An IBM InfoSphere DataStage job is no different. It consists of stages linked together on the design canvas of the Designer client, describing the flow of data from a data source to a data target. A stage usually has at least one data input and/or one data output, and some stages can accept more than one data input or output to more than one stage; in simple terms, these stages support multiple input or output links. Each stage has a set of predefined, editable properties, which might include the file name for a Sequential File stage, the columns to sort on, the transformations to perform, or the table name for a database stage.

Stages and links can be grouped into a shared container, and instances of the shared container can then be reused in different parallel jobs. A local container can also be defined by grouping stages and links within a job into a single unit, though its use is limited to the job in which it is defined. Different business logic requires different sets of stages. The stages available on the palette fall into categories such as General, Database, Development, Processing, and so on. A set of these stages wired together defines a flow, and this flow is called a job.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

InfoSphere DataStage Job Execution Flow & Job Score

DataStage job design and execution consist of multiple steps. It starts with job creation and saving, followed by compilation. When a parallel job is compiled, it generates OSH (discussed in osh). When we execute the job, the generated OSH and the information provided in the parallel configuration file (APT_CONFIG_FILE) are used to compose a “score”. At run time, IBM InfoSphere DataStage identifies the degree of parallelism and the node assignments for each operator. It then inserts sorts and partitioners as needed to ensure correct results, defines the connection topology (virtual data sets/links) between adjacent operators/stages, and inserts buffer operators to prevent deadlocks (for example, in fork-joins). It also determines the number of actual OS processes, combining multiple operators/stages within a single OS process where appropriate, to improve performance and optimize resource requirements.
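The parallel configuration file named by APT_CONFIG_FILE is a plain-text description of the logical nodes available to the engine. A minimal two-node sketch is shown below; the node names, fastname, and paths are hypothetical placeholders, not values from any real installation:

```
{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/ibm/ds/resource" {pools ""}
    resource scratchdisk "/ibm/ds/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/ibm/ds/resource" {pools ""}
    resource scratchdisk "/ibm/ds/scratch" {pools ""}
  }
}
```

With a file like this, the score would map operators across two logical nodes, giving a default degree of parallelism of two for parallel stages.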
The job score is used to fork processes with communication interconnects for data, messages, and control. Processing begins after the job score is composed and the processes are created, so this can be considered the initialization stage. Job processing ends when the final operator has processed all the data, when any operator encounters a fatal error, or when the job is halted by DataStage job control or by manual intervention such as a DataStage STOP.

Job scores consist of data sets (partitioning and collecting) and operators (node/operator mappings). Execution manages control and message flow across processes and consists of the conductor node and one or more processing nodes. Real data flows from player to player, while the conductor and section leaders are used only to control process execution through the control and message channels.

The conductor is the initial framework process. It creates the section leader processes (one per node), consolidates messages into the DataStage log, and manages orderly shutdown. The conductor node hosts the start-up process and also communicates with the players. Setting APT_DUMP_SCORE causes the score to be printed in the job log.
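APT_DUMP_SCORE is an ordinary environment variable, so it can be set in the environment from which a job is launched (or as a job parameter in the Administrator/Designer clients). A small Python sketch of preparing such an environment; the commented dsjob invocation is shown only for context, since it requires a DataStage installation, and the project and job names are hypothetical:

```python
import os

# Build an environment in which the parallel engine will dump the job score
# into the job log at start-up.
env = dict(os.environ, APT_DUMP_SCORE="1")

# Hypothetical launch from the command line (requires DataStage on PATH):
#   dsjob -run -jobstatus myProject myJob

print(env["APT_DUMP_SCORE"])  # "1"
```

Any job started from a shell or wrapper that exports this variable will include the score in its log.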

A section leader is a process that forks player processes (one per stage) and manages communications. Section leaders communicate only with the conductor and their player processes. For a given parallel configuration file, one section leader is started for each logical node.
Players are the real processes associated with stages. Each player sends stderr and stdout to its section leader, establishes connections to other players for data flow, and cleans up on completion. A player may need to communicate with every other player. There are separate communication channels (pathways) for control, errors, messages, and data. The data channel does not go via the section leader or conductor, as this would limit scalability; instead, data flows directly from an upstream operator to its downstream operator.
