Friday, September 30, 2011

InfoSphere DataStage Job Execution Flow & Job Score

DataStage Job design and execution consist of multiple steps. It starts with Job creation, saving and then compiling. Once Parallel Job is compiled it generates OSH (discussed in osh ). When we execute a job, the generated OSH and information provided in Config File (APT_CONFIG_FILE) is used to compose a “score”. During run-time IBM InfoSphere DataStage identifies degree of parallelism & node assignments for each operator. It then inserts sorts and partitioners as needed to ensure correct results. It also defines the connection topology (virtual data sets/links) between adjacent operators/stages, and inserts buffer operators
to prevent deadlocks (for example, in fork-joins). It also defines number of actual OS processes combining multiple operators/stages within a single OS process as appropriate, to improve performance & optimize resource requirements.
The job score is used to fork processes with communication interconnects for data, message and control. Processing begins after the job score and processes are created so we can say it is initialization stage. Job processing ends when either complete data is processed by the final operator, a fatal error is encountered by any operator, or the job halted by DataStage Job Control or manual intervention such as DataStage STOP.

Job scores consist of data sets (partitioning and collecting) and operators (node/operator mapping). The execution manages control and message flow across processes and consists of the conductor node and one or more processing nodes. Real data flows from player to player while conductor and section leader are only used to control process execution through control and message channels.

Conductor is initial framework process which creates Section Leader processes (1 per node) consolidates messages to the DataStage log & manages orderly shutdown. The Conductor node has the start-up process & also communicates with the players. APT_DUMP_SCORE can be used to print the score inside the log.

Section Leader is a process that forks player processes (one per stage) & manages communications. SLs communicate between the conductor and player processes only. For a given parallel configuration file, one section leader will be started for each logical node.
Players are the real processes associated with stages and sends stderr & stdout to the SL, also establishes connections to other players for data flow & cleans up on completion. Each player need to communicate with every other player. There are separate communication channels (pathways) for control, errors, messages & data. Data channel does not go via the section leader/conductor as this would limit scalability. Instead data flows directly from upstream operator to downstream operator.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

No comments:

Post a Comment