Sunday, September 1, 2013

Conductor and Compute Nodes in DataStage - II


DataStage job development is platform independent, and job execution relies entirely on the parallel configuration file, which can be set through APT_CONFIG_FILE for a single job or for a set of jobs. This configuration file maps the logical processing nodes to the real resources and infrastructure used to execute the DataStage job at run time.
The Information Server parallel framework uses a process-based architecture, so it can scale up by adding more resources and changing the configuration file to use the new infrastructure.

Every DataStage job starts multiple processes (refer to APT_CONFIG_FILE). Based on the sample configuration file, a job with 2 stages will start:
1 Conductor process (started on the Conductor Node)
4 Section Leaders (4 nodes * 1 Section Leader per node), which create and manage the player processes
8 Player processes (2 stages * 4 nodes), which perform the real execution; the number of processes can change based on optimization

Look at the Dump Score for details. Based on the sample configuration file, the DataStage job creates 13 processes (1 + 4 + 8).
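The arithmetic behind that total can be written as a short sketch. The function name and its parameters are hypothetical and are not part of DataStage; the result is only an estimate, because the real player count can change based on optimization, and the Dump Score remains the place to confirm it.

```python
def estimate_process_count(compute_nodes: int, stages: int) -> dict:
    """Estimate the processes started by a parallel job.

    Hypothetical helper for illustration only. It assumes one player
    per stage per node, with no optimization changing that number.
    """
    section_leaders = compute_nodes      # one Section Leader per logical node
    players = stages * compute_nodes     # one player per stage on each node
    return {
        "conductor": 1,                  # always a single conductor per job
        "section_leaders": section_leaders,
        "players": players,
        "total": 1 + section_leaders + players,
    }


# Sample configuration: 4 compute nodes, job with 2 stages.
counts = estimate_process_count(compute_nodes=4, stages=2)
print(counts)
```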
The Conductor Node is meant for job start-up, assigning resources, creating Section Leaders, coordinating the various processes and their status, and stopping all processes in the event of failure.
The primary process triggered when a DataStage job starts is called the conductor. It reads the job design and the specified parallel configuration file, then starts a coordinating process called a “section leader” for each node. Each section leader, based on the score (number of stages), starts separate processes called “players”.
Communication between the conductor, section leaders, and player processes in a parallel job takes place over TCP.

-Ritesh 
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Conductor and Compute Nodes in DataStage - I

Below is a sample APT_CONFIG_FILE consisting of 3 engines, where 2 of them are used as Compute Nodes and 1 engine is used as the Conductor Node only. The Conductor Node starts the job.
If the requirement is to use the main engine only to start jobs (as the Conductor Node) and not as a Compute Node, do not include the conductor in the default pool (the one represented by ""). Below you will notice that the conductor is assigned to a pool called "conductor" (this name is just an example; it could have been any name) and that it does not include the default pool "". You will also notice that all other nodes contain only the default pool "". With this change the conductor node starts the job, but all Section Leaders and other processes run on the remote nodes.

 {
node "node0"
{
fastname "Engine01"
pools "conductor"
resource disk "/opt/IBM/InformationServer/Server/Datasets/node0" {pools "conductor"}
resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch/node0" {pools ""}
}
node "node1"
{
fastname "Engine02"
pools ""
resource disk "/opt/IBM/InformationServer/Server/Datasets/node1" {pools ""}
resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch/node1" {pools ""}
}
node "node2"
{
fastname "Engine02"
pools ""
resource disk "/opt/IBM/InformationServer/Server/Datasets/node2" {pools ""}
resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch/node2" {pools ""}
}
node "node3"
{
fastname "Engine03"
pools ""
resource disk "/opt/IBM/InformationServer/Server/Datasets/node3" {pools ""}
resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch/node3" {pools ""}
}
node "node4"
{
fastname "Engine03"
pools ""
resource disk "/opt/IBM/InformationServer/Server/Datasets/node4" {pools ""}
resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch/node4" {pools ""}
}
}
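The pool assignments in a file like the one above can be checked with a small script. This is a minimal sketch in Python, not an IBM tool: the parse_nodes helper is hypothetical, it relies on the one-statement-per-line layout used in this sample, and it reads only the node-level pools line (resource-level pools are ignored). The embedded text is a trimmed copy of the sample with three of the five nodes.

```python
import re

# Trimmed copy of the sample configuration: the conductor-only node
# plus two of the compute nodes.
SAMPLE_CONFIG = '''
{
node "node0"
{
fastname "Engine01"
pools "conductor"
resource disk "/opt/IBM/InformationServer/Server/Datasets/node0" {pools "conductor"}
resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch/node0" {pools ""}
}
node "node1"
{
fastname "Engine02"
pools ""
resource disk "/opt/IBM/InformationServer/Server/Datasets/node1" {pools ""}
resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch/node1" {pools ""}
}
node "node3"
{
fastname "Engine03"
pools ""
resource disk "/opt/IBM/InformationServer/Server/Datasets/node3" {pools ""}
resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch/node3" {pools ""}
}
}
'''

QUOTED = re.compile(r'"([^"]*)"')  # captures every double-quoted value on a line


def parse_nodes(config_text):
    """Return {node name: {"fastname": str, "pools": [str, ...]}}.

    Hypothetical line-based parser; assumes the layout of the sample file.
    """
    nodes = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("node "):
            current = QUOTED.findall(line)[0]
            nodes[current] = {"fastname": None, "pools": []}
        elif current is not None and line.startswith("fastname "):
            nodes[current]["fastname"] = QUOTED.findall(line)[0]
        elif current is not None and line.startswith("pools "):
            # Node-level pools only; "resource ..." lines never reach here.
            nodes[current]["pools"] = QUOTED.findall(line)
    return nodes


nodes = parse_nodes(SAMPLE_CONFIG)
# A node takes part in normal processing only if it is in the default pool "".
compute_nodes = [name for name, info in nodes.items() if "" in info["pools"]]
conductor_only = [name for name, info in nodes.items() if "" not in info["pools"]]
print("compute nodes:", compute_nodes)
print("outside default pool:", conductor_only)
```

Running the same check against the full five-node file should list node1 through node4 as compute nodes and leave node0 outside the default pool.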

Brief Summary of different DataStage Processes
  • Conductor Node (one per job): the main process used to start up jobs, determine resource assignments, and create Section Leader processes on one or more processing nodes. It acts as a single coordinator for status and error messages, and manages orderly shutdown when processing completes or in the event of a fatal error. The conductor node runs on the primary server.
  • Section Leaders (one per logical processing node): used to create and manage player processes which perform the actual job execution. The Section Leaders also manage communication between the individual player processes and the master Conductor Node.
  • Players: one or more logical groups of processes used to execute the data flow logic. All players are created as groups on the same server as their managing Section Leader process.
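The relationship between these three process types can be pictured as a small tree. The sketch below is only a model for illustration: the class names, the build_process_tree function, and the stage names StageA and StageB are hypothetical, and nothing here starts real DataStage processes. It uses the node-to-engine mapping from the sample configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Player:
    stage: str
    host: str


@dataclass
class SectionLeader:
    node: str
    host: str
    players: list = field(default_factory=list)


@dataclass
class Conductor:
    host: str
    section_leaders: list = field(default_factory=list)


def build_process_tree(conductor_host, compute_nodes, stages):
    """Model one conductor, one Section Leader per node, one player per stage."""
    conductor = Conductor(host=conductor_host)
    for node, host in compute_nodes.items():
        leader = SectionLeader(node=node, host=host)
        # Players are created on the same server as their Section Leader.
        leader.players = [Player(stage=stage, host=host) for stage in stages]
        conductor.section_leaders.append(leader)
    return conductor


tree = build_process_tree(
    conductor_host="Engine01",
    compute_nodes={
        "node1": "Engine02",
        "node2": "Engine02",
        "node3": "Engine03",
        "node4": "Engine03",
    },
    stages=["StageA", "StageB"],  # hypothetical stage names
)

print("conductor on", tree.host)
for leader in tree.section_leaders:
    stage_names = [player.stage for player in leader.players]
    print(leader.node, "on", leader.host, "runs players for", stage_names)
```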
The next blog post discusses this further, providing more insight into the node concept in DataStage.
-Ritesh 