Tuesday, September 27, 2011

SAS Processing with InfoSphere DataStage - an Example Flow


  



This example shows how to execute a SAS DATA step in parallel.

The step takes a single SAS data set as input and writes its results to a single SAS data set as output. The DATA step recodes the salary field of the input data, replacing a dollar amount with a salary-scale value. This DATA step requires little effort to parallelize because it processes each record without regard to record order or to any other record in the input. The step also performs the same operation on every input record and contains no BY clauses or RETAIN statements.
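To make the "no dependence on record order" point concrete, here is a minimal sketch in Python (not SAS or osh, and not the original job). It assumes hypothetical salary bands and field names, since the post does not give the real cut-offs. It recodes each record on its own, then checks that splitting the records across independent workers gives the same result as a single sequential pass.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical salary bands: (upper bound in dollars, scale value).
# The real cut-offs are not given in the post; these are assumptions.
SALARY_BANDS = [(10_000, 1), (25_000, 2), (50_000, 3)]
TOP_SCALE = 4  # assumed scale value for salaries above every band


def recode_salary(record):
    """Return a copy of the record with salary replaced by its scale value."""
    scale = TOP_SCALE
    for upper, value in SALARY_BANDS:
        if record["salary"] < upper:
            scale = value
            break
    return {**record, "salary": scale}


def run_sequential(records):
    """Recode every record in one pass, in input order."""
    return [recode_salary(r) for r in records]


def run_partitioned(records, partitions=3):
    """Split records round-robin and recode each partition independently."""
    parts = [records[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        results = pool.map(run_sequential, parts)
    return [rec for part in results for rec in part]


# Made-up sample input, for illustration only.
records = [
    {"id": 1, "salary": 8_500},
    {"id": 2, "salary": 19_000},
    {"id": 3, "salary": 42_000},
    {"id": 4, "salary": 120_000},
    {"id": 5, "salary": 25_000},
]


def by_id(record):
    return record["id"]


# Output order may differ between the two runs, so compare after sorting.
sequential = sorted(run_sequential(records), key=by_id)
partitioned = sorted(run_partitioned(records), key=by_id)
print(sequential == partitioned)
```

Because recode_salary looks only at the record it is given, the way the records are partitioned does not affect the result. A step that used a BY clause or a RETAIN statement would carry state from one record to the next, and this simple split would no longer be safe.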
To execute this DATA step in parallel:
 
  • Get the input from a SAS data set using a sequential sas operator;
  • Execute the DATA step in a parallel sas operator;
  • Output the results as a standard InfoSphere DataStage data set (you must provide a schema for this) or as a parallel SAS data set. You might also pass the output to another sas operator for further processing. One way to obtain the required schema is to first write the data to a Parallel SAS data set and then reference that data set; InfoSphere DataStage then generates the schema automatically.
The sas operator can then do one of three things: use the sasout operator with its -schema option to output the results as a standard InfoSphere DataStage data set, output the results as a Parallel SAS data set, or pass the output directly to another sas operator as a SAS data set. The default output format is a SAS data set. When the output is to a Parallel SAS data set or to another sas operator, for example, as a standard InfoSphere DataStage data set, the liborch statement must be used.
 
-Ritesh 
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions




