Friday, December 16, 2011

What is the Modify Stage?

With a Modify stage you can perform
  • null handling
  • date/time manipulation 
  • date/time interval calculation (limited) 
  • string trimming and substring 
  • data type conversion
One major limitation is that Modify stage functions are limited to a single argument, which straight away rules out arithmetic and concatenation.
As many have learned from experience, the Transformer stage will reject rows in which fields participating in derivations or stage variables contain NULLs. Solving the NULL problem with IF IsNull() for every single field being evaluated can get overly complex and very messy. Instead, put a Modify stage before the Transformer (call it, say, MhandleNull) and apply handle_null() to every field the Transformer evaluates. This simplifies the already cumbersome IF THEN ELSE syntax of the Transformer and/or its stage variables.
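As a sketch of the MhandleNull idea (the column names and replacement values below are hypothetical), each output column specification in the Modify stage would apply handle_null() so the downstream Transformer never sees a NULL:

```
CustomerName = handle_null(CustomerName, 'UNKNOWN')
OrderTotal   = handle_null(OrderTotal, 0)
OrderDate    = handle_null(OrderDate, '1900-01-01')
```

With specifications like these in place, the Transformer derivations can reference the fields directly instead of wrapping every single reference in IF IsNull().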

Courtesy: Vincent McBurney

The Modify stage is a metadata panel beater. It does just one thing: converts columns from one data type to another. It is quite restrictive: you cannot nest functions or use constants as values. Almost all of the functions in the Modify stage are also available in the all-rounder Transformer stage. The Transformer is an automatic; the Modify stage is a manual.

Transformers are easy to use, which is one of the reasons DataStage has been so successful: the Transformer is the most commonly used stage. When adding derivation/mapping/transformation code into the text entry windows, there is a right mouse click menu for finding a list of transformation functions.
Back when the parallel engine was a product called Orchestrate, there was a Transform function that worked much the same way as other parallel commands. It required a lot of manual coding. When Ascential turned Orchestrate into parallel jobs, they took the Transformer stage from Server Edition and sat it over the top of the parallel transform function. This is when the trouble began. The Transformer stage and the transform function didn't quite fit together, so each parallel Transformer needs a C++ compilation, a kind of automated custom stage build, even in the latest versions. The upshot is that this made it slower than other parallel stages.
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Wednesday, December 14, 2011

Transformer & DRS "The Real All Rounders"

Many of the functions available in the Transformer can also be performed by other, individual specialized stages.
Take constraints, for example, which can be achieved by the Filter stage, just as metadata conversion can be done by the Modify stage. The Copy stage can take care of copying fields and producing multiple outputs, the Column Generator can create new fields, and the Surrogate Key stage is meant for counters. The Transformer provides them all in one place, but for performance you should use the highly specialized, faster stages. These stages can perform faster because they do not carry as much overhead. There is also the question of ease of use: working with the specialized stages is not always "fun". You need to be an expert, you need to understand other developers' thinking, and you may need a well-defined strategy to work with them.
The Transformer is hands down the best stage in the entire DataStage product for writing any type of transformation code, given the right mouse click menu, the red text marking for syntax errors, and the grid layout.
The Transformer stage is one of the few stages that has a robust reject link for rows that fail transformation. The Modify and Filter stages lack this link, so if a piece of data defies the logic, the job will abort rather than drop the row down a reject link.
Similarly, the DRS (Dynamic Relational Stage), now also available as a connector, is another all-rounder. The DRS stage and connector provide all RDBMS connectivity in one place: instead of multiple plug-ins or connectors, you can use this one stage and connect to any database. Connectors do give you the same look across databases, but it is still easier to use a single stage for all database operations. The simple benefit: suppose you moved from Oracle to IBM DB2, or from Teradata to IBM Netezza; open the job, select the new database from the drop-down, and you are done. No need to remove one stage and add another. That said, it is recommended to use the specialized connectors or stages when you are not running multiple databases in house.


Why Do I Need Run Time Column Propagation (RCP)?

While designing DataStage jobs, consider only the columns that are required, and avoid unnecessary column propagation. Columns that are not needed in the job flow should not be propagated from one stage to another or from one job to the next. That argues for keeping RCP out of the job, BUT there are areas where it comes in handy. When you need to handle undefined columns that you encounter only when the job is run, and propagate these columns through to the rest of the job, RCP (Runtime Column Propagation) is the way forward. The Administrator check box enables the feature; to actually use it, you must explicitly select the option on each stage.
If runtime column propagation is enabled in the DataStage Administrator, you can select Runtime column propagation on a stage to specify that columns encountered by a stage in a parallel job can be used even if they are not explicitly defined in the metadata. You should always ensure that runtime column propagation is turned on if you want to use schema files to define column metadata.

RCP Set at DataStage Administrator:
RCP Set at DataStage Stage Output:
Here are a few areas where RCP can make a difference:

Merge Stage & Lookup Stage: Ensure required column meta data has been specified (this may be omitted altogether if you are relying on Runtime Column Propagation).

Shared Container: When inserted in a job, a shared container instance already has meta data defined for its various links. This meta data must match that on the link that the job uses to connect to the container exactly in all properties. The Inputs page enables you to map meta data as required. The only exception to this is where you are using runtime column propagation (RCP) with a parallel shared container. If RCP is enabled for the job, and specifically for the stage whose output connects to the shared container input, then meta data will be propagated at run time, so there is no need to map it at design time.

For parallel shared containers, you can take advantage of runtime column propagation to avoid the need to map the meta data. If you enable runtime column propagation, then, when the jobs runs, meta data will be automatically propagated across the boundary between the shared container and the stage(s) to which it connects in the job.
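Schema files are where RCP earns its keep: the job design carries no column grid, and the record layout is read at run time. Here is a minimal sketch of an osh-style schema file (the file's columns and formats are made up for illustration):

```
record {final_delim=end, delim=','}
(
    cust_id:   int32;
    cust_name: string[max=50];
    balance:   nullable decimal[10,2];
)
```

A Sequential File stage pointed at a schema file like this, with RCP enabled on its output link, propagates cust_id, cust_name, and balance through the job even though no downstream stage defines them explicitly.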

Again, "do use RCP" when you need to propagate these unknowns; otherwise, processing additional columns over large data volumes means spending precious resources and time on unwanted data.


Tuesday, December 13, 2011

How to detect Last Row in Group within DataStage

When data is grouped on a column and is also sorted on that column, you can detect when you are processing the last row before the value in that column changes. The data must be sorted in the job so that the Transformer stage can detect that it is sorted. If the data is sorted before being input to the job, you can include a dummy Sort stage and set the stage's Sort Key Mode property to "Don't Sort (Previously Sorted)" for every column on which the data was previously sorted.

To detect key breaks and then process data accordingly, use the new function LastRowInGroup(InputColumn). It can be called while processing an input row, and it returns TRUE if the value in the named column changes after this row (that is, this row is the last row in a group). The function also returns TRUE if this row is the last row in the input data. It can be called even if the input data is sorted by more than one column. If the argument specified is the primary sort key, the behavior is as described for a single column.

If the argument specified is the secondary or another sort key, then LastRowInGroup() returns TRUE if the value of the specified column is about to change, or if any of its higher-level key columns are about to change. For example, if the primary sort column is Col1 and the secondary sort column is Col2, then LastRowInGroup(Col2) returns TRUE when either the value in Col2 is about to change, or the value in Col1 is about to change. Therefore it can return TRUE when Col2 is not about to change but Col1 is, because Col1 is a higher-level sort key than Col2.
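As a sketch (the link and column names are hypothetical, and this is pseudocode for the Transformer's stage variable grid), LastRowInGroup() can be combined with a carried flag to count rows per group and fire an output row only on the group break:

```
-- Stage variables, evaluated top to bottom for each input row:
svCount        :=  If svPrevWasLast Then 1 Else svCount + 1
svIsLast       :=  LastRowInGroup(lnk_in.FundId)
svPrevWasLast  :=  svIsLast

-- Output link constraint (one row written per key break):
svIsLast
```

Because svPrevWasLast is evaluated after svCount, on each row svCount still sees the previous row's flag, which is what resets the counter at the start of each new group.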

Please refer to the previous blog on caching within the Transformer for an example and usage.

Saturday, December 10, 2011

How to Find Whether a DataStage Job Is Multi-Instance for Reporting

Here are steps that can be used in a script to check whether a given DataStage job is multi-instance or not.
All that is needed is the DataStage project name and the job name.

ScriptName=$(\basename $0)
Project=$1
JobName=$2
tmpFile=/tmp/${ScriptName}.$$

if [[ $# != 2 ]]; then
    printf "Error: $ScriptName - Required parameter(s) not supplied!\n\n"
    printf "Usage: $ScriptName <Project> <JobName>\n\n"
    printf "       <Project> : is the unique identifier for the DataStage project, required\n"
    printf "       <JobName> : is the DataStage job name, required\n"
    exit 1
fi

echo "step ........: checking whether the job is multi-instance or not"

DSHOME=$(cat /.dshome)
cd $DSHOME

# The uvsh session logs into the project and must emit a line of the form
# "JobType:<n>" for the job; the query itself was elided in the original
# posting and is environment specific.
$DSHOME/bin/uvsh <<! > $tmpFile
LOGTO $Project
!
if [[ $? != 0 ]]; then
    echo "CheckJobType - unable to determine job type for job $JobName in $Project DataStage project!"
    \rm -f $tmpFile
    exit 1
fi

JobType=$(\grep "^JobType:" $tmpFile | \cut -f2 -d":")
if [[ -z $JobType ]]; then
    echo "CheckJobType - Job $JobName not found in $Project DataStage project. JobType cannot be determined!"
    \rm -f $tmpFile
    exit 1
fi

if [[ $JobType -eq 0 ]]; then
    echo "*****$JobName is not a multi-instance job*****"
else
    echo "*****$JobName is a multi-instance job*****"
fi
\rm -f $tmpFile
exit 0
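The parsing at the heart of the script, pulling the JobType value out of the uvsh output, can be exercised on its own without a DataStage engine (the file name and the sample "JobType:" line below are made up):

```shell
#!/bin/sh
# Simulate the "JobType:" line that the uvsh session writes to the temp
# file, then apply the same grep/cut extraction the script performs
tmpFile=/tmp/jobtype_demo.$$
echo "JobType:1" > "$tmpFile"

JobType=$(grep "^JobType:" "$tmpFile" | cut -f2 -d":")

if [ "$JobType" -eq 0 ]; then
    result="not multi instance"
else
    result="multi instance"
fi
echo "$result"
rm -f "$tmpFile"
```

A JobType of 0 maps to "not multi instance"; anything else is reported as multi-instance, matching the branch at the end of the script.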


Friday, December 9, 2011

A Nice DataStage FAQ

It is a bit old, but functionality-wise it still covers the major aspects of DataStage. It does not cover the Information Server perspective, but ETL is still the same and so is this FAQ. A "must" read before jumping into any deep dives on DataStage, apart from the various DataStage manuals.


Caching 'Time Saver' in DataStage Transformer

This blog discusses simplifying even the most complicated data integration challenges. When we can achieve that and make data processing more efficient, it's the best of both worlds. The new cache mechanism benefits both of those goals.
The Transformer Cache is an in-memory storage mechanism available from within the Transformer stage that is used to help solve complex data integration scenarios. The cache is a first-in/first-out (FIFO) construct and is accessible to the developer via two new functions:
  • SaveInputRecord: stores an input row at the back of the cache
  • GetInputRecord: retrieves a saved row from the front of the cache 
These functions should in most cases be called from the stage variable or loop variable sections of the Transformer. Developers will find the cache most useful when a set of records needs to be analyzed as a single unit, with a result of that analysis then appended to each record in the group.
Here are a few scenarios, discussed by Tony in detail, where using the cache proves very helpful:
  • The input data set is sorted by fund id and valuation date in ascending order. We have an unknown number of records for each fund.  The requirement is to output the five most recent valuations for any fund and if there are not at least five, do not output any.
  • There is a varying number of clients (N) related to each salesperson.  The requirement is to label each such client detail record with a label that reads "1 of N". 
  • An input file contains multiple bank accounts for each customer. The requirement is to show the percentage of the total balance for each individual account record.
Perhaps one or more of these sounds familiar to you. You may also refer to the Information Server InfoCenter for more detail on this solution.
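Outside DataStage, the "1 of N" scenario can be sketched as a two-pass read over the same sorted input, which is essentially the role the cache plays inside the Transformer. This shell/awk illustration uses a made-up salesperson,client layout:

```shell
#!/bin/sh
# Two passes over the same file: pass 1 counts clients per salesperson
# (what the Transformer cache buffers), pass 2 labels each detail row
datafile=/tmp/sales_demo.$$
cat > "$datafile" <<'EOF'
smith,acme
smith,globex
jones,initech
EOF

labeled=$(awk -F, '
    NR==FNR { n[$1]++; next }                      # pass 1: count per group
    { i[$1]++; print $0 "," i[$1] " of " n[$1] }   # pass 2: append "i of N"
' "$datafile" "$datafile")

echo "$labeled"
rm -f "$datafile"
```

Inside the Transformer, SaveInputRecord() plays the role of the first pass (buffering the group) and GetInputRecord() replays the rows once N is known.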

Wednesday, December 7, 2011

Looping in InfoSphere DataStage Transformer

InfoSphere DataStage 8.5 introduced the Looping Transformer. It is built into the Transformer stage that developers have been using for years, so you can begin using it in any existing job. The looping feature allows very complex data integration challenges to be solved elegantly from a design perspective, and efficiently from a resource perspective.
Variable Length Records with Embedded Payloads 
Many customers and developers face a challenge related to a variable-length string that includes multiple record types and payloads embedded in the data. Here's a sample record.
You can see there is a series of record types (the first being "A"), payload lengths ("005"), and payloads ("abcde"). Now assume we wanted to convert that data to the following:
Rcd Type | Data Record

What makes this a challenging problem is the fact that the length is defined in the data, and the number of segments can vary tremendously (record 2 may have 100 payloads in that string). In DataStage 8.5, the looping transformer handles this very easily by simply introducing a loop condition. Here's the transformer logic for solving this:


The circled logic includes a new variable named "@ITERATION", a counter indicating which pass through the loop this is. The other new item is the Loop variables: basically the same as stage variables, but evaluated each time through the loop. The test for RemainingRecord <> "" allows us to exit the loop when all the bytes in the string have been consumed.
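The same consume-the-string loop can be sketched in shell for a made-up sample record of type/length/payload segments (1-byte type, 3-byte length, then the payload):

```shell
#!/bin/sh
# Parse "A005abcdeB003xyz" segment by segment, mirroring the
# transformer loop: peel off type and length, take the payload,
# and shrink the remainder until nothing is left
rec="A005abcdeB003xyz"
parsed=""
while [ -n "$rec" ]; do
    rtype=$(echo "$rec" | cut -c1)                        # record type
    len=$(expr "$(echo "$rec" | cut -c2-4)" + 0)          # length, as decimal
    payload=$(echo "$rec" | cut -c5-$((4 + len)))
    parsed="$parsed$rtype:$payload "
    rec=$(echo "$rec" | cut -c$((5 + len))-)              # RemainingRecord
done
echo "$parsed"
```

The while condition plays the role of the Loop Condition (RemainingRecord <> "") and each pass of the loop corresponds to one @ITERATION in the transformer.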
This design avoids several other transformers and funnels that the customer uses in the current implementation. The savings therefore apply not only to the initial design experience and run-time performance, but also to the ongoing maintenance of this job as related requirements in the organization change.


Monday, December 5, 2011

Looping in InfoSphere DataStage Sequencer

Looping is a must in any data processing flow, and here I discuss how it can be used within InfoSphere DataStage with the help of an example.
Suppose data is available in a file (any format DataStage can process), and we pass a file ID as a parameter, reading it from another file. The file ID will keep increasing, and we take advantage of DataStage looping instead of adding jobs or changing the sequencer. The contents of the file are an S.No and a File Id, and we use this File ID as a parameter to the parallel job.
To achieve this, start by getting the total number of lines in the file using a DataStage external routine whose command is Command = "wc -l ":FileName:" | awk '{print $1}'". The routine is triggered by calling DSExecute("UNIX", Command, Output, SystemReturnCode) to get the number of lines.
This provides the "Upper Limit" for the loop.
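The routine's command can be tried directly in shell. This sketch builds a small control file of "S.No FileId" pairs (the file name and contents are made up), derives the loop's upper limit, and fetches the File Id for one iteration:

```shell
#!/bin/sh
# Control file of "S.No FileId" pairs; derive the loop's To value
# (the routine's wc -l | awk) and the FileId for one iteration
ctlFile=/tmp/fileids_demo.$$
printf '1 101\n2 102\n3 103\n' > "$ctlFile"

UpperLimit=$(wc -l < "$ctlFile" | awk '{print $1}')    # loop To value
Counter=2                                              # StartLoop.$Counter
FileId=$(awk -v n="$Counter" 'NR == n {print $2}' "$ctlFile")

echo "limit=$UpperLimit fileid=$FileId"
rm -f "$ctlFile"
```

In the sequence, UpperLimit feeds the StartLoop activity's To value, and the awk lookup is what each iteration performs with StartLoop.$Counter to pass the File Id on to the parallel job.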

The File Id can be retrieved with the StartLoop.$Counter variable, with the help of the grep and awk commands mentioned above, yielding the file ID for each iteration.
This is how the DataStage job will look:

Will cover Loop in Transformer in a follow-up.


Sunday, December 4, 2011

DataStage Job Logs in DataStage Designer

With InfoSphere DataStage 8.7, the Director's Log Detail window has been integrated into Designer. With this feature, users can access the full set of job run-time information directly alongside their job design, simplifying the development life cycle. Note also that additional job control buttons (job stop/job reset) are provided in the log toolbar. This not only saves an additional connection to the server from Director, it also uses fewer resources on the client machine, since an extra Director session is no longer required.


Saturday, December 3, 2011

Bloom Filter Stage now Part of InfoSphere DataStage

With the growth in data and the requirement to process it optimally, filtering duplicate data is becoming a major concern for enterprises. InfoSphere DataStage recently released the Bloom Filter stage, which is based on the algorithm developed by Burton Howard Bloom and provides highly performant, resource-efficient duplicate key filtering. This is particularly useful in telco data integration patterns involving enormous volumes of call detail records, and it can be used to perform key lookups more efficiently. You might find false positives, but a Bloom filter never generates false negatives in your output dataset. The Bloom Filter stage takes a single input dataset and can generate multiple output sets depending on the operating mode.

Will cover detailed Job Design and Usage in a follow-up.


Wednesday, November 30, 2011

DataStage Functions and Routines

 DataStage BASIC functions:
A function performs mathematical or string manipulations on the arguments supplied to it, and returns a value. Some functions have 0 arguments; most have 1 or more. Arguments are always in parentheses, separated by commas, as shown in this general syntax: FunctionName(argument, argument). These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines.

DataStage Routines:
DataStage routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them. The following programming components are classified as routines: transform functions, before/after subroutines, custom UniVerse functions, ActiveX (OLE) functions, and Web Service routines.

Here are a few functions that give developers and the production team various capabilities for collecting information about and executing DataStage jobs.


Sunday, November 13, 2011

Handling Large Sequential File Data Parallely with InfoSphere DataStage

While handling huge volumes of data, sequential files, and from the DataStage perspective the Sequential File stage, can become one of the major bottlenecks, as reading from and writing to this stage is slow. Certainly do not use sequential files for intermediate storage between jobs: it causes performance overhead, since data conversion is needed when writing to and reading from a file. Dataset stages should be used instead for intermediate storage between different jobs.
Datasets are key to good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order. No repartitioning or import/export conversions are needed.
In order to read faster from the Sequential File stage, the number of readers per node can be increased (the default value is one). This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node).
This is an optional property and only applies to files containing fixed-length records. But this provides a way of partitioning data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP (Symmetric Multi Processing) system.

It can also be specified that single files can be read by multiple nodes. This is also an optional property and only applies to files containing fixed-length records. Set this option to "Yes" to allow individual files to be read by several nodes. This can improve performance on cluster systems.
IBM DataStage knows the number of nodes available, and using the fixed length record size, and the actual size of the file to be read, allocates to the reader on each node a separate region within the file to process. The regions will be of roughly equal size.
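The two behaviors described above correspond to alternative stage properties in the Sequential File stage; a sketch of the two settings (the value 4 is hypothetical):

```
-- Single file, partitioned as it is read on one node:
Number Of Readers Per Node = 4

-- Or: regions of the file spread across the nodes of the configuration:
Read From Multiple Nodes = Yes
```

Both apply only to fixed-length records, and the two options are mutually exclusive ways of getting parallel reads out of a single flat file.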


Monday, October 31, 2011

Deploying assets via IBM InfoSphere Information Server Manager

The work flow for deploying assets by using the IBM® InfoSphere™ Information Server Manager might involve more than one user. In tightly controlled environments with limited access to the target and source systems, a developer performs definition and build steps to write the package to the file system. A production manager (or any user with access to the required target project and the build package) then applies the package to a project on the target system.
Deployed assets are typically read-only in every environment except development. If an error is discovered in a test or production environment that requires a change to an asset, or if an improved implementation of a specific asset (such as an InfoSphere QualityStage rule set) is created, then the change is made in the development environment and applied to an existing package by rebuilding the package. The updated package can be deployed into an existing project, replacing the corresponding assets. There might be a requirement to revert to an earlier version of a deployment; in this case a previous build of the package is deployed, replacing the corresponding assets.
The following diagram shows the work flow of a created package: 

Saturday, October 29, 2011

DataStage Jobs on Single Processor and Multiple Processor Systems

The default behavior when compiling DataStage jobs is to run all adjacent active stages in a single process. This makes good sense when you are running the job on a single-processor system. When you are running on a multi-processor system, it is better to run each active stage in a separate process so the processes can be distributed among available processors and run in parallel. This can be achieved either by inserting IPC stages between connected active stages, or by turning on inter-process row buffering, either project-wide (using the DataStage Administrator) or for individual jobs (in the Job Properties dialog box). The IPC facility can also be used to produce multiple processes where passive stages are directly connected. This means that an operation reading from one data source and writing to another could be divided into a reading process and a writing process, able to take advantage of multiprocessor systems.
Behavior of Passive Stages
Behavior of Active Stages

Friday, September 30, 2011

InfoSphere DataStage Job: What Is It?

In the ETL world, Flow/Job/Process all refer to a combination of data processing modules in integrated form, and an IBM InfoSphere DataStage job is no different. It consists of stages linked together on the design canvas (in Designer) that describe the flow of data from a data source to a data target. A stage usually has at least one data input and/or one data output. Some stages can accept more than one data input and can output to more than one stage; in simple terms, these stages support multiple input or output links. Each stage has a set of predefined and editable properties, which might include the file name for the Sequential File stage, the columns to sort, the transformations to perform, or the table name for the database stages.

Stages and links can be grouped in a shared container, and instances of the shared container can then be reused in different parallel jobs. A local container can also be defined within a job by grouping stages and links into a single unit, though this limits its usage to the job in which it is defined. Different business logic requires different sets of stages. The stages available on the canvas range across General, Database, Development, Processing, and so on. A set of these stages is used to define a flow, and this flow is called a job.


InfoSphere DataStage Job Execution Flow & Job Score

DataStage job design and execution consist of multiple steps. It starts with job creation, saving, and then compiling. Once a parallel job is compiled it generates OSH (discussed in osh). When we execute a job, the generated OSH and the information provided in the configuration file (APT_CONFIG_FILE) are used to compose a "score". At run time, IBM InfoSphere DataStage identifies the degree of parallelism and the node assignments for each operator. It then inserts sorts and partitioners as needed to ensure correct results. It also defines the connection topology (virtual data sets/links) between adjacent operators/stages, and inserts buffer operators to prevent deadlocks (for example, in fork-joins). Finally, it determines the number of actual OS processes, combining multiple operators/stages within a single OS process as appropriate, to improve performance and optimize resource requirements.
The job score is used to fork processes with communication interconnects for data, messages, and control. Processing begins after the job score and processes are created, so we can call this the initialization stage. Job processing ends when either all data has been processed by the final operator, a fatal error is encountered by any operator, or the job is halted by DataStage job control or by manual intervention such as a DataStage STOP.

Job scores consist of data sets (partitioning and collecting) and operators (node/operator mapping). The execution manages control and message flow across processes and consists of the conductor node and one or more processing nodes. Real data flows from player to player, while the conductor and section leaders are only used to control process execution through control and message channels.

The Conductor is the initial framework process; it creates the Section Leader processes (one per node), consolidates messages to the DataStage log, and manages orderly shutdown. The conductor node runs the start-up process and also communicates with the players. APT_DUMP_SCORE can be used to print the score in the log.
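To capture the score in the job log, set the environment variable before the run; the project and job names below are hypothetical, and dsjob is the standard engine CLI:

```shell
#!/bin/sh
# Ask the engine to dump the parallel job score into the log
# on the next run of the job
APT_DUMP_SCORE=1
export APT_DUMP_SCORE
echo "APT_DUMP_SCORE=$APT_DUMP_SCORE"
# e.g.: dsjob -run -jobstatus MyProject MyParallelJob   # then inspect the log
```

APT_DUMP_SCORE can also be added as a job-level or project-level environment variable in the Administrator/Designer instead of being exported in a wrapper script.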

A Section Leader is a process that forks player processes (one per stage) and manages communications. SLs communicate only between the conductor and the player processes. For a given parallel configuration file, one section leader is started for each logical node.
Players are the real processes associated with stages; a player sends stderr and stdout to its SL, establishes connections to other players for data flow, and cleans up on completion. Each player needs to communicate with every other player. There are separate communication channels (pathways) for control, errors, messages, and data. The data channel does not go via the section leader/conductor, as this would limit scalability; instead, data flows directly from the upstream operator to the downstream operator.
