Friday, December 16, 2011

What is the Modify Stage

With a Modify stage you can perform
  • null handling
  • date/time manipulation 
  • date/time interval calculation (limited) 
  • string trimming and substring 
  • data type conversion
One major limitation is that Modify stage functions are limited to a single argument, which rules out arithmetic and concatenation straight away.
As many have learned from experience, you can get bitten by the Transformer stage rejecting rows in which fields participating in derivations or stage variables contained NULLs. Solving the NULL problem with an IF IsNull() test for every single field being evaluated can get overly complex and very messy. Instead, put a Modify stage before the Transformer, call the stage MhandleNull, and apply handle_null() to all fields evaluated in the Transformer. This simplifies the already cumbersome IF THEN ELSE syntax of the Transformer and/or its stage variables.
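As a sketch, the Specification property of such a Modify stage might read as follows (the column names and default values here are hypothetical, not from any real job):

```
ColA = handle_null(ColA, 0)
ColB = handle_null(ColB, 'UNKNOWN')
```

Each field keeps its name and type; only the NULLs are replaced with usable defaults before the rows reach the Transformer.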

Courtesy: Vincent McBurney

The Modify stage is a metadata panel beater. It does just one thing: convert columns from one data type to another. It is quite restrictive; you cannot nest functions or use constants as values. Almost all the functions in the Modify stage are also available in the all-rounder Transformer stage. The Transformer is an automatic; the Modify stage is a manual.

Transformers are easy to use, which is one of the reasons DataStage has been so successful; the Transformer is the most commonly used stage. When adding derivation / mapping / transformation code into the text entry windows, a right mouse click menu offers a list of transformation functions.
Back when the parallel engine was a product called Orchestrate, there was a Transform function that worked much the same way as other parallel commands. It required a lot of manual coding. When Ascential turned Orchestrate into parallel jobs, they took the Transformer stage from Server Edition and sat it over the top of the parallel Transform function. This is when the trouble began. The Transformer stage and the Transform function didn't quite fit together, so each parallel Transformer needs a C++ compilation, a kind of automated custom stage build, even in the latest versions. The upshot is that this makes it slower than other parallel stages.
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Wednesday, December 14, 2011

Transformer & DRS "The Real All Rounders"

Various functions available in the Transformer can also be performed by other, individual specialized stages.
Take constraints, for example, which can be achieved with the Filter stage, just as metadata conversion can with the Modify stage. The Copy stage can take care of copying fields and producing multiple outputs, the Column Generator can create new fields, and the Surrogate Key stage is meant for counters. The Transformer provides all of this in one place, but for performance you should use the highly specialized, faster stages. These stages can perform faster because they do not carry as much overhead. Ease of use also matters: working with the specialized stages is not always "fun". You need to be an expert, you need to understand another developer's intent, and you may need a well defined strategy to work with these specialized stages.
The Transformer is hands down the best stage to write any type of transformation code in the entire DataStage product given the right mouse click menu, the red text marking for syntax errors and the grid layout.
The Transformer stage is one of the few stages that has a robust reject link for rows that fail transformation. The Modify and Filter stages lack this link, so if a piece of data defies the logic the job will abort rather than drop the row down a reject link.
Similarly, the DRS (Dynamic Relational Stage), now available as a Connector as well, is another in the all-rounder category. The DRS stage and Connector provide all RDBMS connectivity in one place. You don't need to use multiple plugins or Connectors; you can use this one and connect to any database. Although Connectors give you the same look across databases, it is still easier to use a single stage for all database operations. A simple benefit: assume you moved from Oracle to IBM DB2, or from Teradata to IBM Netezza. Go to the job, select the new database from the drop-down, and you are done; no need to remove one stage and add another. That said, it is recommended to use the specialized Connectors or stages when you do not have multiple databases in house.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Why Do I Need to Use Runtime Column Propagation (RCP)

While designing DataStage jobs, consider only the columns that are required and make sure unnecessary column propagation is not done. Columns that are not needed in the job flow should not be propagated from one stage to another, or from one job to the next. That argues for keeping RCP out of the job, BUT there are areas where it comes in handy. When you need to handle undefined columns that you encounter when the job is run, and propagate those columns through to the rest of the job, RCP (Runtime Column Propagation) is the way forward. The Administrator check box enables the feature; to actually use it you need to explicitly select the option on each stage.
If runtime column propagation is enabled in the DataStage Administrator, you can select the Runtime column propagation option to specify that columns encountered by a stage in a parallel job can be used even if they are not explicitly defined in the meta data. You should always ensure that runtime column propagation is turned on if you want to use schema files to define column meta data.
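For reference, here is a minimal sketch of the kind of schema file RCP works with (the field names are made up for illustration):

```
record (
    CustomerID: int32;
    CustomerName: string[max=30];
    Balance: nullable decimal[10,2];
)
```

Columns defined only in a file like this, and not in the stage's column grid, can still flow through the job when RCP is enabled.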

RCP Set at DataStage Administrator:
RCP Set at DataStage Stage Output:
Here are a few areas where RCP can make a difference:

Merge Stage & Lookup Stage: Ensure required column meta data has been specified (this may be omitted altogether if you are relying on Runtime Column Propagation).

Shared Container: When inserted in a job, a shared container instance already has meta data defined for its various links. This meta data must match that on the link that the job uses to connect to the container exactly in all properties. The Inputs page enables you to map meta data as required. The only exception to this is where you are using runtime column propagation (RCP) with a parallel shared container. If RCP is enabled for the job, and specifically for the stage whose output connects to the shared container input, then meta data will be propagated at run time, so there is no need to map it at design time.

For parallel shared containers, you can take advantage of runtime column propagation to avoid the need to map the meta data. If you enable runtime column propagation, then, when the jobs runs, meta data will be automatically propagated across the boundary between the shared container and the stage(s) to which it connects in the job.

Again, "do use RCP" when you need to propagate these unknowns; otherwise, processing additional columns for large data volumes means spending precious resources and time on unwanted data.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Tuesday, December 13, 2011

How to detect Last Row in Group within DataStage

When data is grouped on a column, and also sorted on that column, you can detect when you are processing the last row before the value in that column changes. The data must be sorted in the job so that the Transformer stage can detect that it is sorted. If the data is sorted before being input to the job, you can include a dummy Sort stage and set its Sort Key Mode property to Don't Sort (Previously Sorted) for every column on which the data was previously sorted.

To detect key breaks, and then process data accordingly, use the new function LastRowInGroup(InputColumn). It can be called while processing an input row, and it returns TRUE if the value in the named column changes after this row (that is, this row is the last row in a group). The function also returns TRUE if this row is the last row in the input data. It can be called even if the input data is sorted by more than one column. If the argument specified is the primary sort key, the behavior is as described for a single column.

If the argument specified is the secondary or another lower-level sort key, then LastRowInGroup() returns TRUE if the value of the specified column is about to change, or if any of its higher-level key columns are about to change. For example, if the primary sort column is Col1 and the secondary sort column is Col2, then LastRowInGroup(Col2) returns TRUE when either the value in Col2 is about to change or the value in Col1 is about to change. It can therefore return TRUE when Col2 is not about to change but Col1 is, because Col1 is a higher-level sort key than Col2.
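The semantics are easy to model outside DataStage. Here is a small Python sketch (my own illustration, not the DataStage implementation) of LastRowInGroup over data sorted on Col1 and then Col2:

```python
def last_row_in_group(rows, key):
    """For each row, return True when the value of `key` (or any
    higher-level sort key) changes on the next row, or at end of input."""
    flags = []
    keys = ("Col1", "Col2")          # sort order: Col1 is the primary key
    upto = keys.index(key) + 1       # compare this key and everything above it
    for i, row in enumerate(rows):
        if i == len(rows) - 1:
            flags.append(True)       # last row of the input data
        else:
            nxt = rows[i + 1]
            flags.append(any(row[k] != nxt[k] for k in keys[:upto]))
    return flags

rows = [
    {"Col1": 1, "Col2": "a"},
    {"Col1": 1, "Col2": "a"},
    {"Col1": 1, "Col2": "b"},
    {"Col1": 2, "Col2": "b"},   # Col1 breaks here even though Col2 repeats
]
print(last_row_in_group(rows, "Col2"))  # [False, True, True, True]
```

Note how the third flag is TRUE purely because the higher-level key Col1 changes, matching the behavior described above.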

Please refer to the previous blog post on caching within the Transformer for an example and usage.
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Saturday, December 10, 2011

How To Find Whether a DataStage Job is Multi Instance for Reporting

Here are the steps, usable in a script, to check whether a given DataStage job is multi instance.
All that is needed is the DataStage project name and the job name.

ScriptName=$(\basename $0)

if [[ $# != 2 ]]; then
    printf "Error: $ScriptName - Required parameter(s) was not supplied!\n\n"
    printf "Usage: $ScriptName <Project> <JobName>\n\n"
    printf "       <Project> : is the unique identifier for the datastage project, required\n"
    printf "       <JobName> : is the Datastage JobName, required\n"
    exit 1
fi

Project=$1
JobName=$2
tmpFile=/tmp/$ScriptName.$$

echo "step .........: checking whether job is multi instance or not"

DSHOME=`cat /.dshome`
cd $DSHOME

# Query the project repository; the query must emit a line of the
# form "JobType: <n>" (adjust it to your version's repository layout).
bin/uvsh <<! > $tmpFile
LOGTO $Project
SELECT 'JobType:' FMT '10L', JOBTYPE FROM DS_JOBS WHERE NAME = '$JobName';
!
if [[ $? != 0 ]]; then
    echo "CheckJobType - unable to determine jobtype for job $JobName in $Project datastage project!"
    exit 1
fi

JobType=$(\grep "^JobType:" $tmpFile | \cut -f2 -d':')
if [[ -z $JobType ]]; then
    echo "CheckJobType - Job $JobName not found in $Project datastage project. JobType cannot be determined!"
    \rm -f $tmpFile
    exit 1
fi

if [[ $JobType -eq 0 ]]; then
    echo "*****$JobName is not a multi instance job*****"
else
    echo "*****$JobName is a multi instance job*****"
fi
\rm -f $tmpFile
exit 0

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Friday, December 9, 2011

A Nice DataStage FAQ

It is a bit old, but functionality-wise it still covers the major aspects of DataStage. It does not cover the Information Server perspective, but ETL is still the same, and so is this FAQ. A "must" read before jumping into any deep dives on DataStage, alongside the various DataStage manuals.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Caching 'Time Saver' in DataStage Transformer

This blog discusses simplifying even the most complicated data integration challenges. When we can achieve that, and make data processing more efficient, it's the best of both worlds. The new cache mechanism is a benefit to both of those goals.
The Transformer Cache is an in-memory storage mechanism that is available from within the Transformer stage and is used to help solve complex data integration scenarios. The cache is a first-in/first-out (i.e. FIFO) construct and is accessible to the developer via two new functions:
  • SaveInputRecord: stores an input row at the back of the cache
  • GetInputRecord: retrieves a saved row from the front of the cache
In most cases these functions should be called from the stage variable or loop variable sections of the Transformer. Developers will find the cache most useful when a set of records needs to be analyzed as a single unit and a result of that analysis then appended to each record in the group.
Here are a few scenarios, discussed by Tony in detail, where using the cache will prove VERY helpful:
  • The input data set is sorted by fund id and valuation date in ascending order. We have an unknown number of records for each fund.  The requirement is to output the five most recent valuations for any fund and if there are not at least five, do not output any.
  • There is a varying number of clients (N) related to each salesperson.  The requirement is to label each such client detail record with a label that reads "1 of N". 
  • An input file contains multiple bank accounts for each customer. The requirement is to show the percentage of the total balance for each individual account record.
Perhaps one or more of these sounds familiar to you. You may also refer to the Information Server InfoCenter for more detail on this solution.
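For the "1 of N" scenario above, the cache pattern can be modelled in plain Python (an illustration of the logic only, not Transformer code; SaveInputRecord and GetInputRecord are mimicked here by a deque):

```python
from collections import deque

def label_one_of_n(rows, group_key):
    """Model of the Transformer cache pattern: each row is queued
    (SaveInputRecord); on the last row of a group the queue is drained
    (GetInputRecord) and every cached row gets a '<pos> of <N>' label."""
    cache = deque()              # the FIFO cache
    out = []
    for i, row in enumerate(rows):
        cache.append(row)        # SaveInputRecord
        last_in_group = (i == len(rows) - 1
                         or rows[i + 1][group_key] != row[group_key])
        if last_in_group:
            n = len(cache)       # group size is only known here
            for pos in range(1, n + 1):
                saved = cache.popleft()   # GetInputRecord
                out.append({**saved, "label": f"{pos} of {n}"})
    return out

rows = [{"sales": "Ann", "client": "c1"},
        {"sales": "Ann", "client": "c2"},
        {"sales": "Bob", "client": "c3"}]
for r in label_one_of_n(rows, "sales"):
    print(r["sales"], r["client"], r["label"])
```

The key point the model captures: nothing can be written for a group until the whole group has been seen, which is exactly why a FIFO cache (rather than a single stage variable) is needed.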
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Wednesday, December 7, 2011

Looping in InfoSphere DataStage Transformer

InfoSphere DataStage introduced the Looping Transformer, built into the Transformer stage that developers have been using for years, so you can begin using it in any existing job. The looping feature allows very complex data integration challenges to be solved elegantly from a design standpoint, and efficiently from a resource perspective.
Variable Length Records with Embedded Payloads 
Many customers and developers face a challenge related to a variable length string that includes multiple record types and payloads embedded in the data. Here's a sample record.
You can see there is a series of record types (the first being "A"), payload lengths ("005") and payloads ("abcde"). Now assume we wanted to convert that data to the following:
Rcd Type | Data Record

What makes this a challenging problem is the fact that the length is defined in the data and the number of segments can vary tremendously (record 2 may have 100 payloads in that string). In DataStage 8.5, the Looping Transformer handles this very easily by simply introducing a loop condition. Here's the Transformer logic for solving this:


The logic that has been circled includes a new variable named "@ITERATION", a counter indicating which pass through the loop this is. The other new item is the Loop variables: basically the same as Stage variables, except they are evaluated each time through the loop. The test for RemainingRecord <> "" allows us to exit the loop once all bytes in the string have been consumed.
This approach avoids several other Transformers and Funnels the customer is using in the current implementation. The savings therefore apply not only to the initial design experience and run-time performance, but also to the ongoing maintenance of this job as related requirements in the organization change.
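For comparison, here is the same consuming loop written as plain Python (a sketch of the algorithm only, not the generated Transformer code; the layout assumed is a 1-character record type, a 3-digit payload length, then the payload):

```python
def split_record(record):
    """Consume a variable-length record of the form
    <type:1><len:3><payload> repeated, e.g. 'A005abcdeB003xyz'."""
    remaining = record           # plays the role of RemainingRecord
    iteration = 0                # plays the role of @ITERATION
    out = []
    while remaining != "":       # the loop condition from the Transformer
        iteration += 1
        rcd_type = remaining[0]
        length = int(remaining[1:4])
        payload = remaining[4:4 + length]
        out.append((rcd_type, payload))
        remaining = remaining[4 + length:]   # consume this segment
    return out

print(split_record("A005abcdeB003xyz"))
# [('A', 'abcde'), ('B', 'xyz')]
```

One input row fans out into as many output rows as there are segments, which is exactly what the loop condition buys you over a fixed set of derivations.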

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Monday, December 5, 2011

Looping in InfoSphere DataStage Sequencer

Looping is a must in any data processing flow, and here I discuss how it can be used within InfoSphere DataStage with the help of an example.
Suppose data is available in a file (any format DataStage can process), and we pass a file ID as a parameter, reading it from another file. The number of file IDs will keep increasing, and we take advantage of DataStage looping instead of adding jobs or changing the sequencer. The file contains a serial number and a file ID per line, and we use this file ID as a parameter to the parallel job.
To achieve this, start by getting the total number of lines in the file using a DataStage external routine whose command is Command = "wc -l":" ":FileName:" | awk '{print $1}'". The routine can be triggered by calling DSExecute("UNIX", Command, Output, SystemReturnCode).
This will provide the "Upper Limit" for the loop.

The file ID for each iteration can then be retrieved using the StartLoop.$Counter variable, with the help of the grep and awk commands mentioned above.
This is how the DataStage job will look:
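Outside the Sequencer, the two helper steps (upper limit for the loop, then the file ID for pass N) amount to the following Python sketch. The file layout assumed here, one "serial,fileid" pair per line, is my assumption for illustration:

```python
def upper_limit(lines):
    """Equivalent of: wc -l <file> | awk '{print $1}'"""
    return len(lines)

def file_id_for(lines, counter):
    """Equivalent of fetching line $Counter and taking the file-id column."""
    serial, file_id = lines[counter - 1].split(",")
    return file_id.strip()

# contents of the hypothetical parameter file
lines = ["1,F1001", "2,F1002", "3,F1003"]
limit = upper_limit(lines)                        # loop runs 1..limit
ids = [file_id_for(lines, c) for c in range(1, limit + 1)]
print(limit, ids)
# 3 ['F1001', 'F1002', 'F1003']
```

Each value in `ids` corresponds to what one loop pass would hand to the parallel job as its file-ID parameter.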

Will cover Loop in Transformer in a follow-up.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Sunday, December 4, 2011

DataStage Job Logs in DataStage Designer

With InfoSphere DataStage 8.7, the Director's Log Detail window has been integrated into Designer. With this feature, users can access the full set of job run-time information directly alongside their job design, thus simplifying the development life cycle. You will also note that additional job control buttons (job stop / job reset) are provided in the log toolbar. This not only saves an additional connection to the server from Director, but also uses fewer resources on the client machine, since there is no need to open Director as well.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Saturday, December 3, 2011

Bloom Filter Stage now Part of InfoSphere DataStage

With the growth in data, and the requirement to process it optimally, filtering duplicate data is becoming a major concern for enterprises. InfoSphere DataStage recently released the Bloom Filter stage, which is based on the algorithm developed by Burton Howard Bloom and provides highly performant, resource-efficient duplicate key filtering. This is particularly useful in telco data integration patterns involving enormous volumes of call detail records, and it can be used to perform key lookups more efficiently. You might find false positives, but a Bloom filter never generates false negatives in your output data set. The Bloom Filter stage takes a single input data set, and can generate multiple output data sets depending on the operating mode.
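To see why false positives are possible but false negatives are not, here is a toy Bloom filter in Python (purely illustrative; this is not the stage's implementation, and the sizes chosen are arbitrary):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: false positives possible,
    false negatives impossible (the same guarantee the stage gives)."""
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0            # bit array packed into one integer

    def _positions(self, key):
        # derive `hashes` bit positions from the key
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def seen_before(self, key):
        """Return True if key was (probably) seen already, then record it."""
        pos = list(self._positions(key))
        hit = all((self.bits >> p) & 1 for p in pos)
        for p in pos:
            self.bits |= 1 << p
        return hit

bf = BloomFilter()
keys = ["cdr-001", "cdr-002", "cdr-001"]     # third is a duplicate key
print([bf.seen_before(k) for k in keys])
```

A duplicate key always lands on bits that are already set, so it can never slip through as "new"; an unrelated key can, rarely, collide with already-set bits, which is where the false positives come from.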

Will cover detailed Job Design and Usage in a follow-up.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions