Tuesday, January 25, 2011

Finally!! The DRS plugin gets its replacement: the DRS Connector

Until recently, the DRS (Dynamic Relational Stage) plugin was a common and extensively used connectivity mechanism within InfoSphere DataStage. Because it was designed to run in sequential mode, it raised many performance concerns, yet it was often the first option considered because it offers a one-stop shop for all supported RDBMSs. Its support for custom SQL, including SQL read from a file, also made it more popular with customers than other plugins.

The IBM InfoSphere DataStage team recently released a new Dynamic Relational Stage Connector, which is built on the new parallel connector technology and can provide better performance. Currently it is available only for the IBM DB2, Oracle, and ODBC database types. It even provides bulk load for DB2 and Oracle, a completely new capability compared to the older DRS stage.
All older ETL jobs can be easily migrated to the new connector stage, and gain all of its benefits, using the Connector Migration Tool. Because the connector is part of the next-generation connector framework, it inherits all of the framework's benefits, including more features and better support. It also offers the capabilities available in the other connectors, including XML and BLOB handling and shared metadata, which was not the case with the DRS plugin.
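The migration can also be scripted. Here is a minimal sketch using the command-line version of the Connector Migration Tool; the install path, host, credentials, project, job name, and option letters are all illustrative assumptions, so verify them against the 8.5 documentation before use.

    # Hypothetical command-line migration of one DRS job to the connector stage.
    # All paths, credentials, and option letters below are placeholders.
    cd /opt/IBM/InformationServer/Clients/CCMigrationTool
    ./CCMigration -h services_host:9080 -u dsadm -p secret \
                  -P MyProject -j DRS_Load_Job -M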

-Ritesh
 Disclaimer: "The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.

Integrating InfoSphere Information Server and InfoSphere Change Data Capture

With IBM InfoSphere Information Server 8.5, we can use the product together with IBM InfoSphere Change Data Capture (InfoSphere CDC) to perform real-time, continuous replication with guaranteed delivery and transactional integrity in the event of failure.
In simple terms, an IBM InfoSphere DataStage job can read the change data captured by InfoSphere CDC and apply the changes to a target database. This is achieved with the new CDC Transaction stage, which integrates the replication capabilities provided by InfoSphere CDC with the ETL capabilities provided by InfoSphere DataStage.
[Image: Data flow in a CDC Transaction stage job]
The picture above demonstrates how this works in practice.
Steps Involved
  1. InfoSphere CDC transfers the change data according to the replication definition.
  2. The InfoSphere CDC for InfoSphere DataStage server sends data to the CDC Transaction stage through a TCP/IP session that is created when replication begins.
  3. In the InfoSphere DataStage job, the data flows over links from the CDC Transaction stage to the target database connector stage.
  4. The target database connector stage connects to the target database and sends data over the session.
  5. Periodically, the InfoSphere CDC for InfoSphere DataStage server requests bookmark information from a bookmark table on the target database. In response, the CDC Transaction stage fetches the bookmark information through ODBC and returns it to the InfoSphere CDC for InfoSphere DataStage server (see the sketch after this list).
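For a quick sanity check of step 5, the bookmark table can also be queried directly over ODBC. A minimal sketch using the unixODBC isql utility; the DSN, credentials, and table name are placeholders, since the real names come from your InfoSphere CDC for InfoSphere DataStage configuration.

    # Hypothetical ODBC query of the bookmark table on the target database.
    # TARGET_DSN, dsuser, dspass, and CDC_BOOKMARK are all placeholders.
    echo "SELECT * FROM CDC_BOOKMARK;" | isql TARGET_DSN dsuser dspass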
This stage is going to provide real-time ETL in a seamless manner.

Step-by-step use of the new CDC Transaction stage:
http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=/com.ibm.swg.im.iis.conn.cdc.doc/topics/scenario_applying_change_data.html

I will cover more on this at a later stage.
-Ritesh
 Disclaimer: "The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.

Sunday, January 23, 2011

Ray Wurlod on DataStage 8.5


Ray, one of the veteran DataStage gurus, highlights the best among the various 8.5 features.
-Ritesh

Friday, January 21, 2011

Tips and Tricks to Debug InfoSphere DataStage Jobs

Here is a small list of tricks for debugging DataStage parallel jobs. I will keep updating this post on an as-needed basis, so it is a continuously evolving list of steps.

APT_DISABLE_COMBINATION
Controls whether multiple stages are combined into one process. When you are facing issues, disabling combination makes it easier to identify the error and its source, because each stage then runs in its own process.

APT_CONFIG_FILE
Required to run any DataStage parallel job. You can have multiple configuration files and choose among them based on your requirements, using different node combinations for different purposes.
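For debugging, a single-node configuration file is often handy because it takes partitioning out of the picture. A minimal sketch; the hostname and directory paths are assumptions that must match your own engine tier.

    {
      node "node1"
      {
        fastname "etl-host"
        pools ""
        resource disk "/data/datastage/datasets" {pools ""}
        resource scratchdisk "/data/datastage/scratch" {pools ""}
      }
    }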

APT_PM_SHOW_PIDS
Reports the process ID of the player processes for each stage in the log.

APT_RECORD_COUNTS
Reports the number of records processed per stage in the DataStage Director log.

APT_PM_PLAYER_TIMING
Reports the CPU time used by each stage.

OSH_DUMP
Generates/displays the OSH code for a DataStage parallel job, including any unexpected settings set by the GUI.

APT_DUMP_SCORE
Dumps the parallel execution score, detailing all processes and operators in your job, including any operators inserted by the framework (such as sorts or partitioners).
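A minimal sketch of switching these variables on for one shell session; in practice they are usually set per project in the Administrator client or added as job parameters, and the APT_CONFIG_FILE path below is an assumption.

    # Enable the debugging variables described above for this session.
    export APT_DISABLE_COMBINATION=1   # one process per stage
    export APT_PM_SHOW_PIDS=1          # log each player process ID
    export APT_RECORD_COUNTS=1         # per-stage record counts in the log
    export APT_PM_PLAYER_TIMING=1      # CPU time used by each stage
    export OSH_DUMP=1                  # show the generated OSH
    export APT_DUMP_SCORE=1            # dump the parallel execution score
    # Point at a single-node configuration file (assumed path)
    export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/debug1node.apt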
   
The Row Generator stage can be used to generate sample data for a DataStage parallel job to consume.

Phantom files provide details about additional error messages and reside in $DSHOME/../DSProjects/<Project_Name>/&PH&.
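From the shell, the newest phantom files can be listed directly; the project name below is a placeholder, and the quotes matter because of the '&' characters in the directory name.

    # List the most recent phantom files for a project (MyProject is a placeholder)
    ls -lt "$DSHOME/../DSProjects/MyProject/&PH&" | head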

If we want to see intermediate data, we can use a Copy stage to dump the data to a Peek stage or a Sequential File stage.
...Ongoing
-Ritesh

Disclaimer: "The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.