Thursday, June 30, 2011

How to Measure Performance of each process in DataStage Job

Use the APT_PM_PLAYER_TIMING environment variable to collect CPU timing information for each operator in a job flow. With this variable set, the job log reports how much CPU time each partition of each operator consumed during its run-time portion.
In a real-world ETL flow there will be many operators and many partitions, so it is essential to understand how much CPU each operator (and each partition of each operator) is using. If one partition of an operator is using significantly more CPU than the others, it might mean the data is partitioned in an unbalanced way, and that repartitioning, or choosing different partitioning keys, might be a useful strategy.

If one operator is using a much larger portion of the CPU than the others, it might indicate a problem in the flow, though keep relative operator cost in mind: a sort is going to use dramatically more CPU time than a copy. Getting a sense of which operators are the CPU hogs, combined with the other metrics presented here, can be very enlightening.

Setting APT_DISABLE_COMBINATION might be useful in some situations to get finer-grained information about which operators are using up CPU cycles. Be aware, however, that setting this flag will change the performance behavior of your flow, so it should be done with care. Unlike the job monitor CPU percentages, APT_PM_PLAYER_TIMING provides timings for every operator within the flow.
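Both variables can be turned on at the project level from the command line with dsadmin, as a quick way to enable them for an investigation. A minimal sketch, assuming a project named MyProject and that the variables are already defined for the project (the exact option syntax may differ slightly between Information Server versions; -envadd can be used first if a variable is not yet defined):
dsadmin -envset APT_PM_PLAYER_TIMING -value TRUE MyProject
dsadmin -envset APT_DISABLE_COMBINATION -value TRUE MyProject
Remember to set APT_DISABLE_COMBINATION back to FALSE once the investigation is over, since it changes how the flow runs.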

-Ritesh 
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Optimal Processing of Data ....Design Tips with InfoSphere DataStage - Series 3

Here are some basic tips for designing good performance into a DataStage job.

1.      Avoid Type Conversions
Unnecessary type conversions can cascade into multiple conversions, so retrieve data in the desired format and keep it that way. Take care of data types and any required conversions at design time. Use the OSH_PRINT_SCHEMAS environment variable to verify that runtime schemas match the job design column definitions during your development cycle (see the example after this list). If using stage variables on a Transformer stage, ensure their data types match the expected result types.

2.      Wise Usage of Transformer Stages
Consider merging multiple Transformer stages if their functionality can be incorporated into a single stage, and use other stage types to perform simple transformation operations where the requirement allows.

3.      Optimal use of "Sort"
Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types. 

4.      Keep Columns only if required
Remove a column as soon as it has served its purpose. Every additional unused column requires extra buffer memory and data transfer, making each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.

5.      Avoid Same Partitioning if accessing Sequential File
Unless more than one source file is specified, using Same partitioning will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless it is explicitly repartitioned.
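For tip 1, one way to verify the runtime schemas is to run the job with OSH_PRINT_SCHEMAS enabled (for example as a job-level environment variable parameter) and then read the schemas back from the job log. A minimal sketch using the dsjob log commands, with MyProject and MyJob as placeholder names:
dsjob -logsum -type INFO -max 100 MyProject MyJob
dsjob -logdetail MyProject MyJob 25
The -logsum call lists recent informational log entries (the printed schemas appear among them), and -logdetail shows the full text of one entry; the event id 25 here is just an example.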

-Ritesh 
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Optimal Processing of Data ....Modular Development with InfoSphere DataStage - Series 2

When designing jobs in InfoSphere DataStage, the aim should be to use modular development techniques. Modular development maximizes reuse of parallel jobs and components and saves development time.
  • Parameterize the job at design time and supply values at run time. This allows a single job design to process different data in different circumstances, with multiple invocations of the same design each processing a different set of data (see the dsjob sketch after this list). It saves the designer from producing multiple copies of the same job with slightly different arguments.
  • Use shared containers to share common logic across a number of jobs. Shared containers are inserted when a job is compiled. If the shared container is changed, the jobs using it will need recompiling.
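As a minimal sketch of run-time parameterization from the command line, the same compiled job can be launched repeatedly with different parameter values, each invocation processing a different set of data. The project, job, and parameter names below are placeholders:
dsjob -run -mode NORMAL -param SourceDir=/data/in/current -param Region=EMEA -jobstatus MyProject MyParameterizedJob
A scheduler can call the same command with different -param values for each run instead of maintaining multiple copies of the job.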
-Ritesh 
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Optimal Processing of Data ....Lots of Data with InfoSphere DataStage - Series 1

What makes an ETL tool valuable to an enterprise is its ability to process large volumes of data in a short period of time. That throughput depends on every aspect of the flow and the environment being optimized for maximum performance, and the flow design is what needs to be considered first. Performance tuning and optimization are iterative processes that begin with job design and unit tests, proceed through integration and volume testing, and continue throughout the production life cycle of the application.
Although plenty of documentation is available on optimal design, here are a few basic performance pointers for IBM InfoSphere DataStage.
  • If intermediate results are shared only between parallel jobs, use persistent data sets (Data Set stages). Ensure that the data stays partitioned and that the partitioning and sort order are retained at every stage, and avoid format conversion or serial I/O so that performance is not impacted.
  • Data Sets can also be used to create checkpoints, or restart points, in the event that a job or sequence needs to be rerun. Note that Data Sets are platform and configuration specific.
  • Based on available system resources, overall processing time can be optimized at run time by executing jobs concurrently. This requires attention during flow design to how the data arrives and to any reprocessing requirements.
  • Multiple parallel configuration files can be maintained. The configuration file selected at run time determines the degree of parallelism and the resources used by parallel jobs, so different configuration files should be used to optimize overall throughput and to match job characteristics to the available hardware in development, test, and production (see the sample configuration file after this list).
  • Accurate and thoughtful configuration of scratch and resource disks and the underlying filesystem and physical hardware architecture significantly affect overall job performance.
  • Within clustered ETL and database environments, resource-pool naming can be used to limit processing to specific nodes, including database nodes when appropriate. 
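As an illustration of the configuration file point above, here is a minimal two-node parallel configuration file of the kind selected through APT_CONFIG_FILE. The host name, directory paths, and node names are placeholders and would be adapted to the actual environment:
{
  node "node1"
  {
    fastname "etl-server"
    pools ""
    resource disk "/iisdata/resource1" {pools ""}
    resource scratchdisk "/iisdata/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etl-server"
    pools ""
    resource disk "/iisdata/resource2" {pools ""}
    resource scratchdisk "/iisdata/scratch2" {pools ""}
  }
}
The pools entries are where node pool names are declared, which is what allows processing to be limited to specific nodes as mentioned in the last point.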
Will cover other aspects of Job Design in next set.
 -Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Wednesday, June 29, 2011

Why do I need InfoSphere FastTrack

Data warehouses can help companies gain insight into customer behavior, predict sales and determine the profitability of a set of customers or products. However, implementing a data warehouse has traditionally been a time-consuming process: it requires understanding spreadsheets of data transformation requirements and then building an ETL process from them. That is where IBM comes in. By helping companies fast-track their data warehousing initiatives, IBM can help deliver dramatically better information that enables innovation and responsiveness. An IBM FastTrack project can lay the foundation for both dynamic warehousing and master data management.
InfoSphere FastTrack automates multiple data integration tasks, from analysis to code generation, while incorporating the business perspective and maintaining lineage and documented requirements. IBM Information Server FastTrack can help enterprises accelerate the deployment of data integration projects by simplifying and improving the communication between the business analyst, who thinks in business logic, and the developer, who thinks in code. FastTrack shortens the design time needed to create source-to-target mappings and automatically generates jobs, saving much of the developer time previously spent on mappings and job design. These mappings and jobs are then integrated into the metadata repository of IBM InfoSphere Information Server.


 -Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

How to Transform Data using InfoSphere DataStage

As mentioned in another of my blog posts, data transformation is at the core of any business process. IBM InfoSphere DataStage, a leading ETL tool, handles data transformation by providing annotations, rule expressions, filters, joins, aggregation, functions and many other transformation components.
-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Tuesday, June 21, 2011

InfoSphere DataStage in Telco - Etisalat Egypt

InfoSphere DataStage is used extensively by telecom companies worldwide for their ETL processes, with other Information Server modules used for data cleansing and data lineage. Etisalat Egypt is another telecommunications company on that list; it provides fixed-line and mobile phone service and a host of 3G offerings, including video calling, mobile TV, mobile broadband Internet and data services. The value of a data warehouse is that it provides a single trusted source of information.
For data quality and integration teams, loading data from dozens or hundreds of source systems throughout the enterprise into the data warehouse can be challenging. To address this challenge, the company uses IBM InfoSphere DataStage and other Information Server modules to move data from numerous source systems into its data warehouse. For the detailed case study and the performance gains achieved, have a look at DataStage in Telecom Domain (Etisalat Egypt)


-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

InfoSphere DataStage and Reliance Life


Reliance Life Insurance Company decreased total cost of ownership by 50 percent and increased application uptime from 80 to 95 percent when it replaced an Oracle/Sun platform with an IBM DB2/Power server platform and replaced other applications with the IBM InfoSphere suite of products. When Oracle Database on Sun systems did not meet the company's needs for performance and reliability, Reliance Life decided to opt for the solution IBM proposed.

Here is the detailed case study from an InfoSphere DataStage perspective. Please note that InfoSphere DataStage and InfoSphere QualityStage carried the WebSphere prefix in earlier releases. Insurance_case_study

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Version Control of DataStage and Quality Stage Jobs

Multiple DataStage and QualityStage developers connect to the same InfoSphere Information Server system, and its metadata repository, using the Designer client. When they are working on shared assets it is necessary to keep those assets in source control, so that changes are recorded and the work can be reverted to the last checkpoint if a change breaks the business logic.
With IBM InfoSphere Information Server 8.5 we can manage InfoSphere DataStage and QualityStage jobs and related assets, such as metadata and other designs, in an enterprise source control system. This is done using the Information Server Manager deployment tool, which gives direct access to CVS (Concurrent Versions System) and IBM Rational ClearCase. Developers run the deployment tool from their client systems to control these assets. The deployment tool uses an Eclipse workspace, referred to as the source control workspace, as a local transfer area for assets being submitted to the source control system.
Submitting an asset from the metadata repository to your source control system is a two-step process:
  1. Send the asset to the source control workspace.
  2. Submit (Commit) the artifact to the source control system.
Restoring an asset in the metadata repository from a version in the source control repository is also a two-step process:
  1. The tool retrieves the asset archive (artifact) from the source control repository.
  2. The tool imports the artifact from the source control project as an asset in the metadata repository.
(Diagram: conceptual relationship between the metadata repository, the deployment tool, and the source control system.)
Here are details on DataStage Version Control
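Alongside the deployment tool, assets can also be exported to an archive file from the command line with the istool utility, which some teams use for scripted backups next to their source control system. A rough sketch only; the host, credentials, project, and job names are placeholders, and the exact option names and asset path syntax should be checked against your Information Server version:
istool export -domain services-host:9080 -username dsadm -password secret -archive /tmp/MyJob.isx -datastage 'engine-host/MyProject/Jobs/MyJob.pjb'
The resulting .isx archive can then be checked into the same source control system.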
 
-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions



Tuesday, June 14, 2011

Executing InfoSphere DataStage from outside

This post is about executing InfoSphere DataStage jobs and the range of job control options: how we can interact with DataStage from the command line or from scripts, as discussed in the previous two posts on dsjob and dsadmin. dsjob even allows other products to interact with InfoSphere DataStage. Although dsjob and dsadmin are standard utilities, their documentation sits in the Server Job Developer's Guide, which parallel job developers rarely read, so it is not the easiest information to find.
A common requirement for any tool is scheduling of its processes. DataStage comes with a scheduling tool, the Director, which provides the functionality required of a scheduler: viewing jobs, running jobs and looking at job log results. Under the hood it adds scheduled jobs to the operating system scheduler. Its main advantage over a third-party scheduling tool is the job run options screen, which lets you enter job parameter values when you schedule the job.
But enterprises often want one common interface for all their scheduling requirements, which means setting job parameters and running the job from some type of scripting language. Scheduling tools execute jobs using the dsjob command. Here is an old DSExchange thread by Ken and Steve explaining how such a script is used as the main interface between an enterprise scheduler and the single controlling job for a jobstream; the script is responsible for fetching a batch number from a process metadata table, among other things.
Vincent McBurney also provided a detailed perspective on how we can Run DataStage from outside DataStage. Vincent also covers Sequence Jobs and Job Control Code in detail. 
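As a minimal sketch of the kind of wrapper script an enterprise scheduler might call (project, job, and parameter names are placeholders, and the exit-code conventions should be confirmed for your version):
#!/bin/sh
# Wrapper invoked by the scheduler: runs one DataStage job and maps the
# dsjob status to a conventional shell exit code.
PROJECT=MyProject
JOB=LoadCustomerDim
BATCH_NO=$1

# Run the job, passing the batch number as a job parameter;
# -jobstatus makes dsjob wait for completion and report the final job status.
dsjob -run -mode NORMAL -param BatchNumber=$BATCH_NO -jobstatus $PROJECT $JOB
RC=$?

# Commonly 1 = finished OK and 2 = finished with warnings.
if [ $RC -eq 1 ] || [ $RC -eq 2 ]; then
    echo "$JOB completed (dsjob status $RC)"
    exit 0
else
    echo "$JOB failed (dsjob status $RC)" >&2
    exit 1
fi
In a real jobstream the script would also fetch the batch number from a process metadata table, as described in the thread above, rather than taking it as an argument.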

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, 
strategies or opinions 

InfoSphere DataStage Command Line Interfaces - Series 2 "dsadmin"

To continue my previous post on the DataStage command line interfaces, let me take it forward from the job level to the project level. As discussed earlier, the InfoSphere DataStage CLI connects to the engine on the local system using the user name and password of the user running the command. This is the first step for dsadmin, as for any other command line interface; after that, a number of options can be triggered through dsadmin.
dsadmin can be used to create a project, delete a project, and enable or disable the generated OSH display. It can also enable or disable RCP (runtime column propagation), display all environment variable details, add a new environment variable or delete an existing one, and list all projects and project properties.
Here are a few dsadmin options.
dsadmin -listprojects | -listproperties ProjectName | -listenv ProjectName 
dsadmin -createproject ProjectName [-location ProjectLocation] | -deleteproject ProjectName
dsadmin -oshvisible TRUE | FALSE ProjectName
dsadmin -enablercp TRUE | FALSE ProjectName
dsadmin -enablejobadmin TRUE | FALSE ProjectName
dsadmin -enablegeneratexml TRUE | FALSE ProjectName
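For example, to list the existing projects and then create a new project with the generated OSH display and runtime column propagation enabled (MyProject is a placeholder name):
dsadmin -listprojects
dsadmin -createproject MyProject
dsadmin -oshvisible TRUE MyProject
dsadmin -enablercp TRUE MyProject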
Detailed information about dsadmin and its options can be found in the dsadmin listing
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, 
strategies or opinions