Sunday, April 29, 2012

Reviewing health of Information Server Instance

IBM Support Assistant Lite for InfoSphere Information Server, commonly known as ISA Lite for InfoSphere Information Server, is a lightweight serviceability tool. It provides both a graphical and a text-based interface for collecting product and system log files, product and system configuration files, and system environment details.
The tool can also check the health of Information Server modules, verify that a given system meets product requirements before a selected product is installed, and resolve issues with corrupted DataStage projects.
ISA Lite can reduce time to resolution for a problem because it collects all the relevant information in one pass, saving multiple rounds of communication with support. I will cover in detail in my upcoming blogs how it can be used for each component.

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Friday, April 27, 2012

Use of NFS for Parallel Engine scratchdisk storage


While planning and designing a next-generation deployment architecture for Information Server, storage deserves careful consideration because it is key. I have seen NFS planned for scratchdisk storage, and it is not a good idea. For any new architecture performance is key, and from this perspective the use of NFS for scratchdisk storage, or for the temporary storage location specified in TMPDIR, is not recommended.

A dedicated high-speed SAN is recommended, or use local disk storage. A shared SAN works well too, but it can impact other applications when the Parallel Engine performs disk-intensive operations such as sorting, which can require significant temporary disk space.

I will cover more aspects of storage in the Grid deployment discussion in the coming weeks.

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Saturday, April 21, 2012

InfoSphere DataStage File Set, Data Set, and Sequential File: Are They the Same?

InfoSphere DataStage has three different stages that can be used to create files on the system: the File Set stage, the Data Set stage, and the Sequential File stage.

Fileset:  InfoSphere® DataStage® can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns. The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on:
  • The number of processing nodes in the default node pool
  • The number of disks in the export or default disk pool connected to each processing node in the default node pool
  • The size of the partitions of the data set
  • A file set preserves the partitioning scheme, so you can view data in the order defined by that scheme.
  • File sets carry formatting information that describes the format of the files to be read or written.
Sequential File stage: a file stage that allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link.
The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file. By default a complete file will be read by a single node (although each node might read more than one file). For fixed-width files you can configure the stage to behave differently:
  • You can specify that a single file can be read by multiple nodes. This can improve performance on cluster systems. 
  • You can specify the number of readers running on a single node, which means a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node). 
The stage executes in parallel if writing to multiple files, but sequentially if writing to a single file. Each node writes to a single file, although a node can write more than one file. When reading or writing a flat file, InfoSphere® DataStage® needs to know something about the format of the file: how the file is divided into rows and how rows are divided into columns.

Dataset Stage: It is also a file stage which allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other InfoSphere® DataStage® jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs.
  • It preserves partitioning by storing data on the nodes, which avoids repartitioning the data.
  • It stores data in DataStage's internal format, so reading from and writing to a data set takes less time than with other file types.
-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Friday, April 13, 2012

Processing InfoSphere DataStage DataSet from Command Line (Orchadmin Utility)

I recently discussed the Data Set, which is at the core of any ETL design within InfoSphere DataStage. Being able to maintain data sets from the command line is key for automation; we cannot rely on the GUI for automation and validation purposes.

We can use the "orchadmin" utility shipped with InfoSphere DataStage to delete, copy, describe, and dump a data set (also known as an ORCHESTRATE file). Because it is a command-line utility, it can be used for automation and can read input from a file or from standard input.

As usual, the environment needs to be set up as below before we use the orchadmin utility.

export DSHOME=$(cat /.dshome)
. $DSHOME/dsenv

export LD_LIBRARY_PATH=$APT_ORCHHOME/lib
export APT_CONFIG_FILE=$DSHOME/../Configurations/default.apt
export PATH=$DSHOME/bin:$APT_ORCHHOME/bin:$PATH
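Before scripting against orchadmin, it can help to fail fast if dsenv did not export what we expect. A minimal sketch, assuming a POSIX shell; the function name check_pxenv is mine for illustration, not a product command:

```shell
# Sanity check before calling orchadmin: confirm the variables that
# dsenv and the exports above should have set are actually present.
# check_pxenv is an illustrative helper name, not part of DataStage.
check_pxenv() {
  missing=""
  for v in DSHOME APT_ORCHHOME APT_CONFIG_FILE; do
    eval "val=\${$v-}"             # indirect lookup of the variable named in $v
    [ -n "$val" ] || missing="$missing $v"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  echo "environment OK"
}
```

Calling check_pxenv at the top of an automation script makes a misconfigured environment fail immediately instead of partway through a run.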

Let's see how to delete all data sets in a specified directory:
orchadmin rm *.ds

A direct rm will not delete all the related contents of a data set, which are spread across multiple nodes.
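The point above can be wrapped in a small loop for automation. A sketch assuming orchadmin is on the PATH; the DRY_RUN guard and the remove_datasets name are illustrative conveniences, not orchadmin features:

```shell
# Remove every data set in a directory through "orchadmin rm", so the
# segment files spread across the nodes are cleaned up as well.
# DRY_RUN=1 only prints what would run (illustrative guard).
remove_datasets() {
  dir=${1:-.}
  for ds in "$dir"/*.ds; do
    [ -e "$ds" ] || continue       # glob matched nothing; skip the literal
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would run: orchadmin rm $ds"
    else
      orchadmin rm "$ds"
    fi
  done
}
```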

  • Remove data from a specific data set: orchadmin truncate -n 10 input.ds
  • Remove all data from input.ds: orchadmin truncate input.ds
  • Dump all records of all partitions: orchadmin dump -name input.ds
  • Dump the value of the name field of the first 17 records of partition 0 of input.ds: orchadmin dump -part 0 -n 17 -field name input.ds
  • List the partitioning info, data files, and schema: orchadmin ll file1 file2
  • Describe disk pool pl1 in node pool ritnodes: orchadmin diskinfo -np ritnodes pl1
  • The check command (orchadmin check) checks the configuration file for any problems.
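Since the check command validates the configuration file, an automation script can gate later steps on it. A sketch; run_if_config_ok is an illustrative wrapper name I chose, not a product command:

```shell
# Run a command only if "orchadmin check" reports the APT configuration
# file is healthy; otherwise abort the automation step with an error.
# run_if_config_ok is an illustrative name, not part of DataStage.
run_if_config_ok() {
  if orchadmin check >/dev/null 2>&1; then
    "$@"                            # config looks good: run the real work
  else
    echo "APT config check failed; aborting" >&2
    return 1
  fi
}
```

For example, `run_if_config_ok orchadmin truncate input.ds` would refuse to touch the data set if the configuration file has problems.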

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Friday, April 6, 2012

What is InfoSphere DataStage DataSet

InfoSphere DataStage has a proprietary file format for files created on the system that are meant for processing by other dependent jobs or for ETL processing via DataStage. Files stored in this native format are known as data sets. Data sets are processed in parallel and are also called ORCHESTRATE files.
A data set is created via a file stage, also named Data Set, which allows you to store the data being operated on in a persistent form that can then be used by other DataStage jobs for further processing.
A fundamental concept of the DataStage parallel framework (the Orchestrate framework) is that data sets serve as the inputs and outputs of Orchestrate operators.
To put it in simple words, a data set is like a database table: a collection of identically defined rows. It is one of the mechanisms by which data is stored in transit for later parallel processing and shared across jobs. Each operator accepts input from one data set and sends its output to another.
A data set exists on all the processing nodes defined for the job in the APT_CONFIG_FILE that is currently processing it. The subset of rows in a data set located on a single processing node is referred to as a "partition" of the data set; technically, a partition is a subset of the rows earmarked for processing on the same processing node. A control file associated with each data set contains the record schema that defines the row structure.

As mentioned above, a data set preserves partitioning by storing data on the nodes, avoiding repartitioning. Also, because the data is stored in an internal binary format, it takes less time to read and write from within a DataStage job.
A data set consists of multiple files:
  1. Descriptor file: contains the schema details and the addresses of the data.
  2. Data files: contain the data in native format.
  3. Control and header files: reside in the operating system and contain the schema and other relevant information.
We can use the Data Set Management utility from the GUI; for command-line or automation purposes, use orchadmin (discussed in the next blog).

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions