InfoSphere Information Server How To: Data Cleansing & Analyzing source data

Once we know the organization's data quality goals, we need insight into the source data, because whatever is in the source is reflected in every downstream result. The Investigate stage can be used to analyze the quality of the source data, and it helps you determine the business rules to use when designing a data cleansing project.

The Investigate stage indicates the degree of processing needed to create the target cleansed data. Investigating data identifies errors, validates the contents of fields in a data file, and lets the team identify and correct data problems before they infect new systems. The Investigate stage analyzes data by determining the number and frequency of unique values, and by classifying or assigning a business meaning to each occurrence of a value within a column. The Investigate stage has the following capabilities:

Assesses the content of the source data. This stage organizes, parses, classifies, and analyzes patterns in the source data. It operates on both single-domain data columns and free-form text columns such as address columns.
Links to a single input from any database connector supported by InfoSphere® DataStage®, a flat file or data set, or from any processing stage. It is not necessary to restrict the data to fixed-length columns, but all input data must be alphanumeric.
Links to one or two outputs, depending on whether you are preparing one or two reports. Character investigations produce a column frequency report, and Word investigations produce both pattern and token reports. The Investigate stage performs a single investigation.
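To make the idea of frequency, pattern, and token reports more concrete, here is a minimal Python sketch of the kind of counting involved. It is not QualityStage code and does not reproduce the Investigate stage's actual output: the function names, the sample data, the symbols used in the patterns (`a`, `n`, `b`, `^`, `+`, `T`), and the small `CLASSES` table are all assumptions made up for illustration.

```python
from collections import Counter

def frequency_report(values):
    """Return (value, count, percent) rows sorted by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c, round(100.0 * c / total, 1)) for v, c in counts.most_common()]

def char_pattern(value):
    """Character-level pattern: letter -> 'a', digit -> 'n', space -> 'b'.
    Any other character is kept as-is. Symbols are illustrative assumptions."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical classification table for a word-level investigation.
CLASSES = {"ST": "T", "STREET": "T", "AVE": "T"}

def word_pattern(text):
    """Word-level pattern: classified token -> its class symbol,
    all-digit token -> '^', any other token -> '+'."""
    symbols = []
    for tok in text.upper().split():
        if tok in CLASSES:
            symbols.append(CLASSES[tok])
        elif tok.isdigit():
            symbols.append("^")
        else:
            symbols.append("+")
    return "".join(symbols)

# Hypothetical single-domain column (postal codes).
postcodes = ["02139", "02139", "0213A", "94105", "9410 5"]
value_report = frequency_report(postcodes)
char_pattern_report = frequency_report(char_pattern(p) for p in postcodes)

# Hypothetical free-form column (addresses).
addresses = ["123 Main St", "45 Oak Ave", "Main Street"]
token_report = frequency_report(
    tok for a in addresses for tok in a.upper().split()
)
word_pattern_report = frequency_report(word_pattern(a) for a in addresses)

for row in char_pattern_report:
    print(row)
```

In this sample, the rare patterns in `char_pattern_report` are the ones that point at suspect values such as `0213A` and `9410 5`, which is the kind of insight that helps you decide which cleansing rules are needed.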

The Investigation reports, which you can generate from the IBM® InfoSphere Information Server Web console by using data processed in the Investigate job, can help you evaluate your data and develop better business practices. Please refer to the link below for more details.

Deep Insight into Analyzing_Source_Data & Phase 2 of Data Cleansing

-Ritesh

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Data Cleansing & Analyzing source data - QualityStage-III