Once we know the organization's data quality goals, we need insight into the source data, because whatever is in the source is reflected in every downstream result. The Investigate stage can be used to analyze the quality of
the source data, and it helps you determine the business
rules that can be used in designing any data cleansing project.
The Investigate stage indicates the degree of processing needed
to create the target cleansed data. Investigating data identifies
errors and validates the contents of fields in a data file, which lets the team identify and correct data problems before they infect new
systems. The Investigate stage analyzes data by determining the number and
frequency of unique values, and classifying or assigning a business
meaning to each occurrence of a value within a column. The Investigate
stage has the following capabilities:
- Assesses the content of the source data. This stage
organizes, parses, classifies, and analyzes patterns in the source
data. It operates on both single-domain data columns and free-form
text columns such as address columns.
- Links to a single input from any database connector
supported by InfoSphere® DataStage®,
from a flat file or data set, or from any processing stage. It is not necessary
to restrict the data to fixed-length columns, but all input data must
be alphanumeric.
- Links to one or two outputs, depending on whether
you are preparing one or two reports. Character investigations produce
a column frequency report and Word investigations produce both pattern
and token reports. The Investigate stage performs a single investigation.
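
The idea behind a column frequency report can be illustrated outside the product. The following is a minimal sketch in plain Python, not the Investigate stage's implementation: the function names `char_pattern` and `frequency_report`, the sample ZIP values, and the pattern symbols (`a` for a letter, `n` for a digit) are all assumptions chosen for illustration.

```python
from collections import Counter


def char_pattern(value):
    """Map each character to a type symbol (illustrative convention):
    letter -> 'a', digit -> 'n', anything else is kept as-is."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        else:
            out.append(ch)
    return "".join(out)


def frequency_report(values):
    """Return (value, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(v, c, round(100.0 * c / total, 1)) for v, c in counts.most_common()]


# Hypothetical sample column: one value contains a letter, one is empty.
zip_codes = ["02111", "02111", "0211A", "10001", ""]

value_report = frequency_report(zip_codes)
pattern_report = frequency_report([char_pattern(z) for z in zip_codes])

for row in pattern_report:
    print(row)
```

Counting patterns rather than raw values makes outliers easier to spot: in this sample, the value containing a letter and the empty value each show up as their own low-frequency pattern, which is the kind of observation that can feed into a cleansing rule.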
The Investigation reports, which you can generate from the IBM® InfoSphere Information Server Web console
by using data processed in the Investigate job, can help you evaluate
your data and develop better business practices. Please refer to the link below for more details.