After confirming the goals and analyzing the source data, we can start creating the process
that generates cleansed data, a phase known as "Design & Develop Jobs". Once these jobs are
designed, they need to run in a specified sequence within a Sequencer. Designing the components
that are required to build data quality jobs with InfoSphere® QualityStage™ involves
one or more of the following steps.
- Standardizing data: Standardizing data involves preparing and conditioning
data by using the various stages and reports within InfoSphere DataStage & QualityStage.
To correctly parse and identify each element or token, and place
it in the appropriate column in the output file, the Standardize
stage uses rule sets that are designed to comply with standards or
conventions. The Standardize rule sets can assimilate the data
and append additional information derived from the input data, such as gender.
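
To make the idea of standardization concrete, here is a minimal Python sketch of the concept. It is not the Standardize stage or one of its rule sets: the lookup tables, the `standardize_name` function, and the column names are all hypothetical, chosen only to show tokens being parsed, placed in columns, and enriched with a derived gender value.

```python
import re

# Hypothetical lookup tables; real rule sets are far larger and more nuanced.
TITLES = {"MR": "M", "MRS": "F", "MS": "F", "DR": ""}
FIRST_NAME_GENDER = {"JOHN": "M", "MARY": "F"}


def standardize_name(raw):
    """Parse a free-form name into columns and append an inferred gender."""
    # Tokenize: upper-case the input and keep alphabetic runs only.
    tokens = re.findall(r"[A-Za-z]+", raw.upper())
    record = {"title": "", "first_name": "", "last_name": "", "gender": ""}

    # A leading token that appears in TITLES goes into the title column.
    if tokens and tokens[0] in TITLES:
        title = tokens.pop(0)
        record["title"] = title
        record["gender"] = TITLES[title]

    # The first remaining token is the first name; the rest is the last name.
    if tokens:
        record["first_name"] = tokens[0]
        record["last_name"] = " ".join(tokens[1:])
        # Fall back to the first name when the title did not imply a gender.
        if not record["gender"]:
            record["gender"] = FIRST_NAME_GENDER.get(tokens[0], "")
    return record
```

For example, `standardize_name("mr. john  smith")` places `MR`, `JOHN`, and `SMITH` in separate columns and appends the gender `M`, even though the input never states it explicitly.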
- Matching data: After the data is standardized, matching identifies either duplicates
or cross-references to other files. The data cleansing assignment determines the matching
strategy: whether to match individuals, match companies, perform householding, or
reconcile inventory transactions.
Matching identifies all records in one source (the input source) that correspond to
similar records (such as a person, household, address, or event) in another source
(the reference source). Matching also identifies duplicate records within one source and builds
relationships between records in multiple sources. Relationships are
defined by business rules at the data level.
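
The following sketch illustrates duplicate detection within a single source. It is a simplified stand-in, not the scoring that QualityStage match jobs perform: it groups records by a blocking key and compares names with the standard library's `difflib.SequenceMatcher`. The field names (`id`, `name`, `postcode`), the `find_duplicates` function, and the 0.85 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, threshold=0.85):
    """Return pairs of record ids that look like the same entity."""
    # Blocking: only compare records that share a postcode, which keeps
    # the number of pairwise comparisons manageable.
    blocks = {}
    for rec in records:
        blocks.setdefault(rec["postcode"], []).append(rec)

    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            if similarity(left["name"], right["name"]) >= threshold:
                pairs.append((left["id"], right["id"]))
    return pairs
```

Matching against a reference source follows the same pattern, except that each input record is compared with the reference records in its block rather than with its own neighbors.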
- Identifying surviving data: After data matching is complete, you need to identify which records
(or columns of a set of duplicate records) from the match data survive
and become available for formatting, loading, or reporting. Survivorship ensures that the best available
data survives and is correctly prepared for the target destination.
Thus, survivorship consolidates duplicate records, creating a best-of-breed
representation of the matched data and enabling organizations to cross-populate
all data sources with the best available data. In this step, when you have duplicate records, you must make these
decisions:
- To keep all the duplicates
- To keep only one record that contains all the information that
is in the duplicates
InfoSphere QualityStage provides
survivorship to perform one or more of the following functions
on your data:
- Resolve conflicts with records that pertain to one entity
- Optionally create a cross-reference table to link all surviving
records to the legacy source
- Supply missing values in one record with values from other records
on the same entity
- Resolve conflicting data values on an entity according to your
business rules
- Enrich existing data with data from external sources
- Customize the output to meet specific organizational and technical
requirements
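
As a rough illustration of survivorship, the sketch below merges one group of matched records into a single surviving record. It is not the product's survivorship logic: the `survive` function, the column names, and the "most recently updated non-empty value wins" rule are hypothetical, standing in for whatever business rules apply. It supplies missing values from other records, resolves conflicts by recency, and keeps the source ids as a simple cross-reference.

```python
def survive(group, columns=("name", "phone", "email")):
    """Merge one group of matched records into a single surviving record."""
    # Hypothetical business rule: newer records win conflicts. ISO dates
    # (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(group, key=lambda r: r["updated"], reverse=True)

    survivor = {}
    for column in columns:
        # Take the first non-empty value, so gaps in the newest record
        # are filled from older records on the same entity.
        survivor[column] = next(
            (r[column] for r in ordered if r.get(column)), ""
        )

    # Cross-reference: which source records fed this survivor.
    survivor["source_ids"] = sorted(r["id"] for r in group)
    return survivor
```

Given two records for the same person, where the newer one has a phone number but no email and the older one has the reverse, the survivor carries the newer name, the phone number from the newer record, and the email from the older one.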