Tuesday, May 22, 2012

Data Cleansing - Consistency or Standardization Phase - QualityStage-IV

After confirming the goals and analyzing the source data, we can start building the process that generates cleansed data, a phase known as "Design & Develop Jobs". Once these jobs are designed, they need to run in a specified sequence within a Sequencer. Designing the components required to build data quality jobs with InfoSphere® QualityStage™ involves one or more of the following steps.
Standardizing data: Standardizing data involves preparing and conditioning the data by using various stages and reports within InfoSphere DataStage & QualityStage. These help to correctly parse and identify each element or token and place it in the appropriate column of the output file. The Standardize stage uses rule sets that are designed to comply with standards or conventions. The Standardize rule sets can assimilate the data and append additional information derived from the input data, such as gender.
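To make the idea of parsing tokens into columns more concrete, here is a minimal Python sketch. It is not QualityStage code and does not use any QualityStage rule set; the function `standardize_name` and the lookup tables `TITLES` and `SUFFIXES` are hypothetical names invented for illustration, and real rule sets handle far more cases.

```python
import re

# Hypothetical lookup tables; a real rule set would be far larger.
TITLES = {"MR": "M", "MRS": "F", "MS": "F", "DR": ""}
SUFFIXES = {"JR", "SR", "II", "III"}

def standardize_name(raw):
    """Parse a free-form name into fixed columns and infer gender from the title."""
    # Replace punctuation with spaces, uppercase, and split into tokens.
    tokens = re.sub(r"[^\w\s]", " ", raw).upper().split()
    record = {"title": "", "first_name": "", "middle_name": "",
              "last_name": "", "suffix": "", "gender": ""}
    # A leading title is moved to its own column and used to append gender.
    if tokens and tokens[0] in TITLES:
        record["title"] = tokens.pop(0)
        record["gender"] = TITLES[record["title"]]
    # A trailing generational suffix is moved to its own column.
    if tokens and tokens[-1] in SUFFIXES:
        record["suffix"] = tokens.pop()
    # Whatever remains is treated as first, middle, and last name.
    if tokens:
        record["first_name"] = tokens.pop(0)
    if tokens:
        record["last_name"] = tokens.pop()
    record["middle_name"] = " ".join(tokens)
    return record
```

Calling `standardize_name("Mr. John Q. Smith, Jr.")` places each token in its own column and appends a gender value derived from the title, which is the kind of extra information the paragraph above describes.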
Matching data: After the data is standardized, matching identifies either duplicates or cross-references to other files. The data cleansing assignment determines the matching strategy: whether to match individuals, match companies, perform householding, or reconcile inventory transactions.
  • Matching identifies all records in the input source that correspond to similar records (such as a person, household, address, or event) in another source (the reference source). Matching also identifies duplicate records within one source and builds relationships between records in multiple sources. Relationships are defined by business rules at the data level.
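The duplicate-detection side of matching can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how QualityStage scores matches: it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and the field names (`id`, `name`, `zip`), the function `find_duplicates`, and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.85):
    """Return pairs of record ids that likely refer to the same person."""
    # Blocking: only records sharing a postal code are compared,
    # which avoids comparing every record with every other record.
    blocks = {}
    for rec in records:
        blocks.setdefault(rec["zip"], []).append(rec)
    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            if similarity(left["name"], right["name"]) >= threshold:
                pairs.append((left["id"], right["id"]))
    return pairs
```

With this sketch, "John Smith" and "Jon Smith" in the same postal code are reported as a candidate pair, while an identical name in a different postal code is never compared. The choice of blocking key and threshold plays the role of the business rules mentioned above.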
Identifying surviving data: After data matching is complete, you need to identify which records (or which columns of a set of duplicate records) from the matched data survive and become available for formatting, loading, or reporting. Survivorship ensures that the best available data survives and is correctly prepared for the target destination. Thus, survivorship consolidates duplicate records, creating a best-of-breed representation of the matched data and enabling organizations to cross-populate all data sources with the best available data. In this step, when you have duplicate records, you must make one of these decisions:
  • To keep all the duplicates
  • To keep only one record that contains all the information that is in the duplicates 
InfoSphere QualityStage provides survivorship to perform one or more of the following functions on your data:
  • Resolve conflicts with records that pertain to one entity
  • Optionally create a cross-reference table to link all surviving records to the legacy source
  • Supply missing values in one record with values from other records on the same entity
  • Resolve conflicting data values on an entity according to your business rules
  • Enrich existing data with data from external sources
  • Customize the output to meet specific organizational and technical requirements
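As a rough illustration of survivorship, the Python sketch below consolidates a group of matched records into one best-of-breed record and builds a simple cross-reference back to the source records. It is not the QualityStage Survive stage; the function `survive`, the field names (`id`, `updated`), and the rule "the newest non-empty value wins" are assumptions chosen for the example, and `updated` is assumed to hold ISO-formatted dates so that string sorting matches date order.

```python
def survive(group, columns):
    """Build one surviving record from a group of matched duplicates."""
    # Assumed business rule: the most recently updated record wins,
    # but an empty value never overwrites a populated one.
    ordered = sorted(group, key=lambda r: r["updated"], reverse=True)
    survivor = {}
    for col in columns:
        # Take the first non-empty value, scanning newest to oldest.
        survivor[col] = next((r[col] for r in ordered if r.get(col)), "")
    # Cross-reference: (surviving record id, legacy record id) pairs,
    # using the newest record's id as the surviving id.
    xref = [(ordered[0]["id"], r["id"]) for r in group]
    return survivor, xref
```

For two duplicates where the newer record has a phone number but no email and the older one has an email but no phone number, the surviving record carries both values, which corresponds to supplying missing values in one record from other records on the same entity.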
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions
