Monday, May 28, 2012

Data Cleansing & Cleansed Data Result Evaluation - QualityStage-IV

Once data is standardized and matched, the final stage in the data cleansing workflow is to evaluate the results of the previous phases and to identify any organizational process improvements. The success of a cleansing project comes from iterative reviews and refinements throughout each phase.
In this phase of the workflow, you look at the results of the process and determine whether you need to perform any of these activities:
  • Revisit a previous phase
  • Refine some of the conditions
  • Repeat the process, starting from the phase you revisit
If your data quality goals are simple, the results produced after one data cleansing iteration might be satisfactory. If not, you need to repeat this workflow, making different decisions and further refinements with each iteration. Although evaluation is the final step of the data cleansing process, a well-designed and developed job or sequence of jobs requires you to evaluate each step along the way and decide on future actions. As a best practice, results should be evaluated at the end of each phase to avoid heading in entirely the wrong direction with inconsistent data. This process fine-tunes a job and its stage components to achieve the highest quality data.
At the end of the design and development of the data cleansing jobs, you should evaluate the entire process. That means gaining insight into the data, the data cleansing process, the data collection process, and the evaluation process itself. Evaluation helps you make changes to the next data cleansing project and refine jobs, or helps the organization change its business rules or even its organizational goals. Evaluating the results of the data cleansing process can help an organization maintain data management and ensure that corporate data supports the organizational goals.
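QualityStage itself drives this evaluation through its own reports and clients, but the kind of check the phase performs can be sketched in a few lines of Python. The fragment below is only an illustration under assumed inputs: the file name, column names, and thresholds are hypothetical, and a real evaluation would rely on the product's match statistics and investigation reports rather than a script.

```python
# Illustrative sketch only: compute simple quality metrics for one iteration's
# output and decide whether another pass through the workflow is needed.
# File name, column names, and thresholds are hypothetical.
import csv
from collections import Counter

def evaluate(path, key_columns, required_columns,
             max_duplicate_rate=0.01, min_completeness=0.98):
    """Return (passed, metrics) for one cleansing iteration's output file."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    total = len(rows) or 1

    # Completeness: share of rows where every required column is populated.
    complete = sum(1 for r in rows
                   if all((r.get(c) or "").strip() for c in required_columns))
    completeness = complete / total

    # Duplicate rate: share of rows whose match key occurs more than once.
    keys = Counter(tuple((r.get(c) or "").strip().upper() for c in key_columns)
                   for r in rows)
    duplicate_rate = sum(n for n in keys.values() if n > 1) / total

    metrics = {"completeness": completeness, "duplicate_rate": duplicate_rate}
    passed = completeness >= min_completeness and duplicate_rate <= max_duplicate_rate
    return passed, metrics

passed, metrics = evaluate("cleansed_customers.csv",
                           key_columns=["given_name", "family_name", "postal_code"],
                           required_columns=["given_name", "family_name"])
print(metrics, "- goals met" if passed else "- repeat the workflow")
```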

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Tuesday, May 22, 2012

Data Cleansing - Consistency or Standardization Phase - QualityStage-IV


After confirming the goals and analyzing the source data, we can start creating the process that will generate cleansed data; this is known as "Design & Develop Jobs". Once these jobs are designed, they need to run in a specified sequence within a Sequencer. Designing the components that are required to build data quality jobs with InfoSphere® QualityStage™ involves one or more of the following steps.
Standardizing data: Standardizing data involves preparing and conditioning data by using various stages and reports within InfoSphere DataStage & QualityStage. This helps to correctly parse and identify each element or token and place it in the appropriate column of the output file. The Standardize stage uses rule sets that are designed to comply with standards or conventions, and these rule sets can also derive and append additional information from the input data, such as gender.
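To make the idea of a rule set concrete, here is a deliberately tiny Python sketch of standardization: parse a free-form name into tokens, classify them, place them in named output columns, and append a derived gender value. The token classes, column names, and title-to-gender mapping are invented for this illustration; real QualityStage rule sets (such as the country-specific name and address rule sets) are far richer and are configured in the product rather than hand-coded like this.

```python
import re

# Hypothetical title list and title-to-gender mapping for this example only.
TITLES = {"MR", "MRS", "MS", "DR"}
GENDER_BY_TITLE = {"MR": "M", "MRS": "F", "MS": "F"}

def standardize_name(free_form):
    """Parse a free-form name into title, given name, family name, and gender."""
    tokens = re.split(r"[\s,]+", free_form.strip().upper())
    out = {"title": "", "given_name": "", "family_name": "", "gender": ""}
    if tokens and tokens[0].rstrip(".") in TITLES:
        out["title"] = tokens.pop(0).rstrip(".")
        out["gender"] = GENDER_BY_TITLE.get(out["title"], "")  # appended information
    if tokens:
        out["given_name"] = tokens[0]
    if len(tokens) > 1:
        out["family_name"] = tokens[-1]
    return out

print(standardize_name("Mrs. Jane   Doe"))
# {'title': 'MRS', 'given_name': 'JANE', 'family_name': 'DOE', 'gender': 'F'}
```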
Matching data: After the data is standardized, matching identifies either duplicates or cross-references to other files. The data cleansing assignment determines the matching strategy, whether it is to match individuals, match companies, perform householding, or reconcile inventory transactions; a simple sketch of this logic follows the bullet below.
  • Matching identifies all records in an input source that correspond to similar records (such as a person, household, address, or event) in another source (the reference source). Matching also identifies duplicate records in one source and builds relationships between records in multiple sources. Relationships are defined by business rules at the data level.
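A minimal sketch of this matching logic, assuming already-standardized records, is shown below. It stands in for a QualityStage match specification with a simple block on postal code plus a weighted name-similarity score; the field names, weights, and cutoff are hypothetical.

```python
# Hedged sketch of duplicate matching: block on postal code, then score
# name similarity within each block. Not the product's match algorithm.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def match_pairs(records, cutoff=0.85):
    """Return pairs of record ids judged to refer to the same entity."""
    blocks = {}
    for rec in records:                          # blocking: only compare within a block
        blocks.setdefault(rec["postal_code"], []).append(rec)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):      # pairwise comparison within the block
            score = (0.5 * similarity(a["given_name"], b["given_name"])
                     + 0.5 * similarity(a["family_name"], b["family_name"]))
            if score >= cutoff:
                pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs

records = [
    {"id": 1, "given_name": "PAUL",  "family_name": "ALLEN", "postal_code": "98101"},
    {"id": 2, "given_name": "PAUL",  "family_name": "ALLAN", "postal_code": "98101"},
    {"id": 3, "given_name": "ALLEN", "family_name": "PAUL",  "postal_code": "10001"},
]
print(match_pairs(records))   # [(1, 2, 0.9)]
```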
Identifying surviving data: After data matching is complete, you need to identify which records (or which columns of a set of duplicate records) from the matched data survive and become available for formatting, loading, or reporting. Survivorship ensures that the best available data survives and is correctly prepared for the target destination. Thus, survivorship consolidates duplicate records, creating a best-of-breed representation of the matched data and enabling organizations to cross-populate all data sources with the best available data; a small sketch of this consolidation follows the lists below. In this step, when you have duplicate records, you must make one of these decisions:
  • To keep all the duplicates
  • To keep only one record that contains all the information that is in the duplicates 
InfoSphere QualityStage provides survivorship to perform one or more of the following functions on your data:
  • Resolve conflicts with records that pertain to one entity
  • Optionally create a cross-reference table to link all surviving records to the legacy source
  • Supply missing values in one record with values from other records on the same entity
  • Resolve conflicting data values on an entity according to your business rules
  • Enrich existing data with data from external sources
  • Customize the output to meet specific organizational and technical requirements
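The consolidation described above can be illustrated with a small Python sketch. The survival rules used here (prefer non-blank values from the most recently updated record) and the field names are assumptions made for this example; in QualityStage such rules are configured in the Survive stage rather than written by hand.

```python
def survive(duplicate_group, columns):
    """Consolidate a group of matched duplicate records into one surviving record."""
    # Prefer values from the most recently updated record, falling back to any
    # non-blank value from the rest of the group (hypothetical survival rule).
    ordered = sorted(duplicate_group, key=lambda r: r.get("last_updated", ""), reverse=True)
    survivor = {}
    for col in columns:
        survivor[col] = next((r[col] for r in ordered if (r.get(col) or "").strip()), "")
    # Cross-reference back to the legacy source records that fed the survivor.
    survivor["source_ids"] = [r["id"] for r in duplicate_group]
    return survivor

group = [
    {"id": "A1", "name": "PAUL ALLEN", "phone": "",             "last_updated": "2011-04-01"},
    {"id": "B7", "name": "P ALLEN",    "phone": "425-555-0100", "last_updated": "2012-01-15"},
]
print(survive(group, ["name", "phone"]))
# {'name': 'P ALLEN', 'phone': '425-555-0100', 'source_ids': ['A1', 'B7']}
```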
  -Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Tuesday, May 15, 2012

Data Cleansing & Analyzing source data - QualityStage-III

Once we are aware of the organization's goals for data quality, we need to gain insight into the source data, because the source is what gets reflected in all downstream results. The Investigate stage can be used to analyze the quality of the source data; it helps you determine the business rules that can be used in designing any data cleansing project.
The Investigate stage indicates the degree of processing needed to create the target cleansed data. Investigating data identifies errors and validates the contents of fields in a data file, and lets the team identify and correct data problems before they infect new systems. The Investigate stage analyzes data by determining the number and frequency of unique values, and by classifying or assigning a business meaning to each occurrence of a value within a column. The Investigate stage has the following capabilities:
  • Assesses the content of the source data. This stage organizes, parses, classifies, and analyzes patterns in the source data. It operates on single-domain data columns as well as free-form text columns such as address columns.
  • Links to a single input from any database connector supported by InfoSphere® DataStage®, a flat file or data set, or from any processing stage. It is not necessary to restrict the data to fixed-length columns, but all input data must be alphanumeric.
  • Links to one or two outputs, depending on whether you are preparing one or two reports. Character investigations produce a column frequency report and Word investigations produce both pattern and token reports. The Investigate stage performs a single investigation.
The investigation reports, which you can generate from the IBM® InfoSphere Information Server Web console by using data processed in the Investigate job, can help you evaluate your data and develop better business practices.
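As a rough illustration of what those reports contain, the sketch below computes a column frequency count, a character-pattern count, and a token count over one column of a CSV file. The letter/digit masking only approximates QualityStage pattern notation, and the file and column names are hypothetical.

```python
# Illustrative only: approximate the three Investigate outputs (column
# frequency, pattern, and token reports) for a single column.
import csv
from collections import Counter

def char_pattern(value):
    """Mask a value: letters -> 'a', digits -> 'n', other characters kept as-is."""
    return "".join("a" if ch.isalpha() else "n" if ch.isdigit() else ch for ch in value)

def investigate(path, column):
    values, patterns, tokens = Counter(), Counter(), Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            value = (row.get(column) or "").strip().upper()
            values[value] += 1                   # column frequency report
            patterns[char_pattern(value)] += 1   # pattern report
            tokens.update(value.split())         # token report (word investigation)
    return values, patterns, tokens

values, patterns, tokens = investigate("source_customers.csv", "address_line1")
print(patterns.most_common(5))   # e.g. [('nnn aaaa aa', 512), ...]
```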

 -Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Friday, May 11, 2012

Data Cleansing & Understand goals and requirements - QualityStage-II

Phase one of the data cleansing workflow within InfoSphere QualityStage is to understand your organizational goals and requirements. It helps you to:
  • Translate high-level mission directives into specific data cleansing assignments
  • Make assumptions about the requirements and structure of the cleansed data
Data quality needs and objectives vary in each organization. So before you start designing a data cleansing project, you need to understand the organizational goals that are driving the data cleansing need and how they define the data cleansing assignment (the effective goal). This insight helps you gain a sense of the complexity of the intended cleansed data and provides a context that helps you make decisions throughout the workflow.
The success of a data cleansing project benefits from well-defined requirements for the output data results. As a best practice, provide opportunities throughout every phase for domain experts and knowledge holders, who understand the organizational requirements of the data, to review the output results, to help iteratively refine requirements, and ultimately to approve the results. This collaborative process helps you meet the organizational requirements, increasing your chances of successful quality results.

 -Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Tuesday, May 8, 2012

How InfoSphere QualityStage cleans the data - QualityStage-I

For any task related to data mining and cleansing, it is a must to have knowledge of the overall workflow, as it helps you streamline your data cleansing implementation. From the InfoSphere QualityStage perspective, creating cleansed data is a four-phase, iterative approach:
  1. Understand organizational goals and how they determine your requirements
  2. Understand and analyze the nature and content of the source data
  3. Design and develop the jobs that cleanse the data
  4. Evaluate the results
 


I will cover each of these phases in upcoming blogs.

 -Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Tuesday, May 1, 2012

What is Data Cleansing

Any organization's data contains valuable information that the organization needs in order to conduct business, whether it is managing customers and products, managing operations, evaluating corporate performance, or providing business intelligence. Data is of high quality when it is up-to-date, complete, accurate, and easy to use. Depending on your organizational goals, high quality data can mean any of the following:
  • Your customer records do not include duplicate records for the same person.
  • Your inventory records do not include duplicates for the same materials.
  • Your vendor records do not include vendors you no longer use or suppliers no longer in business.
  • You can be confident that Paul Allen and Allen Paul are records for two different customers, not the result of a data entry mistake.
  • Your employees can find the data they need when they need it. Confident that they are working with high quality data, they do not need to create their own individual version of a database.
IBM® InfoSphere® QualityStage™ provides a methodology and development environment for cleansing and improving data quality in any domain. InfoSphere QualityStage helps you deliver and maintain data quality so that your organization can rely upon its corporate data investment. Whether your organization is transitioning from one or more information systems to another, upgrading its organization and its processes, or integrating and leveraging information across the enterprise, your goal is to determine the requirements and structure of the data that will address the organizational goal. Data that is restructured to conform to these new requirements is called cleansed data (and the process is sometimes referred to generally as data re-engineering).

 -Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions