CS614 Notes

The document discusses the serious issues caused by dirty data, including incorrect government decisions and financial losses in marketing. It categorizes data anomalies into syntactic, semantic, and coverage issues, and outlines various causes for missing data and methods for handling it. Additionally, it describes automatic data cleansing techniques and the Basic Sorted Neighborhood method for identifying and merging duplicate records.


• Serious problems due to dirty data
• Decisions taken at the government level using wrong data, resulting in undesirable outcomes.
• In direct mail marketing, sending letters to wrong addresses leads to loss of money and a bad reputation.
• 3 Classes of Anomalies
• Syntactically Dirty Data
o Lexical Errors
o Irregularities
• Semantically Dirty Data
o Integrity Constraint Violation
o Business Rule Contradiction
o Duplication
• Coverage Anomalies
o Missing Attributes
o Missing Records
• Lexical errors: For example, assume the data is stored in table form, with each row representing a
tuple and each column an attribute. If we expect the table to have five columns because each tuple has
five attributes, but some or all of the rows contain only four columns, then the actual structure of the
data does not conform to the specified format (a small check for this is sketched below).
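
A minimal check for this kind of lexical error (my own sketch in Python; the file name, delimiter, and column count are assumptions for illustration):

import csv

EXPECTED_COLUMNS = 5  # the schema says every tuple has five attributes

def find_lexical_errors(path, delimiter=","):
    """Return the (1-based) line numbers of rows whose column count
    does not match the expected schema."""
    bad_rows = []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f, delimiter=delimiter), start=1):
            if len(row) != EXPECTED_COLUMNS:
                bad_rows.append(lineno)
    return bad_rows

# Usage (hypothetical file): print(find_lexical_errors("customers.csv"))
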
• Why Missing Rows?
• Equipment malfunction (bar code reader, keyboard, etc.)
• Inconsistent with other recorded data and thus deleted.
• Data not entered due to misunderstanding/illegibility.
• Data not considered important at the time of entry (e.g. Y2K).
• OCR (Optical Character Reader) errors.
• Handling missing data (a short sketch of these strategies follows this list)
• Dropping records.
• “Manually” filling missing values.
• Using a global constant as filler.
• Using the attribute mean (or median) as filler.
• Using the most probable value as filler.
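
A minimal sketch of these fill strategies using pandas (the table and column names are invented for illustration, not taken from the course):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31, np.nan],
    "city":   ["Lahore", None, "Karachi", "Lahore", None],
    "income": [50000, 62000, np.nan, 45000, 58000],
})

dropped  = df.dropna()                                    # dropping records with missing values
constant = df.fillna({"city": "UNKNOWN"})                 # global constant as filler
central  = df.fillna({"age": df["age"].mean(),            # attribute mean as filler
                      "income": df["income"].median()})   # attribute median as filler
probable = df.fillna({"city": df["city"].mode()[0]})      # most probable value as filler

print(central)
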
• Key Based Classification of Problems
• Primary key problems
o Same PK but different data.
o Same entity with different keys.
o PK in one system but not in the other.
o Same PK but in different formats.
• Non-primary key problems
o Different encoding in different sources.
o Multiple ways to represent the same information.
o Sources might contain invalid data.
o Two fields with different data but the same name.
o Required fields left blank.
o Data erroneous or incomplete.
o Data contains null values. A null can stand for data that is genuinely missing or unknown, an attribute that does not apply to an entity, data that is pending, or data that is only partially known.
• Automatic Data Cleansing (a sketch of the statistical approach follows this list)
• Statistical
• Pattern Based
• Clustering
• Association Rules
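
For the statistical technique, one common sketch (my own illustration, not necessarily the lecture's exact procedure) flags values that lie several standard deviations from the attribute mean as suspect:

import numpy as np

def flag_outliers(values, threshold=3.0):
    """Return a boolean mask marking values more than `threshold`
    standard deviations away from the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return np.zeros(values.shape, dtype=bool)
    return np.abs(values - mean) / std > threshold

ages = [23, 25, 27, 31, 29, 26, 240]        # 240 is almost certainly a data-entry error
print(flag_outliers(ages, threshold=2.0))   # only the last value is flagged
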
• Problems due to data duplication
• False frequency distributions.
• Incorrect aggregates due to double counting.
• When data from multiple sources is merged, duplicate records appear in the merged database. The issue is to identify and eliminate these duplicates; this is known as the merge/purge problem (a small illustration of double counting follows).
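
A small pandas illustration of double counting caused by exact duplicates (the data is invented); approximate duplicates are the subject of the BSN method below:

import pandas as pd

sales = pd.DataFrame({
    "customer_id": [101, 102, 101, 103, 102],   # customers 101 and 102 were loaded twice
    "amount":      [500, 300, 500, 250, 300],
})

print("Total with duplicates:   ", sales["amount"].sum())                     # 1850 (double counted)
print("Total without duplicates:", sales.drop_duplicates()["amount"].sum())   # 1050
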
• Data cleansing can be broken down into six steps (the standardizing step is illustrated after the list):
o Elementizing
o Standardizing
o Verifying
o Matching
o Householding
o Documenting
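
As an illustration of the standardizing step (the field and its encodings are assumptions for the example), a simple normalizer maps different spellings of the same value to one canonical form:

def standardize_gender(value):
    """Map assorted encodings of gender to a single canonical form."""
    mapping = {"m": "M", "male": "M", "1": "M",
               "f": "F", "female": "F", "0": "F"}
    return mapping.get(str(value).strip().lower(), "UNKNOWN")

print([standardize_gender(v) for v in ["Male", "F", " m ", "female", "2"]])
# ['M', 'F', 'M', 'F', 'UNKNOWN']
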
• Basic Sorted Neighborhood (BSN) Method (a minimal sketch follows these steps)
• Step 1: Create Keys
o Compute a key for each record in the list by extracting relevant fields or portions of fields.
o The effectiveness of this method depends heavily on a properly chosen key.
• Step 2: Sort Data
o Sort the records in the data list using the key of Step 1.
• Step 3: Merge
o Move a fixed-size window through the sequential list of records, limiting the comparisons for
matching records to those records in the window.
o If the size of the window is w records, then every new record entering the window is compared
with the previous w-1 records.
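
A minimal sketch of the three BSN steps (the key construction and the match rule are simplified assumptions for illustration, not the exact course formulation):

def make_key(record):
    """Step 1: build a key from portions of relevant fields
    (first 3 letters of last name + first letter of first name + zip)."""
    return (record["last"][:3] + record["first"][:1] + record["zip"]).upper()

def similar(a, b):
    """A deliberately simple match rule: same zip code and same last-name prefix."""
    return a["zip"] == b["zip"] and a["last"][:3].lower() == b["last"][:3].lower()

def bsn_merge(records, w=3):
    """Steps 2 and 3: sort on the key, then slide a window of w records,
    comparing each new record only with the previous w-1 records."""
    records = sorted(records, key=make_key)            # Step 2: sort
    matches = []
    for i, rec in enumerate(records):
        for prev in records[max(0, i - (w - 1)):i]:    # Step 3: fixed-size window
            if similar(prev, rec):
                matches.append((prev, rec))
    return matches

people = [
    {"first": "Ayesha", "last": "Khan",  "zip": "54000"},
    {"first": "A.",     "last": "Khan",  "zip": "54000"},   # likely the same person
    {"first": "Bilal",  "last": "Ahmed", "zip": "44000"},
]
for a, b in bsn_merge(people):
    print("possible duplicate:", a, "<->", b)
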
