cs614 notes
Decisions taken at the government level using wrong data lead to undesirable results.
In direct mail marketing, sending letters to wrong addresses causes loss of money and a bad reputation.
3 Classes of Anomalies
Syntactically Dirty Data
o Lexical Errors
o Irregularities
Semantically Dirty Data
o Integrity Constraint Violation
o Business rule contradiction
o Duplication
Coverage Anomalies
o Missing Attributes
o Missing Records
Lexical errors: For example, assume the data to be stored in table form with each row representing a
tuple and each column an attribute. If we expect the table to have five columns because each tuple has
five attributes but some or all of the rows contain only four columns then the actual structure of the
data does not conform to the specified format.
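A minimal sketch of how such a lexical error could be detected, assuming the data arrives as comma-separated text and each tuple should have five attributes. The function name, the sample rows and the expected column count are illustrative assumptions, not part of the notes.

```python
import csv
import io

EXPECTED_COLUMNS = 5  # assumed number of attributes per tuple


def find_lexical_errors(text, expected=EXPECTED_COLUMNS):
    """Return (row_number, field_count) for every row whose column count differs from expected."""
    bad_rows = []
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != expected:
            bad_rows.append((row_number, len(row)))
    return bad_rows


# Hypothetical sample: the second row has only four columns.
sample = (
    "1,Ali,Lahore,1990,M\n"
    "2,Sara,Karachi,1988\n"
    "3,Omar,Quetta,1992,M\n"
)
lexical_errors = find_lexical_errors(sample)
print(lexical_errors)
```

The check looks only at structure (number of columns), not at whether the values themselves are valid.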
Why Missing Rows?
Equipment malfunction (bar code reader, keyboard etc.)
Inconsistent with other recorded data and thus deleted.
Data not entered due to misunderstanding/illegibility.
Data not considered important at the time of entry (e.g. Y2K).
OCR (Optical Character Reader)
Handling missing data
Dropping records.
“Manually” filling missing values.
Using a global constant as filler.
Using the attribute mean (or median) as filler.
Using the most probable value as filler.
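The options above can be sketched on a single numeric attribute, with None marking a missing value. This is an illustration under assumptions: the helper names and the sample ages are made up, and the most frequent value is used as a simple stand-in for the "most probable" value (a real system might use a predictive model instead). Manual filling is not shown.

```python
from collections import Counter
from statistics import mean, median


def drop_missing(values):
    """Dropping records: keep only the entries that have a value."""
    return [v for v in values if v is not None]


def fill_constant(values, filler):
    """Global constant as filler."""
    return [filler if v is None else v for v in values]


def fill_mean(values):
    """Attribute mean (of the known values) as filler."""
    return fill_constant(values, mean(drop_missing(values)))


def fill_median(values):
    """Attribute median (of the known values) as filler."""
    return fill_constant(values, median(drop_missing(values)))


def fill_most_frequent(values):
    """Most frequent known value as filler (assumed proxy for 'most probable')."""
    most_common_value = Counter(drop_missing(values)).most_common(1)[0][0]
    return fill_constant(values, most_common_value)


ages = [20, None, 30, 30, None, 60]  # hypothetical attribute with two missing values
print(drop_missing(ages))
print(fill_constant(ages, -1))
print(fill_mean(ages))
print(fill_median(ages))
print(fill_most_frequent(ages))
```

Note that the mean filler (35) is a value that never occurs in the data, while the median and most frequent fillers (30) do occur. This is one reason the choice of filler matters.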
Key Based Classification of Problems
Primary key problems
o Same PK but different data.
o Same entity with different keys.
o PK in one system but not in other.
o Same PK but in different formats.
Non-Primary key problems
o Different encoding in different sources.
o Multiple ways to represent the same information.
o Sources might contain invalid data.
o Two fields with different data but same name.
o Required fields left blank.
o Data erroneous or incomplete.
o Data contains null values, which can mean any of the following:
Data that is genuinely missing or unknown,
An attribute does not apply to an entity,
Data that is pending, or
Data that is only partially known.
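Three of the primary key problems listed earlier (same PK in different formats, same PK but different data, PK in one system but not in the other) can be detected with a sketch like the following. The normalization rule (drop separators, ignore case), the source names and the sample records are assumptions made for illustration only.

```python
import re


def normalize_key(raw_key):
    """Assumed rule: remove non-alphanumeric separators and ignore case."""
    return re.sub(r"[^0-9A-Za-z]", "", str(raw_key)).upper()


def compare_sources(source_a, source_b):
    """Compare two {key: data} mappings after bringing keys to one format."""
    a = {normalize_key(k): v for k, v in source_a.items()}
    b = {normalize_key(k): v for k, v in source_b.items()}
    conflicts = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])  # same PK, different data
    only_in_a = sorted(a.keys() - b.keys())  # PK in A but not in B
    only_in_b = sorted(b.keys() - a.keys())  # PK in B but not in A
    return conflicts, only_in_a, only_in_b


# Hypothetical source systems holding customer names by key.
billing = {"ab-101": "Ali Khan", "AB-102": "Sara Malik", "AB-103": "Omar Shah"}
crm = {"AB101": "Ali Khan", "ab 102": "S. Malik", "AB-104": "Hina Raza"}

conflicts, only_billing, only_crm = compare_sources(billing, crm)
print(conflicts, only_billing, only_crm)
```

The sketch does not address the remaining case, the same entity stored under different keys, which needs record matching rather than key comparison.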
Automatic Data Cleansing
Statistical
Pattern Based
Clustering
Association Rules
Problems due to data duplication
False frequency distributions.
Incorrect aggregates due to double counting.
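A small illustration of both effects, assuming a made-up list of (customer, city, amount) sales records in which one record appears twice. Only exact duplicates are removed here. Near-duplicates need a matching step.

```python
from collections import Counter

# Hypothetical sales records; the C2 record is stored twice.
sales = [
    ("C1", "Lahore", 100),
    ("C2", "Karachi", 250),
    ("C2", "Karachi", 250),  # duplicate
    ("C3", "Lahore", 50),
]


def total_and_frequencies(records):
    """Return the total amount and the number of records per city."""
    total = sum(amount for _, _, amount in records)
    per_city = Counter(city for _, city, _ in records)
    return total, dict(per_city)


dirty_total, dirty_freq = total_and_frequencies(sales)

# dict.fromkeys keeps the first occurrence of each exact record, in order.
deduplicated = list(dict.fromkeys(sales))
clean_total, clean_freq = total_and_frequencies(deduplicated)

print(dirty_total, dirty_freq)
print(clean_total, clean_freq)
```

With the duplicate present, the total is inflated by 250 and Karachi appears as frequent as Lahore, which is not true of the underlying data.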
When records from several sources are merged, duplicate records will appear in the merged database.
The issue is to identify and eliminate these duplicates. The problem is known as the merge/purge problem.
Data cleansing can be broken down into six steps:
Elementizing
Standardizing
Verifying
Matching
Householding
Documenting
Basic Sorted Neighborhood (BSN) Method
Step 1: Create Keys
o Compute a key for each record in the list by extracting relevant fields or portions of fields
o Effectiveness of this method highly depends on a properly chosen key
Step 2: Sort Data
o Sort the records in the data list using the key of step 1
Step 3: Merge
o Move a fixed-size window through the sorted list of records, limiting the comparisons for
matching records to those records in the window
o If the size of the window is w records, then every new record entering the window is compared
with the previous w-1 records.
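The three steps can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the key (first three letters of the last name plus the first initial), the matching rule (same last name ignoring case, same first initial) and the sample records are all hypothetical.

```python
def make_key(record):
    """Step 1 (assumed key): first 3 letters of last name + first initial."""
    return record["last"][:3].upper() + record["first"][:1].upper()


def records_match(a, b):
    """Hypothetical matching rule: same last name (ignoring case) and same first initial."""
    return (a["last"].lower() == b["last"].lower()
            and a["first"][:1].lower() == b["first"][:1].lower())


def sorted_neighborhood(records, window):
    """Return sorted (id, id) pairs judged to be duplicates using a window of the given size."""
    ordered = sorted(records, key=make_key)  # Step 2: sort on the key
    matches = []
    for i, current in enumerate(ordered):  # Step 3: slide the window
        # The record entering the window is compared with the previous w-1 records only.
        for previous in ordered[max(0, i - (window - 1)):i]:
            if records_match(previous, current):
                matches.append(tuple(sorted((previous["id"], current["id"]))))
    return sorted(matches)


records = [
    {"id": 1, "first": "Ali", "last": "Khan"},
    {"id": 2, "first": "Sara", "last": "Malik"},
    {"id": 3, "first": "Aamir", "last": "Khalid"},
    {"id": 4, "first": "A.", "last": "khan"},
    {"id": 5, "first": "Sarah", "last": "Malik"},
]

print(sorted_neighborhood(records, window=2))
print(sorted_neighborhood(records, window=3))
```

In this sample, records 1, 3 and 4 share the same key, and record 3 sorts between records 1 and 4. A window of 2 therefore misses the pair (1, 4), while a window of 3 finds it. This shows why the result depends on both the chosen key and the window size.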