DWH Fall2010 Lecture Slides Week13
Naveed Iqbal, Assistant Professor
NUCES, Islamabad
(Lecture Slides Week # 13)
Data Duplication Elimination
& BSN Method
Data Duplication
Data Duplication: Non-Unique PK
• Multiple Customer Numbers
Name                Phone Number    Cust. No.
M. Ismail Siddiqi   021.666.1244    780701
M. Ismail Siddiqi   021.666.1244    780203
M. Ismail Siddiqi   021.666.1244    780009
• Multiple Employee Numbers
Bonus Date   Name            Department   Emp. No.
Jan. 2000    Khan Muhammad   213 (MKT)    5353536
Dec. 2001    Khan Muhammad   567 (SLS)    4577833
Mar. 2002    Khan Muhammad   349 (HR)     3457642
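A minimal sketch of how the "multiple customer numbers" case above could be detected. It assumes that name plus phone number identifies a person, which is an illustrative assumption and not a rule stated on the slide; the function name find_multiple_keys is hypothetical.

```python
from collections import defaultdict

# Rows copied from the slide: (name, phone number, customer number).
customers = [
    ("M. Ismail Siddiqi", "021.666.1244", "780701"),
    ("M. Ismail Siddiqi", "021.666.1244", "780203"),
    ("M. Ismail Siddiqi", "021.666.1244", "780009"),
]


def find_multiple_keys(rows):
    """Group customer numbers by (name, phone) and keep groups with more than one number."""
    groups = defaultdict(list)
    for name, phone, cust_no in rows:
        groups[(name, phone)].append(cust_no)
    # Assumption: the same name and phone number mean the same real-world customer.
    return {who: numbers for who, numbers in groups.items() if len(numbers) > 1}


duplicates = find_multiple_keys(customers)
for (name, phone), numbers in duplicates.items():
    print(name, phone, "has", len(numbers), "customer numbers:", numbers)
```

Exact grouping like this only catches duplicates whose identifying fields are spelled identically; records that differ slightly need an approximate method such as BSN.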
Data Duplication: House Holding
Why bother?
Data Duplication: Individualization
Overview of the Basic Concept
Basic Sorted Neighborhood (BSN) Method
BSN Method: Sliding Window
[Figure: a window sliding over the sorted list of records, showing the current window of w records and the next window of w records.]
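The sliding window can be sketched as follows. This is a minimal illustration, assuming the records are plain strings, the sort key is the lowercased record, and two records match when their string similarity reaches a threshold; the functions similar and sorted_neighborhood, the threshold of 0.8, and the sample records are all assumptions made for the example, not taken from the slides.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical matching rule: similarity ratio of the two whole records."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def sorted_neighborhood(records, key, w, match=similar):
    """Return index pairs of records that match inside a sliding window of w records."""
    # Step 1: sort record indices by the chosen key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    # Step 2: slide the window; each new record is compared with the
    # previous w - 1 records in sorted order.
    for pos in range(len(order)):
        for prev in range(max(0, pos - w + 1), pos):
            i, j = order[prev], order[pos]
            if match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Illustrative records (made up for this sketch).
records = [
    "Khan Muhammad, Islamabad",
    "Saiam Noor, Lahore",
    "Khan Muhammed, Islamabad",
    "Syed Noman, Rawalpindi",
]
candidate_pairs = sorted_neighborhood(records, key=lambda r: r.lower(), w=2)
print(candidate_pairs)
```

With w = 2 only neighbours in the sorted order are compared, so the number of comparisons grows with n * w instead of n * n. The price is that duplicates sorted further apart than w records are missed, which is why the choice of key matters.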
BSN Method: Selection of Keys
• Selection of Keys
• Effectiveness highly dependent on the key selected to sort the records, e.g. middle name vs. family name
• A key is a sequence of a subset of attributes or sub-strings within the attributes chosen from the record
• The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other
Technology
Tech.
Techno.
Tchnlgy
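A key built from sub-strings of several attributes might look like the sketch below. The choice of attributes (first three letters of the family name, first letter of the given name, last four digits of the phone number) is purely an assumption for illustration, as are the field names and the sample records.

```python
def make_key(record):
    """Build a sort key from sub-strings of chosen attributes (hypothetical choice)."""
    family = record["family_name"].upper()[:3]
    given = record["given_name"].upper()[:1]
    # Keep digits only, so "021.666.1244" and "021-666-1244" yield the same suffix.
    digits = "".join(ch for ch in record["phone"] if ch.isdigit())[-4:]
    return family + given + digits


# Illustrative records (made up for this sketch).
people = [
    {"given_name": "Ismail", "family_name": "Siddiqi", "phone": "021.666.1244"},
    {"given_name": "Khan", "family_name": "Muhammad", "phone": "051.111.2222"},
    {"given_name": "Ismael", "family_name": "Siddiqui", "phone": "021-666-1244"},
]
keys = [make_key(p) for p in people]
ordered = sorted(people, key=make_key)
print(keys)
```

Here the two spellings of the same person produce the same key and therefore sort next to each other. A key that started with a field prone to variation, such as an abbreviated word, would scatter them instead.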
BSN Method: Problem with keys
If contents of fields are not properly ordered, similar records will NOT
fall in the same window.
Example: Records 1 and 2 are similar but will occur far apart.
No   Name              Address                                       Gender
1    N. Jaffri, Syed   No. 420, Street 15, Chaklala 4, Rawalpindi    M
2    S. Noman          420, Scheme 4, Rwp                            M
3    Saiam Noor        Flat 5, Afshan Colony, Saidpur Road, Lahore   F
Solution is to TOKENize the fields, i.e. break them down further. Use the tokens
in different fields for sorting to fix the error.
Example: Using either the name or the address field, records 1 and 2 will
fall close together.
No   Name            Address                                    Gender
1    Syed N Jaffri   420 15 4 Chaklala No Rawalpindi Street     M
2    Syed Noman      420 4 Rwp Scheme                           M
3    Saiam Noor      5 Afshan Colony Flat Lahore Road Saidpur   F
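The tokenized address column above can be reproduced with a short sketch. The ordering rule used here (numeric tokens first in their original order, then the remaining words sorted alphabetically) is inferred from the example table and is an assumption, not a rule stated on the slide; tokenize_address is a hypothetical name. The name column is not handled, since its reordering does not follow the same rule.

```python
import re


def tokenize_address(address):
    """Break an address into tokens: numbers first, then words in alphabetical order."""
    tokens = re.findall(r"[A-Za-z0-9]+", address)  # drops commas and full stops
    numbers = [t for t in tokens if t.isdigit()]
    words = sorted((t for t in tokens if not t.isdigit()), key=str.lower)
    return " ".join(numbers + words)


# Address values copied from the first table on this slide.
addresses = [
    "No. 420, Street 15, Chaklala 4, Rawalpindi",
    "420, Scheme 4, Rwp",
    "Flat 5, Afshan Colony, Saidpur Road, Lahore",
]
for a in addresses:
    print(tokenize_address(a))
```

After tokenization the addresses of records 1 and 2 both start with 420, so sorting on the address field places them next to each other.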
BSN Method: Matching Candidates
Introduction to Data Quality
Management (DQM)
What is Quality?
• Informally
  • Some things are better than others, i.e. they are of higher quality. How much “better” is better?
  • Is the right item the best item to purchase? How about after the purchase?
  • What is quality of service? The bank example
• Formally
  • “Quality is conformance to requirements” / “Degree of excellence”
• Example:
  • Quality means meeting customer’s needs, not necessarily exceeding them.
  • Quality means improving things customers care about, because that makes their lives easier and more comfortable.
What is Data Quality?
What is Data?
Height = 5’8”
Weight = 160 lbs
Emp_ID = 440
Gender = Male
Age = 35 yrs
Muhammad Khan
All data is an abstraction of something real.
Intrinsic Data Quality
Electronic reproduction of reality.
Data Quality & Organizations
• The Dysfunctional Learning Organization:
• Low-quality data is a proprietary resource with cost-adding processes.
Orr’s Laws of Data Quality
• Philosophy of involving all concepts for systematic and continuous improvement.
• It is customer oriented. Why?
• TQM incorporates the concept of product quality, process control, quality assurance, and quality improvement.
Cost of Fixing Data Quality
[Figure: cost of achieving quality, showing an exponential rise in cost.]
• Controllable Costs
  • Recurring costs for analyzing, correcting, and preventing data errors
• Resultant Costs
  • Internal and external failure costs of business / opportunities missed
• Equipment & Training Costs
Characteristics or Dimensions of Data Quality
Data Quality Characteristic   Definition
Accuracy                      Qualitatively assessing lack of error, high accuracy corresponding to small error.
Completeness                  The degree to which values are present in the attributes that require them.
Consistency                   A measure of the degree to which a set of data satisfies a set of constraints.
Timeliness                    A measure of how current or up-to-date the data is.
Uniqueness                    The state of being only one of its kind or being without an equal or parallel.
Interpretability              The extent to which data is in appropriate languages, symbols, and units, and the definitions are clear.
Accessibility                 The extent to which data is available, or easily and quickly retrievable.
Objectivity                   The extent to which data is unbiased, unprejudiced, and impartial.
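Three of these characteristics lend themselves to simple ratios. The sketch below is one possible way to score completeness, uniqueness, and consistency on a small table; the formulas, the function names, the sample rows, and the age constraint are assumptions chosen for illustration, not definitions from the slides.

```python
def completeness(rows, required):
    """Fraction of required attribute values that are present (not None or empty)."""
    total = len(rows) * len(required)
    present = sum(1 for r in rows for a in required if r.get(a) not in (None, ""))
    return present / total if total else 1.0


def uniqueness(rows, key):
    """Fraction of rows whose key value is distinct."""
    values = [r[key] for r in rows]
    return len(set(values)) / len(values) if values else 1.0


def consistency(rows, constraints):
    """Fraction of rows that satisfy every constraint."""
    ok = sum(1 for r in rows if all(check(r) for check in constraints))
    return ok / len(rows) if rows else 1.0


# Illustrative rows (made up for this sketch).
employees = [
    {"emp_id": 440, "gender": "Male", "age": 35},
    {"emp_id": 441, "gender": "", "age": 29},
    {"emp_id": 440, "gender": "Female", "age": -3},
    {"emp_id": 442, "gender": "Male", "age": None},
]

# Hypothetical constraint: a known age must lie between 0 and 120.
age_in_range = lambda r: r["age"] is None or 0 <= r["age"] <= 120

scores = {
    "completeness": completeness(employees, ["emp_id", "gender", "age"]),
    "uniqueness": uniqueness(employees, "emp_id"),
    "consistency": consistency(employees, [age_in_range]),
}
print(scores)
```

Accuracy is left out on purpose: measuring it needs a trusted reference to compare against, which a table cannot supply about itself.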
Completeness vs. Accuracy
Which is better?
Depends on the (i) data quality tolerances, the (ii) corresponding application,
and the (iii) cost of achieving that data quality vs. the (iv) business value.
Data Quality Management Process
Establish TDQM Environment
Data Quality Management Process
• Development Professionals
How to improve Data Quality?
• System
• Data Design
Quality Management Maturity Grid
CMM Level-1: Uncertainty
CMM Level-2: Awakening
CMM Level-3: Enlightenment
CMM Level-4: Wisdom
CMM Level-5: Certainty
Misconceptions on Data Quality
• Data Quality is an IT Problem
  • It is the company's problem.
  • Define the metrics of quality.
  • Business has to strike a balance between quality and ROI.
  • Joint business and IT effort.
Misconceptions on Data Quality
• (All) Problem is in the Data Sources or Data Entry
  • NOT the only problem.
  • Systems could be responsible, but actually it is the metrics.
  • Two divisions using different codes for same entity.
  • Need to track, trace, check data from creation to usage.