0% found this document useful (0 votes)
83 views32 pages

DWH Fall2010 Lecture Slides Week13

The document discusses data duplication in data warehousing, explaining that data from heterogeneous sources can contain duplicated records which can lead to errors in analysis. It describes how records can be duplicated due to non-unique primary keys or differences in names and addresses. Methods for eliminating data duplication include house holding, individualization, and the basic sorted neighborhood method which sorts and compares records within a sliding window to identify matching records.

Uploaded by

Ammarsyed
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
83 views32 pages

DWH Fall2010 Lecture Slides Week13

The document discusses data duplication in data warehousing, explaining that data from heterogeneous sources can contain duplicated records which can lead to errors in analysis. It describes how records can be duplicated due to non-unique primary keys or differences in names and addresses. Methods for eliminating data duplication include house holding, individualization, and the basic sorted neighborhood method which sorts and compares records within a sliding window to identify matching records.

Uploaded by

Ammarsyed
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 32

Data Warehousing

Naveed Iqbal
Iqbal, Assistant Professor
NUCES, Islamabad
(Lecture Slides Week # 13)
Data Duplication Elimination
& BSN Method
Data Duplication

z Why data duplicated?


z A data warehouse is created from heterogeneous sources, with
heterogeneous databases (different schema / representation)
of the same entity.
z The data coming from outside the organization owning the
DWH, can have even lower quality data i.e. different
representation for same entity, transcription or typographical
errors.
z Problems due to data duplication
z Data duplication,
p , can result in costly
y errors,, such as:
z False frequency distributions
z Incorrect aggregates due to double counting
z Difficultyy with catching
g fabricated identities byy credit card
companies.

3
Data Duplication: Non-Unique PK
• Multiple Customer Numbers
N
Name Ph
Phone Number
N b C t No.
Cust. N
M. Ismail Siddiqi 021.666.1244 780701
M. Ismail Siddiqi 021.666.1244 780203
M. Ismail Siddiqi 021.666.1244 780009
• Multiple Employee Numbers
Bonus Date Name Department
p Emp.
p No.
Jan. 2000 Khan Muhammad 213 (MKT) 5353536
Dec. 2001 Khan Muhammad 567 (SLS) 4577833
Mar. 2002 Khan Muhammad 349 ((HR)) 3457642

Unable to determine customer relationships (CRM)


Unable to analyze
y employee
p y benefits trends

4
Data Duplication: House Holding

z Group together all records that belong to the same


h
household.
h ld
……… S. Ahad 440, Munir Road, Lahore

……… ………….… ………………………………

……… Shiekh Ahad No. 440, Munir Rd, Lhr

……… ………….… ………………………………

……… Shiekh Ahed House # 440, Munir Road, Lahore

Why bother ?

5
Data Duplication: Individualization

z Identify multiple records in each household which


representt the
th same individual
i di id l

……… M. Ahad 440, Munir Road, Lahore

……… ………….… ………………………………

……… Maj Ahad 440 Munir Road


440, Road, Lahore

Address field is standardized.


By coincidence ??

6
Overview of the Basic Concept

z In its simplest form, there is an identifying attribute (or


combination)
bi ti ) per record d for
f identification.
id tifi ti

z Records can be from single source or multiple sources


sharing same PK or other common unique attributes.

z Sorting performed on identifying attributes and


neighboring records checked.

z What if no common attributes or dirty data?


z The degree of similarity measured numerically, different
attributes may contribute differently.

7
Basic Sorted Neighborhood (BSN) Method

z Concatenate data into one sequential list of N records


z Steps 1: Create Keys
z Compute a key for each record in the list by extracting
relevant fields or portions of fields
z Effectiveness of this method highly depends on a properly
chosen key
z Step 2: Sort Data
z Sort the records in the data list using the key of step 1
z Step 3: Merge
z Move a fixed size window through the sequential list of
records limitingg the comparisons
p for matching
g records to
those records in the window
z If the size of the window is w records then every new
record entering the window is compared with the previous
w-11 records.
d

8
BSN Method : Sliding Window
.
.
.

Current window
w
of records
Next window
w
of records

.
.
.
9
BSN Method: Selection of Keys

z Selection of Keys
z Effectiveness highly dependent on the key selected to sort the records
e.g. middle name vs. family name
z A key is a sequence of a subset of attributes or sub-strings within the
attributes chosen from the record
z The keys are used for sorting the entire dataset with the intention that
matched candidates will appear close to each other

First Middle Address NID Key


Muhammed Ahmad 440 Munir Road 34535322 AHM440MUN345
Muhammad Ahmad 440 Munir Road 34535322 AHM440MUN345
Muhammed Ahmed 440 Munir Road 34535322 AHM440MUN345
Muhammad Ahmar 440 Munawar Road 34535334 AHM440MUN345
10
BSN Method: Problem with keys

Since data is dirty, so keys WILL also be dirty, and


matching records will not come together.

Data becomes dirty due to data entry errors or use of


abbreviations. Some real examples are as follows:

Technology
T h
Tech.
Techno.
Tchnlgy

Solution is to use external standard source files to validate the


data and resolve any data conflicts.

11
BSN Method: Problem with keys

If contents of fields are not properly ordered, similar records will NOT
fall in the same window.
Example: Records 1 and 2 are similar but will occur far apart.
No Name Address Gender
1 N Jaffri
N. Jaffri, Syed No 420
No. 420, Street 15
15, Chaklala 4,
4 Rawalpindi M
2 S. Noman 420, Scheme 4, Rwp M
3 Saiam Noor Flat 5, Afshan Colony, Saidpur Road, Lahore F
Solution is to TOKENize the fields i.e. break them further. Use the tokens
in different fields for sorting to fix the error.
Example: Either using the name or the address field records 1 and 2 will
fall close.
close
No Name Address Gender
1 Syed N Jaffri 420 15 4 Chaklala No Rawalpindi Street M
2 Syed Noman 420 4 Rwp Scheme M
3 Saiam Noor 5 Afshan Colony Flat Lahore Road Saidpur F
12
BSN Method: Matching Candidates

Merging of records is a complex inferential process.


Example--1: Two persons with names spelled nearly but not
Example
identically, have the exact same address. We infer they are same
person i.e. Noma Abdullah and Noman Abdullah.
Example--2: Two persons have same National ID numbers but
Example
names and addresses are completely different. We infer same
person who changed his name and moved or the records
represent different persons and NID is incorrect for one of them.
them
Use of further information such as age, gender etc
etc.. can alter the
decision..
decision
Example
Example-
p -3: Noma
Noma-F and Noman
Noman-M we could p perhaps p infer that
Noma and Noman are siblings i.e. brothers and sisters. Noma
Noma-30
and Noman-5
Noman i.e. mother and son.

13
Introduction to Data Quality
Management (DQM)
What is Quality?

z Informally
z Some
S thi
things are better
b tt than
th others
th i
i.e. th
they are off higher
hi h
quality. How much “better” is better?
z Is the right item the best item to purchase? How about after the
purchase?
h ?
z What is quality of service? The bank example

z Formally
z “Quality is conformance to requirements” / “Degree of
excellence”
z Example:
p
z Quality means meeting customer’s needs, not necessarily exceeding
them.
z Quality means improving things customers care about, because that
makes their lives easier and more comfortable.

15
What is Data Quality?

What is Data?

Height = 5’8”
Weight = 160 lbs
Emp ID = 440
Emp_ID
Gender = Male
Age = 35 yrs
Muhammad Khan

All data
d t is
i an abstraction
b t ti off something
thi real.
l
Intrinsic Data Quality
El t i reproduction
Electronic d ti off reality.
lit

Realistic Data Quality


D
Degree off utility
tilit or value
l off d
data
t tto business.
b i

16
Data Qualityy & Organizations
g

z Intelligent Learning Organization:


z High-quality data is an open, shared
resource with value
value-adding
adding processes.
processes

z Th Dysfunctional
The D f ti l Learning
L i Organization:
O i ti
z Low-quality data is a proprietary resource
with
ith cost-adding
t ddi processes.

17
Orr’s Laws of Data Quality

Law #1 - “Data that is not used cannot be correct!”

Law #2 - “Data quality is a function of its use, not its


collection!”

Law #3 - “Data will be no better than its most stringent


use!
use!”

Law #4 - “Data quality problems increase with the age of


th system!”
the t !”

Law #5 – “The less likely something is to occur, the more


traumatic it will be when it happens!”
18
Total Quality Control / Management (TQM)

z Philosophy
p y of involving g all concepts
p for
systematic and continuous improvement.

z It is customer oriented.
oriented Why?

z TQM incorporates
p the concept p of p product
quality, process control, quality assurance, and
quality improvement.

z Quality assurance is NOT Quality improvement.

19
Cost of Fixing Data Quality

g quality
Cost of achieving

Exponential rise
in cost

Lowest Quality Highest quality

„ Defect minimization is economical.


„ Defect
D f t elimination
li i ti isi very very expensive.
i
20
Cost of Data Quality Defects

zControllable Costs
z Recurring costs for analyzing, correcting, and
preventing data errors

zResultant Costs
z Internal and external failure costs of business /
opportunities missed

zEquipment
E i t & Training
T i i Costs
C t

21
Characteristics or Dimensions of Data Quality

Data Quality
Definition
Characteristic
Accuracy Qualitatively assessing lack of error, high accuracy corresponding
to small error.
Completeness The degree to which values are present in the attributes that require
them.
th
Consistency A measure of the degree to which a set of data satisfies a set of
constraints.
Timeliness A measure of how current or up
up-to-date
to date the data is.
is
Uniqueness The state of being only one of its kind or being without an equal or
parallel.
Interpretability
e p e b y Thee eextent
e toow
which
c d data iss in appropriate
pp op e languages,
gu ges, sy
symbols,
bo s, and
d
units, and the definitions are clear.
Accessibility The extent to which data is available, or easily and quickly
retrievable
Objectivity The extent to which data is unbiased, unprejudiced, and impartial

22
Completeness vs. Accuracy

95% accurate and 100% complete


OR
100% accurate and 95% complete

Which is better?
Depends on data quality ((ii) tolerances,
the (ii) corresponding application and the (iii) cost
of achieving that data quality vs
vs. the (iv) business
value.

23
Data Quality Management Process

Establish TDQM
Environment

Evaluate Data Quality Scope Data Quality Projects &


Management Methods Develop Implementation Plans

Implement Data Quality Projects


(Define, Measure, Analyze, Improve)

24
Data Quality Management Process

z Establish Data Quality


y Management
g
Environment
• Information System Project Managers

• Development Professionals

• Functional users of legacy


g y information
systems with domain knowledge
• IS developers know solutions but don’t
know how and where to modify

25
Data Quality
y Management
g Process

z Scope Data Quality Projects & Develop


Implementation Plans
• Task Summary: Project goals, scope, and potential
benefits
• Task Description: Describe data quality analysis tasks
• Project Approach: Summarize tasks and tools used to
provide a baseline of existing data quality
• Schedule: Identify task start, completion dates, and project
milestones
• R
Resources: I l d costs
Include t connected
t d with
ith tools
t l acquisition,
i iti
labor hours (by labor category), training, travel, and other
direct and indirect costs

26
Data Quality Management Process

z Implement Data Quality Projects (Define,


Measure, Analyze, Improve)
• Plan / Define: Identify functional user DQ requirements
and establish DQ metrics
• Do / Measure: Conformance to current business rules and
develop exception reports
• Check / Analyze: Verify,
Verify validate,
validate and assess poor DQ
causes. Define improvement opportunities
• Act / Improve: Select/prioritize DQ improvement
opportunities i.e.
i e data entry procedures,
procedures updating data
validation rules, and/or company data standards.

27
Data Quality Management Process

z Evaluate Data Quality


y Management
g
Methods
• Modifying existing methods of DQ management

• Determining if DQ projects have helped to


achieve demonstrable goals and benefits?

• Evaluating and assessing DQ work as, it is not a


program, but a new way of doing business

28
How to improve Data Quality?

z The four categories of Data Quality


Improvement
z Process

z System

z Policy & Procedure

z Data Design

29
Quality Management Maturity Grid

CMM Level-1
Uncertainty

CMM Level-2
Awakening

CMM Level-3
Enlightenment

CMM Level-4
Wisdom

CMM Level-5
Certainity

30
Misconceptions on Data Quality

z You Can Fix Data


z Problem NOT in data, but how it was used.
z It is NOT a one time process.
z Buying a cleansing tool is NOT the solution.
solution
z Some live with the problem, cant afford the tool.

z D t Quality
Data Q lit is
i an IT Problem
P bl
z It is the company problem.
z Define the metrics of quality.
z Business has to strike a balance between quality
and ROI.
z J i t business
Joint b i andd IT effort.
ff t
31
Misconceptions on Data Quality
z (All) Problem is in the Data Sources or Data Entry
z NOT the
th only
l problem.
bl
z Systems could be responsible, but actually it is the metrics.
z Two divisions using different codes for same entity.
z N d to
Need t track,
t k trace,
t check
h k data
d t from
f creation
ti to t usage.

z The Data Warehouse will provide a single source of


truth
z In ideal world it is indeed true.
z In real world may be multiple data warehouses, data marts,
external sources i.e. silos of data resulting in multiple sources
of “truth”.
z Even with single source of truth, if transformations and
interpretations are different, an issue.

32

You might also like