0% found this document useful (0 votes)

83 views32 pages

DWH Fall2010 Lecture Slides Week13

The document discusses data duplication in data warehousing, explaining that data from heterogeneous sources can contain duplicated records which can lead to errors in analysis. It describes how records can be duplicated due to non-unique primary keys or differences in names and addresses. Methods for eliminating data duplication include house holding, individualization, and the basic sorted neighborhood method which sorts and compares records within a sliding window to identify matching records.

Uploaded by

Ammarsyed

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

83 views32 pages

DWH Fall2010 Lecture Slides Week13

Uploaded by

Ammarsyed

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 32

Data Warehousing

Naveed Iqbal
Iqbal, Assistant Professor
NUCES, Islamabad
(Lecture Slides Week # 13)
Data Duplication Elimination
& BSN Method
Data Duplication

z Why data duplicated?

z A data warehouse is created from heterogeneous sources, with
heterogeneous databases (different schema / representation)
of the same entity.
z The data coming from outside the organization owning the
DWH, can have even lower quality data i.e. different
representation for same entity, transcription or typographical
errors.
z Problems due to data duplication
z Data duplication,
p , can result in costly
y errors,, such as:
z False frequency distributions
z Incorrect aggregates due to double counting
z Difficultyy with catching
g fabricated identities byy credit card
companies.

3
Data Duplication: Non-Unique PK
• Multiple Customer Numbers
N
Name Ph
Phone Number
N b C t No.
Cust. N
M. Ismail Siddiqi 021.666.1244 780701
M. Ismail Siddiqi 021.666.1244 780203
M. Ismail Siddiqi 021.666.1244 780009
• Multiple Employee Numbers
Bonus Date Name Department
p Emp.
p No.
Jan. 2000 Khan Muhammad 213 (MKT) 5353536
Dec. 2001 Khan Muhammad 567 (SLS) 4577833
Mar. 2002 Khan Muhammad 349 ((HR)) 3457642

Unable to determine customer relationships (CRM)

Unable to analyze
y employee
p y benefits trends

4
Data Duplication: House Holding

z Group together all records that belong to the same

h
household.
h ld
……… S. Ahad 440, Munir Road, Lahore

……… ………….… ………………………………

……… Shiekh Ahad No. 440, Munir Rd, Lhr

……… ………….… ………………………………

……… Shiekh Ahed House # 440, Munir Road, Lahore

Why bother ?

5
Data Duplication: Individualization

z Identify multiple records in each household which

representt the
th same individual
i di id l

……… M. Ahad 440, Munir Road, Lahore

……… ………….… ………………………………

……… Maj Ahad 440 Munir Road

440, Road, Lahore

Address field is standardized.

By coincidence ??

6
Overview of the Basic Concept

z In its simplest form, there is an identifying attribute (or

combination)
bi ti ) per record d for
f identification.
id tifi ti

z Records can be from single source or multiple sources

sharing same PK or other common unique attributes.

z Sorting performed on identifying attributes and

neighboring records checked.

z What if no common attributes or dirty data?

z The degree of similarity measured numerically, different
attributes may contribute differently.

7
Basic Sorted Neighborhood (BSN) Method

z Concatenate data into one sequential list of N records

z Steps 1: Create Keys
z Compute a key for each record in the list by extracting
relevant fields or portions of fields
z Effectiveness of this method highly depends on a properly
chosen key
z Step 2: Sort Data
z Sort the records in the data list using the key of step 1
z Step 3: Merge
z Move a fixed size window through the sequential list of
records limitingg the comparisons
p for matching
g records to
those records in the window
z If the size of the window is w records then every new
record entering the window is compared with the previous
w-11 records.
d

8
BSN Method : Sliding Window
.
.
.

Current window
w
of records
Next window
w
of records

.
.
.
9
BSN Method: Selection of Keys

z Selection of Keys
z Effectiveness highly dependent on the key selected to sort the records
e.g. middle name vs. family name
z A key is a sequence of a subset of attributes or sub-strings within the
attributes chosen from the record
z The keys are used for sorting the entire dataset with the intention that
matched candidates will appear close to each other

First Middle Address NID Key

Muhammed Ahmad 440 Munir Road 34535322 AHM440MUN345
Muhammad Ahmad 440 Munir Road 34535322 AHM440MUN345
Muhammed Ahmed 440 Munir Road 34535322 AHM440MUN345
Muhammad Ahmar 440 Munawar Road 34535334 AHM440MUN345
10
BSN Method: Problem with keys

Since data is dirty, so keys WILL also be dirty, and

matching records will not come together.

Data becomes dirty due to data entry errors or use of

abbreviations. Some real examples are as follows:

Technology
T h
Tech.
Techno.
Tchnlgy

Solution is to use external standard source files to validate the

data and resolve any data conflicts.

11
BSN Method: Problem with keys

If contents of fields are not properly ordered, similar records will NOT
fall in the same window.
Example: Records 1 and 2 are similar but will occur far apart.
No Name Address Gender
1 N Jaffri
N. Jaffri, Syed No 420
No. 420, Street 15
15, Chaklala 4,
4 Rawalpindi M
2 S. Noman 420, Scheme 4, Rwp M
3 Saiam Noor Flat 5, Afshan Colony, Saidpur Road, Lahore F
Solution is to TOKENize the fields i.e. break them further. Use the tokens
in different fields for sorting to fix the error.
Example: Either using the name or the address field records 1 and 2 will
fall close.
close
No Name Address Gender
1 Syed N Jaffri 420 15 4 Chaklala No Rawalpindi Street M
2 Syed Noman 420 4 Rwp Scheme M
3 Saiam Noor 5 Afshan Colony Flat Lahore Road Saidpur F
12
BSN Method: Matching Candidates

Merging of records is a complex inferential process.

Example--1: Two persons with names spelled nearly but not
Example
identically, have the exact same address. We infer they are same
person i.e. Noma Abdullah and Noman Abdullah.
Example--2: Two persons have same National ID numbers but
Example
names and addresses are completely different. We infer same
person who changed his name and moved or the records
represent different persons and NID is incorrect for one of them.
them
Use of further information such as age, gender etc
etc.. can alter the
decision..
decision
Example
Example-
p -3: Noma
Noma-F and Noman
Noman-M we could p perhaps p infer that
Noma and Noman are siblings i.e. brothers and sisters. Noma
Noma-30
and Noman-5
Noman i.e. mother and son.

13
Introduction to Data Quality
Management (DQM)
What is Quality?

z Informally
z Some
S thi
things are better
b tt than
th others
th i
i.e. th
they are off higher
hi h
quality. How much “better” is better?
z Is the right item the best item to purchase? How about after the
purchase?
h ?
z What is quality of service? The bank example

z Formally
z “Quality is conformance to requirements” / “Degree of
excellence”
z Example:
p
z Quality means meeting customer’s needs, not necessarily exceeding
them.
z Quality means improving things customers care about, because that
makes their lives easier and more comfortable.

15
What is Data Quality?

What is Data?

Height = 5’8”
Weight = 160 lbs
Emp ID = 440
Emp_ID
Gender = Male
Age = 35 yrs
Muhammad Khan

All data
d t is
i an abstraction
b t ti off something
thi real.
l
Intrinsic Data Quality
El t i reproduction
Electronic d ti off reality.
lit

Realistic Data Quality

D
Degree off utility
tilit or value
l off d
data
t tto business.
b i

16
Data Qualityy & Organizations
g

z Intelligent Learning Organization:

z High-quality data is an open, shared
resource with value
value-adding
adding processes.
processes

z Th Dysfunctional
The D f ti l Learning
L i Organization:
O i ti
z Low-quality data is a proprietary resource
with
ith cost-adding
t ddi processes.

17
Orr’s Laws of Data Quality

Law #1 - “Data that is not used cannot be correct!”

Law #2 - “Data quality is a function of its use, not its

collection!”

Law #3 - “Data will be no better than its most stringent

use!
use!”

Law #4 - “Data quality problems increase with the age of

th system!”
the t !”

Law #5 – “The less likely something is to occur, the more

traumatic it will be when it happens!”
18
Total Quality Control / Management (TQM)

z Philosophy
p y of involving g all concepts
p for
systematic and continuous improvement.

z It is customer oriented.
oriented Why?

z TQM incorporates
p the concept p of p product
quality, process control, quality assurance, and
quality improvement.

z Quality assurance is NOT Quality improvement.

19
Cost of Fixing Data Quality

g quality
Cost of achieving

Exponential rise
in cost

Lowest Quality Highest quality

Defect minimization is economical.

Defect
D f t elimination
li i ti isi very very expensive.
i
20
Cost of Data Quality Defects

zControllable Costs
z Recurring costs for analyzing, correcting, and
preventing data errors

zResultant Costs
z Internal and external failure costs of business /
opportunities missed

zEquipment
E i t & Training
T i i Costs
C t

21
Characteristics or Dimensions of Data Quality

Data Quality
Definition
Characteristic
Accuracy Qualitatively assessing lack of error, high accuracy corresponding
to small error.
Completeness The degree to which values are present in the attributes that require
them.
th
Consistency A measure of the degree to which a set of data satisfies a set of
constraints.
Timeliness A measure of how current or up
up-to-date
to date the data is.
is
Uniqueness The state of being only one of its kind or being without an equal or
parallel.
Interpretability
e p e b y Thee eextent
e toow
which
c d data iss in appropriate
pp op e languages,
gu ges, sy
symbols,
bo s, and
d
units, and the definitions are clear.
Accessibility The extent to which data is available, or easily and quickly
retrievable
Objectivity The extent to which data is unbiased, unprejudiced, and impartial

22
Completeness vs. Accuracy

95% accurate and 100% complete

OR
100% accurate and 95% complete

Which is better?
Depends on data quality ((ii) tolerances,
the (ii) corresponding application and the (iii) cost
of achieving that data quality vs
vs. the (iv) business
value.

23
Data Quality Management Process

Establish TDQM
Environment

Evaluate Data Quality Scope Data Quality Projects &

Management Methods Develop Implementation Plans

Implement Data Quality Projects

(Define, Measure, Analyze, Improve)

24
Data Quality Management Process

z Establish Data Quality

y Management
g
Environment
• Information System Project Managers

• Development Professionals

• Functional users of legacy

g y information
systems with domain knowledge
• IS developers know solutions but don’t
know how and where to modify

25
Data Quality
y Management
g Process

z Scope Data Quality Projects & Develop

Implementation Plans
• Task Summary: Project goals, scope, and potential
benefits
• Task Description: Describe data quality analysis tasks
• Project Approach: Summarize tasks and tools used to
provide a baseline of existing data quality
• Schedule: Identify task start, completion dates, and project
milestones
• R
Resources: I l d costs
Include t connected
t d with
ith tools
t l acquisition,
i iti
labor hours (by labor category), training, travel, and other
direct and indirect costs

26
Data Quality Management Process

z Implement Data Quality Projects (Define,

Measure, Analyze, Improve)
• Plan / Define: Identify functional user DQ requirements
and establish DQ metrics
• Do / Measure: Conformance to current business rules and
develop exception reports
• Check / Analyze: Verify,
Verify validate,
validate and assess poor DQ
causes. Define improvement opportunities
• Act / Improve: Select/prioritize DQ improvement
opportunities i.e.
i e data entry procedures,
procedures updating data
validation rules, and/or company data standards.

27
Data Quality Management Process

z Evaluate Data Quality

y Management
g
Methods
• Modifying existing methods of DQ management

• Determining if DQ projects have helped to

achieve demonstrable goals and benefits?

• Evaluating and assessing DQ work as, it is not a

program, but a new way of doing business

28
How to improve Data Quality?

z The four categories of Data Quality

Improvement
z Process

z System

z Policy & Procedure

z Data Design

29
Quality Management Maturity Grid

CMM Level-1
Uncertainty

CMM Level-2
Awakening

CMM Level-3
Enlightenment

CMM Level-4
Wisdom

CMM Level-5
Certainity

30
Misconceptions on Data Quality

z You Can Fix Data

z Problem NOT in data, but how it was used.
z It is NOT a one time process.
z Buying a cleansing tool is NOT the solution.
solution
z Some live with the problem, cant afford the tool.

z D t Quality
Data Q lit is
i an IT Problem
P bl
z It is the company problem.
z Define the metrics of quality.
z Business has to strike a balance between quality
and ROI.
z J i t business
Joint b i andd IT effort.
ff t
31
Misconceptions on Data Quality
z (All) Problem is in the Data Sources or Data Entry
z NOT the
th only
l problem.
bl
z Systems could be responsible, but actually it is the metrics.
z Two divisions using different codes for same entity.
z N d to
Need t track,
t k trace,
t check
h k data
d t from
f creation
ti to t usage.

z The Data Warehouse will provide a single source of

truth
z In ideal world it is indeed true.
z In real world may be multiple data warehouses, data marts,
external sources i.e. silos of data resulting in multiple sources
of “truth”.
z Even with single source of truth, if transformations and
interpretations are different, an issue.

БЖБ, ТЖЮ талдау
0% (1)
БЖБ, ТЖЮ талдау
28 pages
Data Quality - Information Quality For Northwind
No ratings yet
Data Quality - Information Quality For Northwind
18 pages
Data Warehousing & Data Mining PDF
100% (6)
Data Warehousing & Data Mining PDF
143 pages
Applied Math Unit1 Summary and Useful Formulas
100% (1)
Applied Math Unit1 Summary and Useful Formulas
4 pages
Real-World Data Is Dirty: Data Cleansing and The Merge/Purge Problem
No ratings yet
Real-World Data Is Dirty: Data Cleansing and The Merge/Purge Problem
49 pages
Cs614
No ratings yet
Cs614
12 pages
A New Efficient Data Cleansing Method
No ratings yet
A New Efficient Data Cleansing Method
11 pages
Cs614 (1)
No ratings yet
Cs614 (1)
10 pages
Introduction To Data Cleaning
No ratings yet
Introduction To Data Cleaning
36 pages
Data Preprocessing Steps 2
No ratings yet
Data Preprocessing Steps 2
26 pages
unit -1 (b) DWM.docx
No ratings yet
unit -1 (b) DWM.docx
26 pages
DP
No ratings yet
DP
44 pages
Data Warehousing - Data Mining CSE - IT (4th Year) Engineering Lecture Notes, Ebook PDF Download
No ratings yet
Data Warehousing - Data Mining CSE - IT (4th Year) Engineering Lecture Notes, Ebook PDF Download
146 pages
Data Science - Module 1.3
No ratings yet
Data Science - Module 1.3
34 pages
Resume 1
100% (1)
Resume 1
106 pages
Intro To Duplicate Detection
No ratings yet
Intro To Duplicate Detection
87 pages
Data Preprocessing, Data Warehousing
No ratings yet
Data Preprocessing, Data Warehousing
9 pages
5 Data Cleaning
No ratings yet
5 Data Cleaning
36 pages
Data Cleaning: Information Integration
No ratings yet
Data Cleaning: Information Integration
42 pages
Lecture 09 DM
No ratings yet
Lecture 09 DM
14 pages
Introduction To Data Mining With Case Studies - Sample Index
0% (1)
Introduction To Data Mining With Case Studies - Sample Index
16 pages
Ilyas FN TDB2015
No ratings yet
Ilyas FN TDB2015
115 pages
BA_CH02
No ratings yet
BA_CH02
23 pages
772s Data - Mining.concepts - And.techniques.2nd - Ed
No ratings yet
772s Data - Mining.concepts - And.techniques.2nd - Ed
239 pages
Introduction to Business Intelligence 2
No ratings yet
Introduction to Business Intelligence 2
23 pages
Unit I
No ratings yet
Unit I
57 pages
Unit-3 Data Preprocessing
100% (1)
Unit-3 Data Preprocessing
7 pages
2 Data Pre-Processing
No ratings yet
2 Data Pre-Processing
50 pages
TTDS Lecture 2
No ratings yet
TTDS Lecture 2
40 pages
Data Mining Mid Syllabus
No ratings yet
Data Mining Mid Syllabus
185 pages
Week 4 - 5 - Data Preprocessing
No ratings yet
Week 4 - 5 - Data Preprocessing
67 pages
DMDW_ Preprocessing L-6,7
No ratings yet
DMDW_ Preprocessing L-6,7
16 pages
DM HarshQuesAns
No ratings yet
DM HarshQuesAns
183 pages
Data Preprocessing
No ratings yet
Data Preprocessing
77 pages
The Ultimate Guide To Data Cleaning
No ratings yet
The Ultimate Guide To Data Cleaning
18 pages
PPT 2
No ratings yet
PPT 2
51 pages
Data Mining
No ratings yet
Data Mining
40 pages
Data Warehousing and Mining
No ratings yet
Data Warehousing and Mining
56 pages
Unit-2
No ratings yet
Unit-2
144 pages
Correlation
No ratings yet
Correlation
14 pages
dwdm
No ratings yet
dwdm
11 pages
Data Mining Mid Syllabus
No ratings yet
Data Mining Mid Syllabus
162 pages
CS614 FinalTerm Solved Papers
No ratings yet
CS614 FinalTerm Solved Papers
24 pages
Data Mining
No ratings yet
Data Mining
4 pages
IDQ Functionality Imp
No ratings yet
IDQ Functionality Imp
7 pages
21css303t Datascience Unit 1 Notes (1)
No ratings yet
21css303t Datascience Unit 1 Notes (1)
246 pages
TTDS Lecture 2
No ratings yet
TTDS Lecture 2
40 pages
pptcs1661
No ratings yet
pptcs1661
38 pages
B DWM Lab Manual Zil
No ratings yet
B DWM Lab Manual Zil
114 pages
Enforcing Data Quality
No ratings yet
Enforcing Data Quality
28 pages
Data Mining and Data Warehouse - Mukesh Prasad Chaudhary
No ratings yet
Data Mining and Data Warehouse - Mukesh Prasad Chaudhary
651 pages
Defining Data Mining and Data Warehouse (Adugna Gutema)
No ratings yet
Defining Data Mining and Data Warehouse (Adugna Gutema)
9 pages
Pre Processing
No ratings yet
Pre Processing
52 pages
Unit 2 Data Mining
No ratings yet
Unit 2 Data Mining
69 pages
Unit I DATA MINING AAGAC
No ratings yet
Unit I DATA MINING AAGAC
27 pages
UNIT-1 Introduction: Motivation: Why Data Mining?
No ratings yet
UNIT-1 Introduction: Motivation: Why Data Mining?
86 pages
Fundamentals of Data Science
No ratings yet
Fundamentals of Data Science
108 pages
Chapter 4-INFORMATION MANAGEMENT
No ratings yet
Chapter 4-INFORMATION MANAGEMENT
39 pages
ADBMS ppt2
No ratings yet
ADBMS ppt2
158 pages
2020 intro
No ratings yet
2020 intro
58 pages
Unit 2
No ratings yet
Unit 2
22 pages
Mastering Data Mining with Python – Find patterns hidden in your data
From Everand
Mastering Data Mining with Python – Find patterns hidden in your data
Megan Squire
No ratings yet
Method Statement Pigmented Plaster Repair
No ratings yet
Method Statement Pigmented Plaster Repair
10 pages
Cata Basic PDF
No ratings yet
Cata Basic PDF
23 pages
Detailed Lesson Plan For Grade 10
No ratings yet
Detailed Lesson Plan For Grade 10
9 pages
Modelling and Control of Ball-Plate System PDF
100% (1)
Modelling and Control of Ball-Plate System PDF
22 pages
The Future of Work and Robotics
No ratings yet
The Future of Work and Robotics
30 pages
OPM Forecasting
No ratings yet
OPM Forecasting
57 pages
Verification and Validation With Edits by Dan Fleck: Coming Up: Objectives
No ratings yet
Verification and Validation With Edits by Dan Fleck: Coming Up: Objectives
41 pages
CH 100: Fundamentals For Chemistry: Chapter 1: Introduction Lecture Notes
No ratings yet
CH 100: Fundamentals For Chemistry: Chapter 1: Introduction Lecture Notes
132 pages
Chapter 1
No ratings yet
Chapter 1
48 pages
C MCQS
No ratings yet
C MCQS
38 pages
B2 English Test With Answer Key
100% (2)
B2 English Test With Answer Key
10 pages
MCQ 1
No ratings yet
MCQ 1
2 pages
Azani Bin Omar PDF
No ratings yet
Azani Bin Omar PDF
5 pages
RSCH2111 Q2 Written Work 1 - Attempt Review
100% (1)
RSCH2111 Q2 Written Work 1 - Attempt Review
4 pages
Zwsa Business Plan 2011-12 PDF
No ratings yet
Zwsa Business Plan 2011-12 PDF
36 pages
Light and Eye Review Answer Key
No ratings yet
Light and Eye Review Answer Key
7 pages
Irrationally Yours: (Allprofits From This Book Support Research)
No ratings yet
Irrationally Yours: (Allprofits From This Book Support Research)
5 pages
12.therapeutic Landscape-Therapeutic Landscapes and Healing Gardens-Review Chinese&Western Literature Rivew
No ratings yet
12.therapeutic Landscape-Therapeutic Landscapes and Healing Gardens-Review Chinese&Western Literature Rivew
13 pages
PS3 Stata
No ratings yet
PS3 Stata
3 pages
Game Theory On Gold Industry
No ratings yet
Game Theory On Gold Industry
10 pages
Mcs 10096
No ratings yet
Mcs 10096
3 pages
GRDCF013028 PDF
No ratings yet
GRDCF013028 PDF
5 pages
Defense University, College of Engineering Project Proposal On
No ratings yet
Defense University, College of Engineering Project Proposal On
15 pages
ESC 103 Exam Dec 2012 v1.1
No ratings yet
ESC 103 Exam Dec 2012 v1.1
9 pages
Chapter-5-transportation-and-assignment-models
No ratings yet
Chapter-5-transportation-and-assignment-models
29 pages
Alappuzha HSS
No ratings yet
Alappuzha HSS
39 pages
Assign 1
No ratings yet
Assign 1
2 pages
Saad Review
No ratings yet
Saad Review
3 pages