Synthesis Lectures on Data
Management
Editor
M. Tamer Özsu, University of Waterloo
Synthesis Lectures on Data Management is edited by Tamer Özsu of the University of Waterloo.
The series will publish 50- to 125-page publications on topics pertaining to data management. The
scope will largely follow the purview of premier information and computer science conferences,
such as ACM SIGMOD, VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include,
but are not limited to, the following: query languages, database system architectures,
transaction management, data warehousing, XML and databases, data stream systems, wide scale
data distribution, multimedia data management, data mining, and related subjects.
An Introduction to Duplicate Detection
Felix Naumann and Melanie Herschel
2010
Privacy-Preserving Data Publishing: An Overview
Raymond Chi-Wing Wong and Ada Wai-Chee Fu
2010
Keyword Search in Databases
Jeffrey Xu Yu, Lu Qin, and Lijun Chang
2009
Copyright 2010 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in
printed reviews, without the prior permission of the publisher.
An Introduction to Duplicate Detection
Felix Naumann and Melanie Herschel
www.morganclaypool.com
ISBN: 9781608452200 paperback
ISBN: 9781608452217 ebook
DOI 10.2200/S00262ED1V01Y201003DTM003
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON DATA MANAGEMENT
Lecture #3
Series Editor: M. Tamer Özsu, University of Waterloo
Series ISSN
Synthesis Lectures on Data Management
Print: 2153-5418     Electronic: 2153-5426
An Introduction to
Duplicate Detection
Felix Naumann
Hasso Plattner Institute, Potsdam
Melanie Herschel
University of Tübingen
SYNTHESIS LECTURES ON DATA MANAGEMENT #3
ABSTRACT
With the ever increasing volume of data, data quality problems abound. Multiple, yet different
representations of the same real-world objects in data, duplicates, are one of the most intriguing data
quality problems. The effects of such duplicates are detrimental; for instance, bank customers can
obtain duplicate identities, inventory levels are monitored incorrectly, catalogs are mailed multiple
times to the same household, etc.
Automatically detecting duplicates is difficult: First, duplicate representations are usually
not identical but slightly differ in their values. Second, in principle all pairs of records should be
compared, which is infeasible for large volumes of data. This lecture examines closely the two main
components to overcome these difficulties: (i) Similarity measures are used to automatically identify
duplicates when comparing two records. Well-chosen similarity measures improve the effectiveness
of duplicate detection. (ii) Algorithms are developed to perform on very large volumes of data in
search for duplicates. Well-designed algorithms improve the efficiency of duplicate detection. Finally,
we discuss methods to evaluate the success of duplicate detection.
KEYWORDS
data quality, data cleansing, data cleaning, ETL, similarity measures, entity matching,
object matching, record linkage
Contents

1  Data Cleansing: Introduction and Motivation
   1.1  Data Quality
        1.1.1  Data Quality Dimensions
        1.1.2  Data Cleansing
   1.2  Causes for Duplicates
        1.2.1  Intra-Source Duplicates
        1.2.2  Inter-Source Duplicates
   1.3  Use Cases for Duplicate Detection
        1.3.1  Customer Relationship Management
        1.3.2  Scientific Databases
        1.3.3  Data Spaces and Linked Open Data
   1.4  Lecture Overview

2  Problem Definition
   2.1  Formal Definition
   2.2  Complexity Analysis
   2.3  Data in Complex Relationships
        2.3.1  Data Model
        2.3.2  Challenges of Data with Complex Relationships

3  Similarity Functions
   3.1  Token-based Similarity
        3.1.1  Jaccard Coefficient
        3.1.2  Cosine Similarity Using Token Frequency and Inverse Document Frequency
        3.1.3  Similarity Based on Tokenization Using q-grams
   3.2  Edit-based Similarity
        3.2.1  Edit Distance Measures
        3.2.2  Jaro and Jaro-Winkler Distance
   3.3  Hybrid Functions
        3.3.1  Extended Jaccard Similarity
        3.3.2  Monge-Elkan Measure
        3.3.3  Soft TF/IDF
   3.4  Measures for Data with Complex Relationships
   3.5  Other Similarity Measures
   3.6  Rule-based Record Comparison
        3.6.1  Equational Theory
        3.6.2  Duplicate Profiles

4  Duplicate Detection Algorithms
   4.1  Pairwise Comparison Algorithms
        4.1.1  Blocking
        4.1.2  Sorted-Neighborhood
        4.1.3  Comparison
   4.2  Algorithms for Data with Complex Relationships
        4.2.1  Hierarchical Relationships
        4.2.2  Relationships Forming a Graph
   4.3  Clustering Algorithms
        4.3.1  Clustering Based on the Duplicate Pair Graph
        4.3.2  Clustering Adjusting to Data & Cluster Characteristics

5  Evaluating Detection Success
   5.1  Precision and Recall
   5.2  Data Sets
        5.2.1  Real-World Data Sets
        5.2.2  Synthetic Data Sets
        5.2.3  Towards a Duplicate Detection Benchmark

6  Conclusion and Outlook

Bibliography

Authors' Biographies
C H A P T E R 1
Data Cleansing: Introduction and Motivation
With the ever increasing volume of data and the ever improving ability of information systems to
gather data from many, distributed, and heterogeneous sources, data quality problems abound. One
of the most intriguing data quality problems is that of multiple, yet different representations of the
same real-world object in the data. For instance, an individual might be represented multiple times
in a customer database, a single product might be listed many times in an online catalog, and data
about a single type of protein might be stored in many different scientific databases.
Such so-called duplicates are difficult to detect, especially in large volumes of data. Simulta-
neously, they decrease the usability of the data, cause unnecessary expenses, customer dissatisfaction,
incorrect performance indicators, and they inhibit comprehension of the data and its value. Note
that duplicates in the context of this lecture are not exact replicas but exhibit slight or even large
differences in the individual data values. Thus, such duplicates are also called fuzzy duplicates. In the
traditional context of database management systems, duplicates are exact copies of records. They are
easy to detect, simply by sorting the data by all columns, and they are thus not the topic of this book: We
use the term duplicate to refer to fuzzy duplicates. In Table 1.1, records r1 and r2 are exact duplicates,
while r3 is a fuzzy duplicate of either r1 or r2, probably representing the same real-world object.
Table 1.1: Exact duplicates r1 and r2 and fuzzy duplicates r1 and r3

      FN     LN    Phone            email
r1    John   Doe   (407) 356 8888   john@doe.com
r2    John   Doe   (407) 356 8888   john@doe.com
r3    Jon    Doe   (407) 356 8887   john@doe.com
Duplicates in databases of organizations have many detrimental effects, for instance:
- Fraudulent bank customers can obtain duplicate identities, thus possibly receiving more credit.

- Inventory levels are monitored incorrectly if duplicate products are not recognized. In such
  cases, the same products are restocked in separate orders, and quantity discounts cannot be
  gained.

- The total revenue of preferred customers is unknown because it is spread over multiple
  instances. Gold customers are not recognized and are dissatisfied.

- Catalogs are mailed multiple times to the same household. Householding is a special case
  of person duplicate detection where it is not the same person but persons from the same
  household who are recognized.

- Duplicate data causes unnecessary IT expenses, simply due to the additional volume that
  must be maintained and backed up.
Both researchers and developers in industry have tackled this problem for many decades. It was
first examined in 1969 [Fellegi and Sunter, 1969] and has since spawned many research projects and
software products. Methods for duplicate detection are included in almost every data warehouse and
ETL suite and have reached such mundane places as Microsoft Outlook's contacts database. In this
lecture, we motivate and formally define the problem of duplicate detection. We examine closely the
two main components towards its solution: (i) Similarity measures are used to automatically identify
duplicates when comparing two records. Well-chosen similarity measures improve the effectiveness
of duplicate detection. (ii) Algorithms are developed to perform on very large volumes of data in
search for duplicates. Well-designed algorithms improve the efficiency of duplicate detection. Finally,
we discuss methods to evaluate the success of duplicate detection.
Figure 1.1 shows a typical duplicate detection process. A set of records R is suspected to
contain duplicates. Conceptually, the cross-product of all records is formed, and among those, an
algorithm chooses promising candidate pairs, usually those that have a sufficiently high similarity
at first glance. These record pairs are then input to a similarity measure, which produces a more
sophisticated similarity score. Finally, similarity thresholds are used to decide whether the pair
is indeed a duplicate or not. In some cases, an automated decision is not possible, and a human expert
must inspect the pair in question. One of the goals of duplicate detection is to minimize this set of
unsure duplicates without compromising the quality of the overall result.
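In code, the overall process can be sketched in a few lines. The following Python sketch is an
illustration only: the functions candidate_pairs() and sim() and the two thresholds theta_low and
theta_high are placeholders for the techniques discussed in Chapters 3 and 4, not part of any
specific system.

# Illustrative sketch of the process in Figure 1.1; candidate selection,
# similarity measure, and thresholds are placeholders, not prescribed here.
from itertools import combinations

def detect_duplicates(records, candidate_pairs, sim, theta_low, theta_high):
    duplicates, non_duplicates, unsure = [], [], []
    for r1, r2 in candidate_pairs(records):   # promising pairs among R x R
        score = sim(r1, r2)                   # similarity measure (Chapter 3)
        if score >= theta_high:
            duplicates.append((r1, r2))
        elif score < theta_low:
            non_duplicates.append((r1, r2))
        else:
            unsure.append((r1, r2))           # left to a human expert
    return duplicates, non_duplicates, unsure

def all_pairs(records):
    # Naive candidate selection: enumerate the full cross-product of pairs.
    return combinations(records, 2)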
A slightly different scenario is that of entity search. There, the goal is not to rid a large dataset
of duplicates but rather to insert new records into a dataset without producing new duplicates. For
instance, customer support specialists solicit the name and address of a customer via phone and must be
able to immediately recognize whether this customer is already represented in the database. The two
scenarios are sometimes distinguished as batch duplicate detection vs. duplicate search or as offline
and online duplicate detection. Because of the largely differing requirements, relevant techniques
differ, and the topic of entity search warrants a separate lecture. Thus, it is not covered in this lecture.
An obvious next step after detecting duplicates is to combine or merge them and thus produce
a single, possibly more complete representation of that real-world object. During this step, possible
data conflicts among the multiple representations must somehow be resolved. This second step
is called data fusion and is not covered in this lecture. Instead, we refer to a survey on data fusion
[Bleiholder and Naumann, 2008]. An alternative approach to handle duplicates after their discovery
is to somehow link them, thus representing their duplicate status without actually fusing their data.
Figure 1.1: A prototypical duplicate detection process
The process of duplicate detection is usually embedded in the more broadly defined process of
data cleansing, which not only removes duplicates but performs various other steps, such as address
normalization or constraint verification, to improve the overall data quality. In the sections of this
chapter, we give an overview of this field of data quality beyond duplicates. We then motivate many
causes for inadvertent duplicate creations. Finally, we present several use cases to display the various
areas in which duplicate detection plays an important role in an overall information management
environment.
1.1 DATA QUALITY
Information and data quality are wide and active research and development areas, both from an
information system and a database perspective. They are used synonymously. Broadly, data quality
is defined by Tayi and Ballou [1998] as "fitness for use", which is usually broken down into a set of
quality dimensions. An in-depth coverage of the topic is given in the book by Batini and Scannapieco
[2006]. Here, we briefly cover the main issues of data quality with a particular focus on duplicates. First,
we mention several pertinent quality dimensions and then cover various aspects of data cleansing,
which are usually performed before duplicate detection.
1.1.1 DATA QUALITY DIMENSIONS
In their seminal paper, Wang and Strong [1996] elicit fifteen quality dimensions from ques-
tionnaires given to data consumers in the industry; many other classifications are discussed by
Batini and Scannapieco [2006]. These widely cited dimensions are categorized as intrinsic (believ-
ability, accuracy, objectivity, reputation), contextual (value-added, relevancy, timeliness, completeness,
appropriate amount of data), representational (interpretability, ease of understanding, representa-
tional consistency, concise representation), and accessibility (accessibility, access security). Obviously,
some of the dimensions are highly subjective, and others are not related to duplicates. Duplicate
detection and the subsequent fusion of duplicates directly improve quality in the dimensions accuracy
and completeness.
Accuracy is the extent to which data are correct, reliable and free of error. It is usually measured
as the number of records with errors compared to the overall number of records. Accuracy is usually
improved because multiple but different representations of the same real-world object usually imply
some conflicting and thus inaccurate data in at least one of the representations; otherwise, the
duplicate would have had identical values and would have been easily recognized.
Completeness is defined by Naumann et al. [2004] as the extent to which data are of sufficient
depth, breadth and scope for the task at hand. Completeness is often sub-divided into intensional
completeness, i.e., the completeness per record, and extensional completeness, i.e., the number of
records compared to the complete domain. In general, intensional completeness can be improved
through duplicate detection because, in many cases, the multiple representations cover data from
different properties; one customer entry might contain a phone number, another entry about the
same individual might contain an email address. The combined data are more complete.
Almost all other dimensions are indirectly improved if duplicates are handled appropriately
by the information management system. For instance, believability is enhanced because users can
assume that a given particular record is the only record about that entity.
1.1.2 DATA CLEANSING
A concrete measure to improve quality of data is to directly modify the data by correcting errors and
inconsistencies. Data cleansing (data cleaning and data cleansing are used synonymously in the
literature) is a complex set of tasks that takes as input one or more sets of data
and produces as output a single, clean data set. Data cleansing tasks include look-up checks, format
transformations, currency conversions, constraint verification, deduplication, data fusion, and many
others.
A prominent paradigm for data cleansing is the Extract-Transform-Load (ETL) process,
usually built to populate a data warehouse: The extraction phase is responsible for gathering source
data and placing it into a so-called staging area. The staging area is usually a relational database
system that supports various data cleansing steps. These are performed sequentially or in parallel
in the transformation stage. Finally, the load phase is responsible for loading the extracted and
cleansed data into the data warehouse / data cube. Figure 1.2 shows a prototypical ETL process
with its individual phases and several data cleansing steps. The names of the individual operators
have been chosen to reflect typical labels used in ETL products. In the process, two extractors gather
data from different sources. A cleansing step specific to one source is performed. Next, a funnel step
builds the union of the two sources. A lookup is performed for records with missing zip code data.
The match operator performs duplicate detection; the survive operator performs data fusion. Finally,
a load operator populates the customer dimension of the data warehouse.
Figure 1.2 shows a rather simple process; larger organizations often manage hundreds of more
complex ETL processes, which need to be created, maintained, scheduled, and efficiently executed
Figure 1.2: A prototypical ETL process
[Albrecht and Naumann, 2008]. ETL engines specialize in efficient, usually parallel, execution during
times of low load on the transactional system.
Most cleansing tasks are very efficient because they can be executed in a pipeline; therefore,
each record can be cleansed individually. For instance, conversions of data values from one measure
to another (inch to cm; $ to €) or transforming phone numbers or dates to a standard pattern can
be performed independently of other values in that column. Such steps are often referred to as data
scrubbing. The intrinsic efficiency of data scrubbing operations does not carry over to duplicate detection. In
principle, each record must be compared to all other records in order to find all duplicates. Chapter 4
describes techniques to avoid this quadratic complexity, but a simple record-by-record approach is
not applicable.
Data cleansing is an important pre-processing step before duplicate detection. The more effort
is expended here and the cleaner the data become, the better duplicate detection can perform. For
instance, a typical standardization is to abbreviate all occurrences of the term "Street" in an address
column to "St.". Thus, the addresses "20 Main Street" and "20 Main St." appear as identical strings
after standardization. Without standardization, duplicate detection would have to allow for slight
deviations to still capture the duplicity of the two addresses. However, this allowance of deviation
might already be too loose: "20 Main St." and "20 Maine St." could incorrectly be recognized as the
same.
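As a minimal sketch of such a standardization step, assuming a small, hand-written list of
abbreviation rules (the rules and the function name below are illustrative, not taken from any
product):

# Illustrative address standardization; the abbreviation rules are examples only.
import re

ABBREVIATIONS = {r"\bStreet\b": "St.", r"\bAvenue\b": "Ave.", r"\bRoad\b": "Rd."}

def standardize_address(address):
    for pattern, abbreviation in ABBREVIATIONS.items():
        address = re.sub(pattern, abbreviation, address)
    return address

standardize_address("20 Main Street")  # -> "20 Main St."
standardize_address("20 Main St.")     # -> "20 Main St."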
1.2 CAUSES FOR DUPLICATES
One can distinguish two main causes for duplicates in data. First, data about a single entity might
be inadvertently entered multiple times into the same database; we call such duplicates intra-source
duplicates. Second, duplicates appear when integrating multiple data sources, each of which might
have a different representation of a single entity; we call such duplicates inter-source duplicates.
While the causes for duplicates might differ, methods to detect them are largely the same.
What again differs is the treatment of the duplicates once they are discovered: Intra-source duplicates
usually undergo some data fusion process to produce a new and enhanced single representation. Inter-
source duplicates, on the other hand, are usually merely annotated as such. Combining them is then
left to the application that uses the data.
1.2.1 INTRA-SOURCE DUPLICATES
Duplicates within a single data source usually appear when data about an entity are entered without
(properly) checking for existing representations of that entity. A typical case is when a customer calls
in a new order. While entering the customer data, the system or the clerk does not recognize that
this customer is already represented in the database. This can happen due to poor online duplicate
detection methods or because of significant changes in the customer data: a new last name, a new
address, etc. Another source of error is poor data entry: even if the customer data are unchanged, they
are entered differently. Causes are, for instance,

- inadvertent typos: 1969 vs. 1996, Felix vs. FElix
- misheard words: Mohammed vs. Mohammad
- difficult names: Britney vs. Britny, Spears vs. Speers, or Vaithyanathan
- lacking spelling abilities: Philadela
Apart from such outright errors, other causes of errors, and thus of duplicates, are different
conventions and data entry formats. This is especially a problem when users, rather than trained data
entry clerks, enter their data themselves, as is the case with many online shopping portals and other
points of registration on the Web. Figure 1.3 shows five conference name tags of members from
the same institute. On the conference registration site, each participant entered his or her own data.
While the names and affiliations are all correct, it is not trivial for a computer to decide whether
their affiliations are all the same.
Figure 1.3: Five different representations of the same conference attendee affiliation
A typical non-human source of error is the automated scanning of paper documents. Even
though this optical character recognition (OCR) technology is very advanced and considered solved
for non-cursive machine- and handwriting, not all hand-written or printed documents have the
necessary quality. Figure 1.4 shows an example of an incorrect scan. While letters with both addresses
are likely to arrive at the correct, same destination, it would be difficult to trace the letter: A database
search for the correct name and address would probably not produce that shipment.
Figure 1.4: Erroneous character recognition for Jens Bleiholder of the Hasso Plattner Institute
1.2.2 INTER-SOURCE DUPLICATES
The problems mentioned above with respect to data entry are also valid for inter-source duplicates.
When data about a single individual are entered differently in each source, they constitute a duplicate
when the sources are integrated. However, when integrating data sources, there are yet more causes
for duplicates:
Different data entry requirements. Even when entering the same information about a real-world
object, different sources might have different data entry requirements. For instance, one source
might allow abbreviated first names while the other enters the full first name. Titles of a person
might be entered in front of the first or in front of the last name. Numerical values might be
recorded at different precision levels, data might be entered in different languages, etc. All these
are reasons why semantically identical information is represented differently. These differences are
difficult to specify, making duplicate detection a challenging problem.
Different data entry times. Different sources might enter data about a real-world object at different
points in time. Thus, data about an object might differ. For instance, the address of a person
can change, temperatures vary, prices fluctuate, etc. Again, these differences lead to different
representations of the same real-world object and must be accounted for by similarity measures.
Heterogeneous schemata. In general, different data sources use different schemata to represent
the same real-world entities. Schemata can be aligned using schema mapping technology. For
instance, the attributes phone and fon have the same semantics and can be mapped to a
common attribute in the integrated result. However, not all sources cover all attributes. For
instance, one source may store the weight of a product, while the other source may not. Such
missing attributes are padded with NULL values in the integrated result. Thus, even when
sources agree on all common attributes, they are not necessarily exact duplicates.
Padding with NULL values (⊥) may lead to subsumed or complemented records. Table 1.2
shows three records from three different sources. Records r1 and r2 subsume r3, and r1 is
complemented by r2.
Table 1.2: Records r1 and r2 subsume r3, and r1 is complemented by r2.

r1    John   Doe   (407) 356 8888   ⊥
r2    John   Doe   ⊥                (407) 358 7777
r3    John   Doe   ⊥                ⊥
Detecting such duplicates does not require a similarity measure and is thus easier than the
general case with contradicting values. We do not cover the efficient detection of subsumed or
complemented records but instead refer the reader to articles by Bleiholder et al. [2010], by
Galindo-Legaria [1994], and by Rao et al. [2004].
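For illustration, one possible reading of these two relationships is sketched below, with records
represented as attribute-to-value dictionaries, None standing for the NULL value ⊥, and
hypothetical attribute names; the articles cited above give the precise definitions and efficient
detection algorithms.

# Sketch of subsumption and complementation tests; one possible reading only.
def subsumes(r1, r2):
    # r1 agrees with r2 on all of r2's non-NULL values and carries strictly
    # more non-NULL values than r2.
    agree = all(r1.get(a) == v for a, v in r2.items() if v is not None)
    more = (sum(v is not None for v in r1.values())
            > sum(v is not None for v in r2.values()))
    return agree and more

def complements(r1, r2):
    # r1 and r2 agree wherever both are non-NULL, and each record fills at
    # least one NULL value of the other.
    agree = all(r1[a] == r2[a] for a in r1
                if a in r2 and r1[a] is not None and r2[a] is not None)
    fills_r1 = any(r1.get(a) is None and v is not None for a, v in r2.items())
    fills_r2 = any(r2.get(a) is None and v is not None for a, v in r1.items())
    return agree and fills_r1 and fills_r2

# Mirroring Table 1.2 with hypothetical attribute names:
a = {"FN": "John", "LN": "Doe", "Phone": "(407) 356 8888", "Fax": None}
b = {"FN": "John", "LN": "Doe", "Phone": None, "Fax": "(407) 358 7777"}
c = {"FN": "John", "LN": "Doe", "Phone": None, "Fax": None}
# subsumes(a, c) == True, subsumes(b, c) == True, complements(a, b) == True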
1.3 USE CASES FOR DUPLICATE DETECTION
Duplicate detection comes in many guises and for many different applications. Apart from the term
duplicate detection, which we use throughout this lecture, the broad problem of detecting multiple
representations of the same real-world objects has been named doubles, mixed and split citation
problem, record linkage, fuzzy/approximate match, object/instance/reference/entity matching, ob-
ject identification, deduplication, object consolidation, entity resolution, entity clustering, identity
uncertainty, reference reconciliation, hardening soft databases, merge/purge, household matching,
householding, and many more. Each mention might have a slightly different application or problem
definition in mind, but each needs two basic components: a similarity measure and an algorithm to
explore the data.
We now discuss several typical use cases for duplicate detection, each with a slightly different
problem angle. The classical use case is that of customer relationship management, where duplicates
within a large set of customer data are sought. Detecting duplicates in scientific databases empha-
sizes cleansing of data that is distributed and heterogeneous. Finally, the linked-open-data use case
emphasizes more the unstructured and dynamic nature of data on the Web.
1.3.1 CUSTOMER RELATIONSHIP MANAGEMENT
Large organizations provide many points of contact with their customers: They might have an online
store or information site, they might have walk-in stores, they might provide a phone hotline, they
might have personal sales representatives, they might correspond via mail, etc. Typically, each point
of contact is served by a separate data store. Thus, information about a particular customer might be
stored in many different places inside the organization. In addition, each individual data set might
already contain duplicates.
Figuratively speaking, when organizations communicate with a customer, the left hand does
not know what the right hand is doing. Thus, revenue potential might be lost and customers might
be dissatised; it is likely that most readers have experienced having to state the same information
many times over. Managers have recognized the need to centrally integrate all information about
a customer in so-called customer relationship management (CRM) systems. Apart from efficiently
providing up-to-date customer information to the various applications of an organization, CRM
systems must recognize and eliminate duplicates.
The domain of persons and companies is probably the most developed with respect to duplicate
detection. Many specialized similarity measures have been devised specifically with that domain in
mind. For instance, a duplicate detection system might encode a high similarity between the two
dates-of-birth 5/25/71 and 1/1/71 even though they differ significantly: January 1st is a typical
default value if only the year-of-birth of a customer is known. Other person-specific examples
include accentuation-based similarity measures for last names, nickname-based similarity measures
for first names, and specialized similarity measures for streets and zip codes.
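A hedged sketch of such a domain-specific rule, assuming dates of birth are given as
month/day/year strings and treating January 1st as a possible placeholder value (the score values
are illustrative only):

# Illustrative date-of-birth comparison that discounts the 1/1 default value.
from datetime import datetime

def date_of_birth_similarity(d1, d2, fmt="%m/%d/%y"):
    a, b = datetime.strptime(d1, fmt), datetime.strptime(d2, fmt)
    if a == b:
        return 1.0
    if a.year == b.year and ((a.month, a.day) == (1, 1) or (b.month, b.day) == (1, 1)):
        return 0.8   # same year of birth; one date is likely a default value
    return 0.0

date_of_birth_similarity("5/25/71", "1/1/71")   # -> 0.8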
1.3.2 SCIENTIFIC DATABASES
Scientific experiments and observations produce vast amounts of data. Prominent examples are the
results of the human genome project or images produced by astronomical observation. Many such
data sources are publicly available and thus very good candidates for integration. Many projects have
arisen with the primary goal of integrating scientific databases of a certain universe of discourse, such
as biological databases [Stein, 2003]. All such projects must solve the task of duplicate detection.
If real-world objects or phenomena are observed and documented multiple times, possibly with
different attributes, it is useful to recognize this and interlink or combine these representations. For
instance, many proteins have been studied numerous times under different conditions. Combining
and aggregating such data might avoid costly experiments [Hao et al., 2005; Trißl et al., 2005].
To this end, researchers have developed specialized similarity measures to cope with the fact
that genomic data has two directly conflicting properties: (i) As it is generated in high-throughput
experiments producing millions of data points in a single run, the data often have a high level of
noise. (ii) Even slight variations in genomic sequences may be important to tell a real duplicate from
a wrong one (think of the same gene in different yet closely related species).
Due to the vast amounts of data, scientific databases are usually not integrated into a large
warehouse, but they are annotated with links to other representations in other databases [Karp, 1996].
Thus, the task of a researcher exploring a specific object is eased. In fact, large efforts have been in-
vested to develop systems that allow integrated queries across these multiple sources [Bleiholder et al.,
2004; Shironoshita et al., 2008].
1.3.3 DATA SPACES AND LINKED OPEN DATA
Duplicate detection is also relevant in scenarios beyond typical, well-formed data. We illustrate
two use cases from domains with more heterogeneous data. The first use case is the prime example
for data spaces [Halevy et al., 2006], namely personal information management [Dittrich et al., 2009;
Dong and Halevy, 2005]. Dataspaces comprise complex, diverse, interrelated data sources and are a
generalization of traditional databases. The second, related use case introduces linked open data as
a new form of publishing data on the (semantic) web. For both use cases, we show why duplicate
detection is a necessary but difcult task.
Personal information management (PIM) can be considered as the management of an inte-
grated view of all our personal data that, for instance, reside on our desktop, our laptop, our PDA,
our MP3 player, etc. In addition to the heterogeneity of devices storing data, PIM also has to deal
with the heterogeneity of data themselves, as relevant data encompass emails, files, address books,
calendar entries, and so on. Clearly, PIM data do not correspond to the uniformly structured data
we consider throughout this lecture (this is generally true for data in data integration scenarios; we
chose PIM here because the related challenges are particularly accentuated). For instance, a person
described in an address book and a
. For instance, a person described in an address book and a
person being tagged on a picture may be duplicates, but it is highly unlikely that they share any
other attributes (assuming we can distinguish attributes) than their name. Clearly, in this type of
application, we need to first identify what data describe what type of object, and what data are se-
mantically equivalent before being able to perform any reasonable comparisons. In data integration,
this preprocessing step to duplicate detection is known as schema matching. For structured data,
various algorithms for schema matching have been proposed [Rahm and Bernstein, 2001].
In PIM, due to the high variability of applications and data these applications store about
objects, candidates of the same type may not share a sufficient number of attributes to decide if
they represent the same real-world object. An extreme example where no common attribute exists
is shown in Figure 1.5: Figure 1.5(a) shows an email whereas Figure 1.5(b) depicts a calendar entry.
The email is addressed to naumann@hpi.uni-potsdam.de and concerns the Synthesis Lecture (see
subject). The calendar describes a meeting that also contains Synthesis Lecture in its title, and the
location refers to Felix' office at HPI. Both the email and the calendar entry refer to the same person,
i.e., Felix Naumann, but based on the email address and the first name alone, it is very difficult to
come to this conclusion. As a matter of fact, we most definitely cannot detect the duplicate reference
to Felix Naumann by simply comparing person candidates based on their descriptions, assuming
these have been extracted from the available data.
Beyond personal information, more and more data are publicly available on the Web. However,
it is difficult to query or integrate this data. The semantic web community introduces and advocates
the concept of linked open data (LOD) [Bizer et al., 2009b]. By assigning a URI to each object,
referencing these objects with links, using the HTTP protocol, and finally by providing data sets
openly (usually as sets of RDF triples), the hope is to create a large web of data from many sources that
can be queried and browsed easily and efficiently. Indeed, a rapidly growing number of sources, such
Figure 1.5: Personal Information Management (PIM) data. (a) An email from Melanie Herschel
<melanie.herschel@uni-tuebingen.de> to naumann@hpi.uni-potsdam.de, dated October 21, 2009,
with subject "Synthesis Lecture". (b) A calendar entry "1:00 PM Synthesis Lecture Wrap-up" with
location "Office Felix, HPI".
as DBpedia [Bizer et al., 2009a] or Freebase (http://freebase.com), constitute a wealth of easily
accessible information. As of May 2009, the W3C reports over 70 data sets consisting of over
4.7 billion RDF triples, which are interlinked by around 142 million RDF links
(http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData). In many
cases, these links represent duplicate
entries. For instance, the entry of the Bank of America in Freebase stores a link to the corresponding
Wikipedia entry.
Linked open data are from many very different domains of public, commercial, and scientific
interest. The data sets are provided by many, often overlapping sources, making duplicate detection
more important than ever. Much of the linking is currently performed manually, but automated
methods can help. Because much of the data are heterogeneously structured and have large textual
parts, similarity measures must be adapted accordingly. In addition, the sheer volume of data makes
it all the more important to use efficient algorithms to find duplicate candidates.
1.4 LECTURE OVERVIEW
The remainder of this lecture is organized as follows: Chapter 2 formally introduces the problem
of duplicate detection and describes its complexity. The following two chapters address the two
specific problems of duplicate detection: Chapter 3 describes similarity measures to decide if two
candidates are in fact duplicates, and Chapter 4 describes algorithms that decide which candidates
to even consider. Finally, Chapter 5 shows how to evaluate the success of duplicate detection and
Chapter 6 concludes the lecture. While this lecture is intended to serve as an overview and primer
for the field of duplicate detection, we refer interested readers to a recent and more comprehensive
literature survey by Elmagarmid et al. [2007].
C H A P T E R 2
Problem Definition
As we have seen in Chapter 1, duplicate detection is a problem of highly practical relevance, and there
is a need for efficient and effective algorithms to tackle the problem. Before delving into the details of
such algorithms in the subsequent chapters, we first formally define the problem. More specifically,
we start the discussion by providing a definition of the duplicate detection task in Section 2.1. This
definition focuses on duplicate detection in data stored in a single relation, a focus we maintain
throughout this lecture. We then discuss the complexity of the problem in Section 2.2. Finally,
in Section 2.3, we highlight issues and opportunities that exist when data exhibit more complex
relationships than a single relation.
2.1 FORMAL DEFINITION
Before we define duplicate detection, we summarize important concepts of the relational model and
introduce their notation.
Given a relation R, we denote the schema of the relation as S_R = ⟨a1, a2, ..., an⟩. Each ai
corresponds to an attribute. A record stored in a relation R assigns a value to each attribute. We use
two alternative notations for a record r, depending on which notation is better suited in the context it
is used in: The first notation implies the schema and simply lists values, i.e., r = ⟨v1, v2, ..., vn⟩; the
second notation explicitly lists attribute-value pairs, that is, r = ⟨(a1, v1), (a2, v2), ..., (an, vn)⟩. We
assume that all data we consider for duplicate detection conform to a given schema.
Example 2.1 Figure 2.1 represents a relational table named Movie. The schema of the Movie
relation is ⟨MID, Title, Year, Review⟩. The Movie relation contains four records, one being
⟨m1, Big Fish, 2003, 8.1/10⟩.
To properly define duplicate detection, we first have to decide which object representations
are subject to deduplication. For instance, do we search only for duplicates within movies? Or do we
MID   Title            Year   Review
m1    Big Fish         2003   8.1/10
m2    Public Enemies   2009   7.4/10
m3    Public Enemies   09
m4    Big Fisch        2003   7.5/10

Figure 2.1: Sample Movie relation
search within all movies that were inserted or updated since the last deduplication of the database?
Or do we want to identify duplicates only for those movies that do not have a review score? To specify
the set of records that is subject to duplicate detection, we can simply formulate a query that retrieves
these records. In our relational scenario, a possible SQL query to identify all records that describe
movies that appeared since 2009 is simply
SELECT *
FROM Movie
WHERE Year >= 2009
The result of such a query is a set of records that corresponds to the set of records that need
to be deduplicated. We refer to this set of records as the set of duplicate candidates, or candidates for
short. A candidate has a specific type T that corresponds to the type of object that it represents, for
instance, a movie. We denote the set of all candidates of type T as C_T = {c1, c2, ..., cn}.
The next aspect we have to consider is that not all information included in an object rep-
resentation is relevant for duplicate detection. In the relational context, this means that not every
attribute of a record is relevant. For instance, whereas movie titles are relevant for duplicate detection
as they are very descriptive of movies, the attribute Review is less descriptive of a movie, and hence,
it is not suited to distinguish duplicates of a movie from non-duplicates. This observation leads us
to the general definition of an object description.
Given a candidate c, the object description of c, denoted as OD(c), is a set of attribute-value
pairs that are descriptive of candidate c. That is, OD(c) = {(a1, v1), (a2, v2), ..., (am, vm)}. For a
given candidate type T, the set of descriptive attributes of any candidate of type T is the same, i.e.,
all corresponding object descriptions contain attribute-value pairs for attributes {a1, a2, ..., am}. As
a consequence, object descriptions are usually defined on the schema level, and we can retrieve object
descriptions by specifying SQL queries that return the desired attributes. As a simple example, we
can retrieve the object description of movie m1 using the SQL query
SELECT Title, Year
FROM Movie
WHERE MID = 'm1'
Note that in practice, retrieving candidates and their corresponding descriptions are usually combined
into a single query. In our example, the SQL query is the same as the query for object descriptions
shown above, except that the WHERE-clause is removed.
Having defined the input to duplicate detection algorithms, let us now move to the definition
of the output. The ultimate goal of duplicate detection is to partition the set of candidates based on
information provided by their object descriptions. Each partition returned by a duplicate detection
algorithm contains candidates that all represent the same real-world object. Hence, we say that a
partition represents a real-world object as well. The second property of the resulting partitions is
that no two distinct partitions represent the same real-world object. We denote the final result of
duplicate detection of candidates in C_T as the partitioning P_T.
Example 2.2 Let us assume that our set of candidates includes all movies described in the
Movie relation shown in Figure 2.1. That is, C_Movie = {m1, m2, m3, m4}. Assuming that the
object description of a movie consists of Title and Year only, we, for instance, have OD(m1) =
{(Title, Big Fish), (Year, 2003)}. The ideal result of duplicate detection is to identify that both
the movie Big Fish and the movie Public Enemies are each represented twice in the Movie relation.
That is, ideally, a duplicate detection algorithm outputs P_Movie = {{m1, m4}, {m2, m3}}.
In contrast to our example, where duplicate detection results in a perfect classification, per-
forming duplicate detection in practice does not reach the gold standard. Indeed, the similarity
measure may miss some duplicates because their similarity is not high enough. These missed du-
plicates are referred to as false negatives. On the other hand, candidates may also be classified as
duplicates although they are not because, despite a careful choice of descriptions, they still are very
similar. We call this type of classification error false positives. As we discuss in Section 5.1, dupli-
cate detection algorithms are designed to reduce the occurrence of at least one of these error types.
Depending on the application, minimizing false positives may be more, less, or equally important to
minimizing false negatives, so the choice of a proper duplicate detection algorithm and configuration
depends on this characteristic.
The above definition of duplicate detection does not specify any details on algorithm specifics.
As we see in subsequent chapters, numerous algorithms exist for duplicate detection. These have
varying characteristics, and hence their runtime may significantly differ. A specific type of algorithm
widely used throughout the literature and industry is the class of iterative duplicate detection algorithms.
Iterative duplicate detection algorithms are characterized by the fact that they first detect
pairs of duplicates, and they then exploit the transitivity of the is-duplicate-of relationship to obtain
the desired duplicate partitions. Remember that two candidates are duplicates if they represent the
same real-world object. Clearly, if A is a duplicate of B, and B is a duplicate of C, it is true that A is a
duplicate of C, too.
Duplicate pairs are identified using a similarity measure sim(c, c′) that computes a similarity
score for two candidates c and c′. Given a similarity threshold θ, a pair of candidates is classified
using a similarity measure sim by

    classify(c, c′) = { c and c′ are duplicates       if sim(c, c′) > θ
                      { c and c′ are non-duplicates   if sim(c, c′) ≤ θ

Equivalently, a distance measure can be used instead of a similarity measure; the two are related,
for instance, by dist(c, c′) = 1 − sim(c, c′).
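The step from classified duplicate pairs to the partitioning P_T can be illustrated with a small
union-find sketch; it merely demonstrates the transitivity argument and is not a specific algorithm
from the literature (clustering strategies are the topic of Chapter 4).

# Sketch: derive the partitioning from duplicate pairs by exploiting the
# transitivity of the is-duplicate-of relationship (union-find structure).
def partitions_from_pairs(candidates, duplicate_pairs):
    parent = {c: c for c in candidates}

    def find(c):                       # representative of c's partition
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for c1, c2 in duplicate_pairs:     # merge the partitions of each pair
        parent[find(c1)] = find(c2)

    partitions = {}
    for c in candidates:
        partitions.setdefault(find(c), set()).add(c)
    return list(partitions.values())

# partitions_from_pairs({"m1", "m2", "m3", "m4"}, [("m1", "m4"), ("m2", "m3")])
# -> [{"m1", "m4"}, {"m2", "m3"}] (in some order), as in Example 2.2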
In our discussion, we distinguish different classes of measures. In Section 3.1, we discuss
token-based measures that take as input a pair of sets of tokens. Edit-based measures, covered in
Section 3.2, on the other hand, take as input a pair of strings. Token-based and edit-based measures
can be combined into hybrid measures, which we discuss in Section 3.3. All these measures only
consider object descriptions, and in Section 3.4, we describe several similarity measures that exploit
information from relationship descriptions as well. The discussion on similarity measures ends in
Section 3.5 with a brief description of further similarity measures. We then discuss another technique,
orthogonal to similarity measures, to classify pairs of candidates as duplicates or non-duplicates. The
essence of the technique is to define domain-specific rules (see Section 3.6).
3.1 TOKEN-BASED SIMILARITY
Token-based similarity measures compare two strings by first dividing them into a set of tokens
using a tokenization function, which we denote as tokenize(). Intuitively, tokens correspond to
substrings of the original string. As a simple example, assume the tokenization function splits a
string into tokens based on whitespace characters. Then, the string Sean Connery results in the set
of tokens {Sean, Connery}. As we will show throughout our discussion, the main advantage of
token-based similarity measures is that the similarity is less sensitive to word swaps compared to
similarity measures that consider a string as a whole (notably edit-based measures). That is, the
comparison of Sean Connery and Connery Sean will yield a maximum similarity score because both
strings contain the exact same tokens. On the other hand, typographical errors within tokens are
penalized, for instance, the similarity of Sean Connery and Shawn Conery will be zero.
We discuss three token-based measures in this section: The basic Jaccard coefficient can be
used to compute the similarity as discussed in Section 3.1.1. We then present a more sophisticated
similarity measure that introduces weights of tokens in Section 3.1.2. In defining these two measures,
we assume that tokens do not overlap, that is, each character of the original string exists in exactly one
token. In Section 3.1.3, we discuss how q-grams, which are considered overlapping tokens, are used
for similarity measurement. The main advantage of using overlapping tokens is that the similarity
is affected less by typographical errors than when using non-overlapping tokens.
3.1.1 JACCARD COEFFICIENT
The Jaccard coefficient is a similarity measure that, in its most general form, compares two sets P
and Q with the following formula:

    Jaccard(P, Q) = |P ∩ Q| / |P ∪ Q|                                                        (3.1)
Essentially, the Jaccard coefficient measures the fraction of the data that is shared between P
and Q, compared to all data available in the union of these two sets.
In the context of duplicate detection, the question is what are the sets P and Q? Throughout
the duplicate detection literature, we observe two basic uses of the Jaccard coefficient: tokens are
either obtained from strings [Monge and Elkan, 1996], or they correspond to complete descrip-
tions [Bilenko et al., 2003]. The first variant is useful to identify similar pairs of descriptions in
hybrid similarity measures (see Section 3.3) whereas the second variant computes the similarity of
candidates.
String comparison based on the Jaccard coefficient. Given a tokenization function tokenize(s)
that tokenizes a string s into a set of string tokens {s1, s2, ..., sn}, we compute the Jaccard
coefficient of two strings s1 and s2 as

    StringJaccard(s1, s2) = |tokenize(s1) ∩ tokenize(s2)| / |tokenize(s1) ∪ tokenize(s2)|    (3.2)
Candidate comparison based on the Jaccard coefficient. Given two candidates, we have seen in
Section 2.1 that these are compared based on their respective object description. The Jaccard
coefficient of two candidates c1 and c2 is given by

    DescriptionJaccard(c1, c2) = |OD(c1) ∩ OD(c2)| / |OD(c1) ∪ OD(c2)|                       (3.3)
Example 3.1 Consider a scenario where Person is a candidate type. One way to represent the name
of a person in a database is to store the complete name in one attribute, e.g., the Name attribute.
Hence, Person candidates have a description attribute Name. Now, consider two candidates where the
Name descriptions have values Thomas Sean Connery and Sir Sean Connery, respectively. Assuming
a tokenization function that separates a string into tokens based on whitespace, we obtain

    tokenize(Thomas Sean Connery) = {Thomas, Sean, Connery}
    tokenize(Sir Sean Connery) = {Sir, Sean, Connery}

We observe that among these two token sets, there are only four distinct tokens, and both Sean
and Connery appear in both sets. Consequently, applying the Jaccard coefficient to these two strings
results in

    StringJaccard(Thomas Sean Connery, Sir Sean Connery) = 2/4

To illustrate candidate comparison based on the Jaccard coefficient, let us now assume that
a person's name is split into several attributes in the Person relation, i.e., title, first name, middle
name, and last name. Keeping the same example, we now have

    OD(c1) = {(firstname, Thomas), (middlename, Sean), (lastname, Connery)}
    OD(c2) = {(title, Sir), (middlename, Sean), (lastname, Connery)}

In these two sets, we observe that the middlename and the lastname descriptions are equal in OD(c1)
and OD(c2), and in total, we have four distinct descriptions. As a result, again

    DescriptionJaccard(c1, c2) = 2/4
Based on this simple example, we point out several deficiencies of the Jaccard similarity.
First, we observe that the fact that one name is simply missing a title value penalizes the similarity
significantly. Intuitively, such a title (or the lack thereof) should have less impact on similarity
than the last name or first name, for instance. In the next section, we discuss the cosine similarity
measure that addresses this issue. A second drawback of Jaccard similarity is that it is very sensitive to
typographical errors in single tokens. For instance, Shean Conery and Sean Connery have a similarity
of zero. In Section 3.1.3, we show how token-based similarity measures can be adapted to penalize
typographical errors less. As a final remark, note that if Sean had been put in the firstname
attribute (which is likely to be the case, e.g., when a Web-form requires a first name but no middle
name), DescriptionJaccard would yield a result of 1/5, unless we specify that middlename and firstname
shall be considered equal. None of the similarity measures described in this chapter can explicitly
cope with this problem, and it is often assumed that such information is used when initializing
descriptions in order to avoid the problem.
An advantage of the Jaccard coefficient is that it is not sensitive to word swaps. Indeed, the
score of the two names John Smith and Smith John would correspond to the score of exactly equal strings
because the Jaccard coefficient considers only whether a token exists in a string, not at which position.
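A direct transcription of Equations 3.2 and 3.3 might look as follows; the whitespace tokenization
and the treatment of two empty sets are illustrative choices, not part of the definitions.

# Sketch of the Jaccard coefficient on whitespace tokens (Eq. 3.2) and on
# object descriptions given as sets of attribute-value pairs (Eq. 3.3).
def tokenize(s):
    return set(s.split())

def jaccard(p, q):
    return len(p & q) / len(p | q) if p | q else 1.0

def string_jaccard(s1, s2):
    return jaccard(tokenize(s1), tokenize(s2))

def description_jaccard(od1, od2):
    return jaccard(set(od1), set(od2))

string_jaccard("Thomas Sean Connery", "Sir Sean Connery")          # -> 0.5
description_jaccard({("firstname", "Thomas"), ("middlename", "Sean"),
                     ("lastname", "Connery")},
                    {("title", "Sir"), ("middlename", "Sean"),
                     ("lastname", "Connery")})                      # -> 0.5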
3.1.2 COSINE SIMILARITY USING TOKEN FREQUENCY AND INVERSE
DOCUMENT FREQUENCY
The cosine similarity is a similarity measure often used in information retrieval. In general, given two
n-dimensional vectors V and W, the cosine similarity computes the cosine of the angle θ between
these two vectors as

    CosineSimilarity(V, W) = cos(θ) = (V · W) / (||V|| ||W||)                                (3.4)

where ||V|| is the length of the vector V = [a, b, c, . . .], computed as √(a² + b² + c² + . . .). In
duplicate detection, the vectors V and W can represent either tokens in a string or descriptions of a
candidate. Note that we made the same distinction in the discussion of the Jaccard coefcient. From
now on, we illustrate only the case where tokens arise from tokenizing string values, but readers
should keep in mind that this is not the only possible use of the token-based measures we discuss.
Assuming we tokenize two strings s1 and s2 using a tokenization function tokenize(), the
question arises how we build the two vectors V and W from the tokens. We discuss the solution
in two steps: first, we discuss the dimensionality d of these two vectors before we focus on what
numbers are filled in these vectors.
Essentially, the d dimensions of these vectors correspond to all d distinct tokens that appear
in any string in a given finite domain, denoted as D. In our scenario, we assume that s1 and s2
originate from the same relational attribute, say a. In this case, D corresponds to all distinct tokens
in all values of a. For large databases, this number of tokens may be large, so the vectors V and W
have high dimensionality d in practice.
At this point of the discussion, we know how large the vectors V and W are. Let us now
discuss what data these vectors contain. For this, we first consider the term vector T of D, which, of
course, is d-dimensional as well. This term vector contains a weight for each of the d distinct tokens
in D. This weight reflects how relevant a token is in D, relative to other tokens. To weigh tokens, the
token frequency-inverse document frequency (tf-idf) is commonly used both in information retrieval
and duplicate detection. To define the tf-idf score, we first need to define its individual components,
namely tf and idf.
Essentially, the term frequency measures how often a token occurs in a given string value.
Intuitively, the term frequency reects the relevance of a token within a string value: the more often
a token occurs, the more relevant it is in the context of the string value. As an example, consider a
Web-page about the International Space Station ISS. The token ISS is likely to appear frequently
on this web page, definitely more frequently than on a web-page about the Airbus 380. This results
in a high term frequency of ISS on the ISS Web-page and a low term frequency of ISS on the Airbus
Web-page. Assuming that a token t appears in the value v of an object description of a candidate c
such that (a, v) ∈ OD(c), we denote its term frequency as tf_{t,c}.
The intuition behind the inverse document frequency is that it assigns higher weights to
tokens that occur less frequently in the scope of all candidate descriptions. This is useful as it assigns
low weights to common tokens in a domain, e.g., in a database listing insurance companies, the token
Insurance is likely to occur very frequently across object descriptions and the idf thus assigns it a lower
weight than to more distinguishing tokens such as Liberty or Prudential. Formally, we compute the
inverse document frequency of a token t occurring in the object description of a candidate c as

    idf_{t,c} = n / |{c | (a, v) ∈ OD(c) ∧ t ∈ tokenize(v)}|                                 (3.5)
The tf-idf score combines both the term frequency and the inverse document frequency into
a single score, using the following formula:
    tf-idf_{t,c} = log(tf_{t,c} + 1) · log(idf_{t,c})                                        (3.6)
As a reminder, n is the total number of candidates.
We compute the tf-idf score for every token ti ∈ D and set the i-th value in the term vector
T to this score. Finally, to create vectors V and W, we simply set the i-th value in V and W to the
i-th weight in T if the strings that V and W represent actually contain the i-th token, and to zero
otherwise. We can now compute the cosine similarity as described by Equation 3.4.
Example 3.2 Figure 3.1 shows ten US insurance companies, which we consider as candidates whose type we denote as T = IC. We provide candidate identifiers in Figure 3.1 for future reference. These are not part of an insurance company's description, which consists only of the company name. We observe that any token occurs at most once in a value of attribute Name, so the token frequency tf_{Insurance,c_i} is either 0 or 1 for a given candidate c_i. We further observe that among the ten candidates, six contain the token Insurance, so idf_{Insurance,c_i} = 10/6. Based on these values, we obtain, for instance, tf-idf_{Insurance,c_4} = log(1 + 1) · log(10/6) ≈ 0.07. Similarly, we obtain tf-idf_{Farmers,c_4} ≈ 0.30 and tf-idf_{Liberty,c_7} ≈ 0.30.
We can now compute the cosine similarity between the two strings s_1 = Farmers Insurance and s_2 = Liberty Insurance. Based on the tokens that s_1 and s_2 contain, we obtain the vectors V and W depicted in Figure 3.2, assuming the order of tokens in Q shown there.
CID   Name
c1    Allstate
c2    American Automobile Association
c3    American National Insurance Company
c4    Farmers Insurance
c5    GEICO
c6    John Hancock Insurance
c7    Liberty Insurance
c8    Mutual of America Life Insurance
c9    Safeway Insurance Group
c10   Westfield

Figure 3.1: Sample table listing insurance companies
Q             V      W
Allstate      0      0
America       0      0
Automobile    0      0
Association   0      0
National      0      0
Insurance     0.07   0.07
Company       0      0
Farmers       0.30   0
GEICO         0      0
John          0      0
Hancock       0      0
Liberty       0      0.30
...           0      0

Figure 3.2: Example on computing the cosine similarity between Farmers Insurance and Liberty Insurance
Using Equation 3.6, we finally obtain

CosineSimilarity(V, W) = 0.07² / (√(0.07² + 0.30²))² ≈ 0.05.

We observe that the similarity score is extremely low although, at first sight, the strings overlap in half their tokens. However, the token they overlap in is Insurance, which has a very low weight because it appears in a large fraction of the candidates. As a consequence, the similarity score is low, too. Note that in this example, the Jaccard similarity yields a similarity of 0.33.
From the example above, we see that the cosine similarity better reflects the distinguishing power of tokens than the Jaccard similarity, due to the different weights assigned to tokens. These weights can be computed using tf-idf, but we could use any other weight function as well, for instance, a function specifically designed for a particular domain. Note that it is also possible to use the weight function in combination with other similarity measures than the cosine similarity.
We further point out that it is often the case that, as in our example, attribute values used as descriptions contain a token at most once, yielding sparse vectors when using tf-idf as weight function. However, it is easy to see that the vectors can easily be compressed to include only the positions where at least one of the two vectors has a non-zero value, which actually corresponds to a vector containing only tokens of tokenize(s_1) ∪ tokenize(s_2).
Besides the advantage that the relevance of tokens is better reflected using cosine similarity with tf-idf weights, the cosine similarity is also not sensitive to word swaps. A drawback of the cosine similarity as described in this section is the fact that it cannot cope with typographical errors. Two means to overcome this problem are the use of q-grams, discussed next, and the definition of the softTFIDF, which we discuss in Section 3.3.3.
3.1.3 SIMILARITY BASED ON TOKENIZATION USING q-GRAMS
In q-gram based similarity measures, tokens are not determined based on special characters such as whitespace or punctuation. Instead, a string is divided into smaller tokens of size q. These tokens are referred to as q-grams or n-grams. Another difference to the tokens we discussed previously is that these tokens overlap, i.e., one character in a string appears in several tokens (actually, in exactly q tokens). To generate q-grams of size q, we slide a window of size q over the string to be tokenized, and each sequence of characters within the window is a token. So that the first and last characters of a string also appear in q tokens, we introduce a special character not in the initial alphabet and pad the string with this character.
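The padding and sliding window can be written down in a few lines. The following Python sketch (our own naming, with '#' as the padding character of Example 3.3 below) generates the overlapping q-grams of a string.

def qgrams(s, q=3, pad_char="#"):
    """Tokenize a string into overlapping q-grams.

    The string is padded with q-1 copies of pad_char on both sides so that
    its first and last characters also appear in q of the resulting tokens."""
    padded = pad_char * (q - 1) + s + pad_char * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

# qgrams("Henri Waternoose") starts with ['##H', '#He', 'Hen', 'enr', 'nri', ...]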
Example 3.3 Generating q-grams. Consider strings s_1 = Henri Waternoose and s_2 = Henry Waternose. We generate 3-grams (also called trigrams) for the two strings, which results in the sets of 3-grams below. Note that we use an underscore (_) to represent a whitespace, and the padding characters at the beginning and the end of a string are denoted as #.

q-grams of s_1 = {##H, #He, Hen, enr, nri, ri_, i_W, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e##}
q-grams of s_2 = {##H, #He, Hen, enr, nry, ry_, y_W, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e##}
Given two sets of q-grams, it is possible to use these tokens to compute the token-based similarity measures described in Sections 3.1.1 and 3.1.2, which, as the following example illustrates, results in a similarity score that is less sensitive to typographical errors than the previously described measures.
Example 3.4 q-gram based token similarity computation. Let us reuse the two strings s_1 and s_2 and their corresponding sets of q-grams described in Example 3.3. We observe that the two token sets overlap in 13 q-grams, and we have a total of 22 distinct q-grams. Using the Jaccard similarity (see Section 3.1.1), we obtain StringJaccard(s_1, s_2) = 13/22 ≈ 0.59. Using the cosine similarity with tf-idf weights, we obtain

CosineSimilarity(V, W) = (1.04² · 13) / (√(1.04² · 13 + 1.34² · 5) · √(1.04² · 13 + 1.34² · 4)) ≈ 0.64

where V and W are the q-gram sets of s_1 and s_2, respectively.
3.2 EDIT-BASED SIMILARITY
Let us now focus on a second family of similarity measures, so-called edit-based similarity measures. In contrast to token-based measures, strings are considered as a whole and are not divided into sets of tokens. To account for errors, such as typographical errors, word swaps, and so on, edit-based similarities allow different edit operations to transform one string into the other, e.g., insertion of characters, character swaps, deletion of characters, or replacement of characters.
3.2.1 EDIT DISTANCE MEASURES
In general, the edit distance between two strings s_1 and s_2 is the minimum cost of transforming s_1 into s_2 using a specified set of edit operations with associated cost functions. The cost of the transformation then simply is the sum of the costs of the individual edit operations. In this section, we limit the discussion to a simple variant of the edit distance that is known as the Levenshtein distance, and we show how to compute it using a dynamic programming algorithm. Readers interested in more details on the subject are invited to read the overview article on approximate string matching by Navarro [2001].
Given two strings s_1 and s_2, the Levenshtein distance LevDist(s_1, s_2) is equal to the minimum number of character insertions, deletions, and replacements necessary to transform s_1 into s_2.
Example 3.5 Levenshtein distance. As an example, consider the two strings s_1 = Sean and s_2 = Shawn. The Levenshtein distance of these two strings is 2, as we need to (i) replace the e in s_1 by an h and (ii) insert a w into s_1 to transform s_1 into s_2. Obviously, there are infinitely many possibilities to transform s_1 into s_2, e.g., we may delete all characters of s_1 and subsequently insert all characters of s_2. However, the number of edit operations in this case would be 9, which is not minimal.
A popular algorithm to compute the Levenshtein distance is based on dynamic programming. It starts by initializing a (|s_1| + 1) × (|s_2| + 1) matrix M, where |s| denotes the length of a string s. Once initialized, we fill the matrix M with values computed using the equations below. We denote the value in the i-th column and j-th row of M as M_{i,j}, with 0 ≤ i ≤ |s_1| and 0 ≤ j ≤ |s_2|. The i-th character in a string s_1 is denoted as s_{1,i}.
M_{i,0} = i     (3.7)
M_{0,j} = j     (3.8)
M_{i,j} = M_{i-1,j-1}                                    if s_{1,i} = s_{2,j}
M_{i,j} = 1 + min(M_{i-1,j}, M_{i,j-1}, M_{i-1,j-1})     otherwise     (3.9)
Computation proceeds from the top left of the matrix to the bottom right. Once all values in the matrix have been computed, the result of the algorithm, which corresponds to the Levenshtein distance, can be retrieved from M_{|s_1|,|s_2|}.
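The following Python sketch spells out this dynamic program. It is a direct transcription of Equations 3.7 to 3.9; the function name is ours.

def levenshtein_distance(s1, s2):
    """Fill the (|s1|+1) x (|s2|+1) matrix M of Equations 3.7-3.9 and return
    the Levenshtein distance found in its bottom right corner."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # Equation 3.7: first column
        M[i][0] = i
    for j in range(n + 1):          # Equation 3.8: first row
        M[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):   # Equation 3.9
            if s1[i - 1] == s2[j - 1]:
                M[i][j] = M[i - 1][j - 1]
            else:
                M[i][j] = 1 + min(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return M[m][n]

# levenshtein_distance("Sean", "Shawn") returns 2, as in Example 3.5.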
Example 3.6 Computing the Levenshtein distance using dynamic programming. Consider again the two strings s_1 = Sean and s_2 = Shawn. Figure 3.3(a) shows the matrix M after we initialized the matrix and computed the values of the first row and first column.
(a) State after initializing first row and first column:

      S  h  a  w  n
   0  1  2  3  4  5
S  1
e  2
a  3
n  4

(b) State after applying Equation 3.9 to the second row:

      S  h  a  w  n
   0  1  2  3  4  5
S  1  0  1  2  3  4
e  2
a  3
n  4

(c) Final state with edit distance stored in bottom right corner:

      S  h  a  w  n
   0  1  2  3  4  5
S  1  0  1  2  3  4
e  2  1  1  2  3  4
a  3  2  2  1  2  3
n  4  3  3  2  2  2

Figure 3.3: Computing the edit distance using a dynamic programming approach
Let us now focus on Figure 3.3(b), which shows the state of the algorithm after processing the second row of the matrix. At M_{1,1}, we see that the first characters of s_1 and s_2 match, so we look at position M_{0,0} to determine the value we should fill in (first line of Equation 3.9). All remaining fields in this row are filled using the second line of Equation 3.9 because none of the other characters in s_2 matches the first character, i.e., S, of s_1. It is worth pointing out that the first line of Equation 3.9 essentially represents a diagonal lookup of a value, whereas the second line looks up and left of the current position, determines the minimum of all three fields, and adds one to this minimum cost. The final result of the algorithm corresponds to Figure 3.3(c)¹. According to this matrix, LevDist(Sean, Shawn) = 2.

¹ It is possible to determine the actual operations to transform s_1 into s_2 with minimum cost; however, this is not pertinent to duplicate detection, and we thus omit the discussion.
As mentioned at the beginning of this section, the Levenshtein distance is a special case of an edit distance as it uses unit weights and three basic edit operators (insert, delete, and replace a character). Although widely used in practice for duplicate detection, the Levenshtein distance is not a suitable similarity measure when whole segments of a string differ, e.g., when one string is a prefix of the second string (Prof. John Doe vs. John Doe) or when strings use abbreviations (Peter J Miller vs. Peter John Miller). These problems are primarily due to the fact that all edit operations have equal weight and that each character is considered individually. Further edit distance measures have been proposed to cope with these problems, including the Smith-Waterman distance and the use of affine gaps. We do not discuss these measures in detail (they all use concepts and algorithms similar to those used to compute the Levenshtein distance) but highlight their effect on the final distance score.
Intuitively, the Smith-Waterman distance allows us to identify the longest common subexpression between two strings [Smith and Waterman, 2001]. For instance, it determines that the longest common subexpression between s_1 = Prof. John Doe and s_2 = John Doe is the string John Doe. Based on this output of the Smith-Waterman distance, we can now argue that this common subexpression actually corresponds to s_2, which is simply missing a prefix compared to s_1. Conceptually, we divide the strings into a prefix, a common subexpression, and a suffix and can assign lower weights to the
insertion or deletion of blocks of size l (a block corresponds to a prefix or suffix) than to the insertion of l individual characters. Hence, using the Smith-Waterman distance, the existence of a prefix or a suffix is less penalized than in the Levenshtein distance.
The Smith-Waterman distance divides the original strings into a prefix, a common subexpression, and a suffix. This, however, is not sufficient to lower the penalty of gaps within the string, e.g., due to abbreviations, missing words, etc. Essentially, it does not reflect the case where several common subexpressions, divided by non-matching character sequences, exist. This problem is addressed by the affine gap distance [Waterman et al., 1976]: It allows further edit operations, notably the insertion and deletion of complete blocks within a string, and assigns these block insertions and deletions less weight than the insertion or deletion of the individual characters in a block. As an example, let s_1 = Peter J Miller and s_2 = Peter John Miller. We see two common subexpressions, i.e., Peter J and _Miller, and one block of characters that needs to be inserted in order to transform s_1 into s_2, i.e., ohn. The cost of this block can be computed as the cost o of opening a block plus the cost e of extending the block by one character. Hence, the cost of inserting this block is, for instance, o + 2e = 1 + 2 · 0.1 = 1.2, given o = 1 and e = 0.1. This cost is lower than the cost of inserting three characters in the Levenshtein distance (the cost would be 3 using unit weights for character insertion). Consequently, the affine gap distance makes it possible to penalize non-matching blocks of characters within a string less, which is beneficial when comparing strings that contain abbreviations or that are missing some tokens.
3.2.2 JARO AND JARO-WINKLER DISTANCE
In this section, we discuss two similarity measures that account for mismatches in two strings by
allowing character transpositions, an edit operation we have not seen so far. Also, their computation
is not based on dynamic programming.
The Jaro similarity [Jaro, 1989] essentially compares two strings s_1 and s_2 by first identifying characters common to both strings. Two characters are said to be common to s_1 and s_2 if they are equal and if their positions within the two strings, denoted as i and j, respectively, do not differ by more than half of the length of the shorter string. Formally, |i − j| ≤ 0.5 · min(|s_1|, |s_2|). Once all common characters have been identified, both s_1 and s_2 are traversed sequentially, and we determine the number t of transpositions of common characters, where a transposition occurs when the i-th common character of s_1 is not equal to the i-th common character of s_2. Given the set C of common characters and the number of transpositions t, the Jaro similarity is computed as
JaroSim = 1/3 · ( |C|/|s_1| + |C|/|s_2| + (|C| − 0.5 t)/|C| )
Example 3.7 Let s_1 = Prof. John Doe and s_2 = Dr. John Doe. Thus, |s_1| = 14 and |s_2| = 12. The maximum distance between two common characters then is 0.5 · min(12, 14) = 6. The set of common characters is C = {r, ., _, J, o, h, n, _, D, o, e}, where _ denotes a space character. We illustrate the common characters in Figure 3.4.
P r o f . _ J o h n _ D o e
D r . _ J o h n _ D o e
Figure 3.4: Matching characters between Prof. John Doe and Dr. John Doe used by the Jaro distance
We see that none of the matching lines crosses another matching line, which indicates that none of the common characters leads to a transposition, so we have t = 0. The final Jaro similarity thus is

JaroSim(s_1, s_2) = 1/3 · (11/12 + 11/14 + (11 − 0)/11) ≈ 0.9
The Jaro similarity generally performs well for strings with slight spelling variations. However, due to the restriction that common characters have to occur within a certain distance from each other, the Jaro distance does not cope well with longer strings separating common characters. As a simple example, consider s_1 = Professor John Doe and s_2 = John Doe. We can identify only two common characters, as illustrated in Figure 3.5. This yields a low similarity score of 0.45 although s_1 is the same as s_2 except for a (long) prefix. The same exercise can be repeated for the case where one string is equal to the other except for an additional suffix.
P r o f e s s o r _ J o h n _ D o e
J o h n _ D o e
Figure 3.5: Matching characters between Professor John Doe and John Doe used by the Jaro distance
In the domain of person names, which is a widely studied domain with highly specialized measures, it is true that the problem described above mostly occurs for names with a common prefix but where one name has an additional suffix (e.g., Peter J vs. Peter John stored as a first name). An extension of the Jaro similarity, called the Jaro-Winkler similarity [Winkler and Thiboudeau, 1991], considers this special case. Given two strings s_1 and s_2 with a common prefix p, the Jaro-Winkler similarity is computed as

JaroWinklerSim(s_1, s_2) = JaroSim(s_1, s_2) + |p| · f · (1 − JaroSim(s_1, s_2))

where f is a constant scaling factor that controls how much the similarity is corrected upwards based on the common prefix p.
Example 3.8 Let s_1 = Peter and s_2 = Peter Jonathan. These two strings have 5 common characters that correspond to the common prefix p = Peter. Clearly, there are no transpositions of characters in p, so t = 0. It follows that JaroSim(s_1, s_2) ≈ 0.78. Assuming a scaling factor f = 0.1, the Jaro-Winkler similarity is equal to

JaroWinklerSim(s_1, s_2) = 0.78 + 5 · 0.1 · (1 − 0.78) = 0.89
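As a sketch of how the two measures could be implemented, the following Python functions follow the description above. The character-matching step uses a simple greedy left-to-right strategy, which is one possible way to pick the common characters and may occasionally select a different set than a hand-drawn matching such as Figure 3.4; for Example 3.8 it reproduces the values 0.78 and 0.89. All names are ours, and the prefix length is not capped (many practical implementations cap it at four characters).

def jaro_similarity(s1, s2):
    """Jaro similarity as described above: common characters may be at most
    0.5 * min(|s1|, |s2|) positions apart; t counts positions at which the
    two sequences of common characters disagree."""
    if not s1 or not s2:
        return 0.0
    max_dist = 0.5 * min(len(s1), len(s2))
    used = [False] * len(s2)
    common1, positions2 = [], []
    for i, ch in enumerate(s1):
        for j, ch2 in enumerate(s2):
            if not used[j] and ch == ch2 and abs(i - j) <= max_dist:
                used[j] = True
                common1.append(ch)
                positions2.append(j)
                break
    c = len(common1)
    if c == 0:
        return 0.0
    common2 = [s2[j] for j in sorted(positions2)]
    t = sum(1 for a, b in zip(common1, common2) if a != b)
    return (c / len(s1) + c / len(s2) + (c - 0.5 * t) / c) / 3.0

def jaro_winkler_similarity(s1, s2, f=0.1):
    """Boost the Jaro similarity by the length of the common prefix p."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * f * (1 - jaro)

# jaro_winkler_similarity("Peter", "Peter Jonathan") is approximately 0.89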
3.3 HYBRID FUNCTIONS
In Section 3.1, we discussed token-based similarity measures that divide the data used for comparisons into sets of tokens, which are then compared based on equal tokens. We further discussed similarity measures that keep data as a whole in the form of a string and that compute the similarity of strings based on string edit operations that account for differences in the compared strings.
In this section, we discuss similarity measures that combine both tokenization and string similarity in computing a final similarity score. We refer to these algorithms as hybrid similarity functions. The measure covered in Section 3.3.1 extends the Jaccard similarity to also include similar tokens in the set of overlapping descriptive data. The second hybrid measure, discussed in Section 3.3.2, is the Monge-Elkan measure. Finally, Section 3.3.3 covers an extension of the cosine similarity using tf-idf for weight computation.
3.3.1 EXTENDED JACCARD SIMILARITY
We first describe two extensions to the Jaccard similarity that have been proposed throughout the literature [Ananthakrishna et al., 2002; Weis and Naumann, 2005]. The first extension accounts for similar tokens, and the second extension introduces weight functions.
Let s_1 and s_2 be two strings that can be divided into sets of tokens by a tokenization function, denoted as tokenize( ). In the original definition of the Jaccard similarity (see Section 3.1.1), only equal tokens are part of the intersection of the token sets of s_1 and s_2. The idea behind the first extension of the Jaccard similarity is to also include similar tokens in the intersection to allow for small errors, e.g., typographical errors between shared tokens.
Formally, let TokenSim(t_1, t_2) be a string similarity measure that compares two tokens t_1 ∈ tokenize(s_1) and t_2 ∈ tokenize(s_2). Any of the string similarity measures discussed in Section 3.2 can be applied here. We define the set of shared similar tokens between s_1 and s_2 as

Shared(s_1, s_2) = {(t_i, t_j) | t_i ∈ tokenize(s_1) ∧ t_j ∈ tokenize(s_2) : TokenSim(t_i, t_j) > θ_string}

where θ_string is a secondary similarity threshold. Similarly, we define the tokens unique to s_1 as

Unique(s_1) = {t_i | t_i ∈ tokenize(s_1) ∧ (t_i, t_j) ∉ Shared(s_1, s_2)}

Analogously, the tokens unique to s_2 are defined by

Unique(s_2) = {t_j | t_j ∈ tokenize(s_2) ∧ (t_i, t_j) ∉ Shared(s_1, s_2)}
A second extension of the Jaccard similarity that is commonly used in combination with the previous one is to introduce a weight function w for matching and non-matching tokens. For instance, token pairs in Shared may get a weight that corresponds to their similarity. An aggregation function A then aggregates the individual weights.
Using the two extensions above, the hybrid Jaccard similarity is defined as

HybridJaccard = A_{(t_i,t_j) ∈ Shared(s_1,s_2)} w(t_i, t_j) / ( A_{(t_i,t_j) ∈ Shared(s_1,s_2)} w(t_i, t_j) + A_{t_i ∈ Unique(s_1)} w(t_i) + A_{t_j ∈ Unique(s_2)} w(t_j) )
Note that instead of a secondary string similarity measure, we may also use a string distance measure to identify similar tokens. In this case, we simply replace TokenSim(t_i, t_j) > θ_string by TokenDist(t_i, t_j) ≤ θ_string. As a final remark, it is also possible to use different weight functions for Shared(s_1, s_2), Unique(s_1), and Unique(s_2). For instance, if the token similarity is used as weight function for Shared(s_1, s_2), we have to use another function for both Unique sets because we simply have no similarity scores within these sets.
Example 3.9 Hybrid Jaccard similarity. Let s_1 = Henri Waternoose and s_2 = Henry Peter Waternose. Using any of the token-based similarity measures we discussed, the similarity is zero, assuming tokenization based on whitespace. Indeed, the two strings do not share any equal token. Using the hybrid Jaccard similarity measure, we obtain

Shared(s_1, s_2) = {(Henri, Henry), (Waternoose, Waternose)}

when using the Levenshtein distance and θ_string = 1 to determine similar tokens. It follows that Unique(s_1) = ∅ and Unique(s_2) = {Peter}. Let us further assume that we use unit weights for the Unique sets, whereas the weight of similar tokens (t_i, t_j) is given by 1 − LevDist(t_i, t_j) / max(|t_i|, |t_j|). The aggregation function A simply sums up individual weights. Based on these assumptions, the result of the hybrid Jaccard similarity is

HybridJaccard(s_1, s_2) = (0.8 + 0.9) / (0.8 + 0.9 + 0 + 1) ≈ 0.63
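A compact Python sketch of this hybrid measure, with the weighting choices of Example 3.9 (unit weights for unique tokens, normalized Levenshtein similarity for shared pairs, summation as aggregation function A), could look as follows. The function name is ours; token_dist is any token distance, e.g., the Levenshtein distance from the sketch in Section 3.2.1.

def hybrid_jaccard(s1, s2, token_dist, theta=1):
    """Hybrid Jaccard similarity: Shared collects all token pairs whose
    distance is at most theta; Unique collects tokens that occur in no
    shared pair; shared pairs are weighted by 1 - dist/max(len), unique
    tokens by 1, and weights are aggregated by summation."""
    t1, t2 = s1.split(), s2.split()
    shared = [(a, b) for a in t1 for b in t2 if token_dist(a, b) <= theta]
    unique1 = [a for a in t1 if all(a != pair[0] for pair in shared)]
    unique2 = [b for b in t2 if all(b != pair[1] for pair in shared)]
    w_shared = sum(1 - token_dist(a, b) / max(len(a), len(b)) for a, b in shared)
    denominator = w_shared + len(unique1) + len(unique2)
    return w_shared / denominator if denominator else 0.0

# hybrid_jaccard("Henri Waternoose", "Henry Peter Waternose", levenshtein_distance)
# yields (0.8 + 0.9) / 2.7, i.e., about 0.63, as in Example 3.9.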
3.3.2 MONGE-ELKAN MEASURE
Let us now discuss the Monge-Elkan similarity measure [Monge and Elkan, 1996]. Intuitively, it first tokenizes two strings s_1 and s_2 and matches every token t_i from s_1 to the token t_j in s_2 that has the maximum similarity to t_i, i.e., for which TokenSim(t_i, t_j) is maximal. These maximum similarity scores obtained for every token of s_1 are then summed up, and the sum is normalized by the number of tokens in s_1. Formally, the Monge-Elkan similarity is defined by
MongeElkanSim(s_1, s_2) = 1/|tokenize(s_1)| · Σ_{i=1}^{|tokenize(s_1)|} max_{j=1,...,|tokenize(s_2)|} TokenSim(t_i, t_j)
Example 3.10 Let us compare the two strings s_1 = Henri Waternoose and s_2 = Henry Peter Waternose. The token of s_2 with maximum similarity to Henri is obviously Henry, and the most similar string to Waternoose is Waternose. Let us assume that the two maximum similarity scores equal 0.8 and 0.9, respectively. Then, we obtain

MongeElkanSim(s_1, s_2) = (0.8 + 0.9) / 2 = 0.85
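A Python sketch of the Monge-Elkan measure is equally short; token_sim is any secondary token similarity, and the function name is ours.

def monge_elkan(s1, s2, token_sim):
    """Average, over the tokens of s1, of the maximum similarity to any
    token of s2 (Monge-Elkan similarity)."""
    tokens1, tokens2 = s1.split(), s2.split()
    if not tokens1 or not tokens2:
        return 0.0
    return sum(max(token_sim(a, b) for b in tokens2) for a in tokens1) / len(tokens1)

With a token similarity that rates Henri/Henry at 0.8 and Waternoose/Waternose at 0.9, this reproduces the 0.85 of Example 3.10.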
3.3.3 SOFT TF/IDF
The final hybrid similarity measure we discuss extends the cosine similarity based on tf-idf in the same spirit as the extension of the Jaccard similarity: similar tokens, as determined by a secondary string similarity measure, further increase the similarity. Again, let TokenSim(t_1, t_2) be a secondary string similarity function used to compare tokens. Let Close(θ_string, s_1, s_2) be the set of tokens t_i ∈ tokenize(s_1) such that there is some t_j ∈ tokenize(s_2) where TokenSim(t_i, t_j) > θ_string. That is,

Close(θ_string, s_1, s_2) = {t_i | t_i ∈ tokenize(s_1) ∧ ∃ t_j ∈ tokenize(s_2) : TokenSim(t_i, t_j) > θ_string}
Note that, as opposed to the definition of Shared(s_1, s_2) that we use to compute the extended Jaccard similarity, Close(θ_string, s_1, s_2) only includes tokens from string s_1 and not from s_2. Tokens from s_2 that are similar to tokens from s_1 included in Close are considered using the following equation, which determines the most similar token t_j ∈ tokenize(s_2) to any t_i ∈ Close(θ_string, s_1, s_2):
maxsim(t_i, t_j) = max_{t_j ∈ tokenize(s_2)} TokenSim(t_i, t_j)
Then, the hybrid version of the cosine similarity measure, called softTFIDF, is defined as

SoftTFIDF(s_1, s_2) = Σ_{t_i ∈ Close(θ_string, s_1, s_2)} ( tf-idf_{t_i} / ||V|| · tf-idf_{t_j} / ||W|| · maxsim(t_i, t_j) )
where V and W are the vector representations of s_1 and s_2 containing the tf-idf scores described in Section 3.1.2.
Example 3.11 SoftTF/IDF. Let us illustrate softTFIDF reusing the strings s_1 = Henri Waternoose and s_2 = Henry Peter Waternose. As an example, assume that the vector representations of these two strings, respectively, are

V = {0.6, 0.6, 0, 0, 0}
W = {0, 0, 0.5, 0.3, 0.6}

We determine Close(θ_string, s_1, s_2) = {Henri, Waternoose} because for both of these tokens, a similar token exists in s_2. Then, the hybrid cosine similarity is
softTFIDF(s_1, s_2) = 0.6/√(0.6² + 0.6²) · 0.5/√(0.5² + 0.3² + 0.6²) · 0.8 + 0.6/√(0.6² + 0.6²) · 0.6/√(0.5² + 0.3² + 0.6²) · 0.9 ≈ 0.79
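For illustration, the following Python sketch computes softTFIDF from precomputed tf-idf weights. The parameter names and the dictionary-based representation of the weights are ours; token_sim is the secondary similarity and theta the threshold θ_string.

import math

def soft_tfidf(tokens1, tokens2, weights1, weights2, token_sim, theta):
    """softTFIDF: every token of s1 with a sufficiently similar partner in s2
    contributes the product of the two normalized tf-idf weights and the
    maximum token similarity. weights1/weights2 map tokens to tf-idf scores."""
    norm1 = math.sqrt(sum(w * w for w in weights1.values()))
    norm2 = math.sqrt(sum(w * w for w in weights2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    score = 0.0
    for t_i in tokens1:
        best_token, best_sim = None, 0.0
        for t_j in tokens2:
            sim = token_sim(t_i, t_j)
            if sim > best_sim:
                best_token, best_sim = t_j, sim
        if best_sim > theta:  # t_i belongs to Close(theta, s1, s2)
            score += (weights1[t_i] / norm1) * (weights2[best_token] / norm2) * best_sim
    return score

Called with the tokens and weights of Example 3.11 and a token similarity that rates Henri/Henry at 0.8 and Waternoose/Waternose at 0.9, this returns approximately 0.79.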
The main difference between the Monge-Elkan similarity and softTFIDF is that the Monge-Elkan similarity assumes all tokens to have equal weight, whereas softTFIDF uses tf-idf scores to reflect the distinguishing power of a token in a token weight. Another, more subtle difference is that softTFIDF only considers similarities between tokens that are above a given threshold θ_string. Such a threshold is not present in the Monge-Elkan similarity, where any token in s_1 is matched to a token in s_2, no matter how low the similarity between these two tokens may be.
To summarize, all hybrid measures combine the benefits of both token-based and edit-based measures for duplicate classification: edit-based measures account for errors within tokens, whereas token-based measures account for errors related to missing tokens and token swaps.
3.4 MEASURES FOR DATA WITH COMPLEX RELATIONSHIPS
All measures discussed so far consider only the object description of candidates. As we have seen
in Section 2.3, it is possible to further consider relationships between candidates. In this section,
we provide an overview of similarity measures devised specically for duplicate detection in semi-
structured data, most notably, for XML data. More similarity measures that consider relationship
descriptions of candidates exist, but these usually are extensions of token-based similarity measures
that consider related candidates similarly to tokens [Weis, 2008].
The discussion of Section 2.3.2 shows that challenges for similarity measures arise when dealing with semi-structured XML data. As a reminder, these challenges arise due to element optionality, element context, and varying element cardinality. Throughout the literature, three similarity measures have explicitly been defined for XML.
The structure-aware XML distance measure [Milano et al., 2006] computes the distance between two XML elements based on their substructure, that is, the substructure stands in as a description of the candidate that corresponds to the root XML element. To compare two candidates, an overlay between their two subtrees is computed. Informally, an overlay computes a weighted 1:1 matching between the two subtrees, so that nodes or leaves are matched only if they have the same path from the root. That is, it is not possible to match the same XML elements in different contexts. The weight assigned to a match is based on a distance measure, e.g., the edit distance for string values in leaves. Several possible overlays may exist, and the goal is to determine an overlay with minimal cost (sum of match weights) and such that the overlay is not a proper substructure of any other possible overlay. Once such an overlay has been determined, the distance between the two candidates is simply the cost of the overlay. We observe that the different semantics of both element optionality and element context are not distinguished, but element cardinality is not a problem.
The approach presented by Leitão et al. [2007] constructs a Bayesian network, taking two XML elements as input, each rooted in the candidate element and having a subtree that corresponds to the description. Nodes in the Bayesian network represent duplicate probabilities of (i) a set of simple XML elements, (ii) a set of complex XML elements, (iii) a pair of complex elements, or (iv) a pair of simple elements². The algorithm that constructs the Bayesian network assumes that each XML element occurs in exactly one context. Probabilities are propagated from the leaves of the Bayesian network (which correspond to probabilities of pairs of simple elements) to the root and can be interpreted as similarities. As nodes represent either pairs or sets of elements, the different semantics of a missing element vs. a NULL value cannot be captured because the lack of an element results in the probability node not being created at all.
The DogmatiX similarity measure [Weis and Naumann, 2005] is aware of the three challenges described in Section 2.3.2 that arise when devising a similarity measure for XML data. However, DogmatiX does not distinguish between the different semantics that both element optionality and element context allow. On the other hand, DogmatiX distinguishes between XML element types and real-world types so that all candidates of the same type are treated as such, even though they may occur in a different context. Essentially, to compute the DogmatiX similarity, we form pairs of descriptions (there is no distinction between object description and relationship description) and divide them into two sets: similar description pairs and singleton descriptions. A similar description pair is defined as a pair of descriptions whose pairwise string similarity, for instance, computed based on the string edit distance, is above a given threshold. Any description that is not part of any similar description pair is a singleton description. The set of singleton descriptions is further pruned to account for the fact that some descriptions cannot have a similar partner due to different cardinalities of elements of that type. Both similar and singleton descriptions are weighted based on a variation of the inverse document frequency. The overall similarity is the ratio of the sum of weights of all similar description pairs over the sum of weights of all descriptions (similar or singleton).
In summary, we observe that all similarity measures proposed for XML duplicate detection cope with varying element cardinality, whereas only DogmatiX explicitly considers the problem of element context when generating descriptions. None of the similarity measures distinguishes between possibly different semantics caused by alternative representations of missing data or by different element contexts when computing a similarity score. It remains an open research issue to define measures that make these distinctions and to investigate how these distinctions affect the quality of duplicate detection. Besides specialized similarity measures for XML, we also point out algorithms that improve efficiency by pruning comparisons based on properties of hierarchically structured data in Section 4.2.1.

² A simple XML element is an XML element that nests only a text node. Complex XML elements, on the other hand, nest only XML elements and no text node.
3.5 OTHER SIMILARITY MEASURES
The measures discussed so far are the most widely used for duplicate detection. However, none of these measures is applicable to all conceivable scenarios. As we have seen, some of the measures specialize in short strings with few typographical errors, whereas others have been devised to be insensitive to word swaps. In this section, we summarize similarity measures for three further special cases, namely phonetic similarity, numeric similarity, and structural similarity.
Phonetic similarity. Whereas the previously discussed measures focus on string similarity, phonetic similarity focuses on the sounds of spoken words, which may be very similar despite large spelling differences. For instance, the two strings Czech and cheque are not very similar; however, they are barely distinguishable phonetically. Therefore, they have a large phonetic similarity. Soundex is a very common phonetic coding scheme, and the idea behind the computation of phonetic similarity is to first transform strings (or tokens) into their phonetic representation and to then apply string or token similarity measures on the Soundex representations [Bourne and Ford, 1961].
Numeric similarity. None of the discussed measures is of much use when comparing numerical data. Typically, numbers are simply treated as strings, which yields unsatisfactory results, for instance, when comparing 1999 and 2000. A solution is to measure the difference of the two numbers compared, e.g., by computing |1999 − 2000|. However, in different domains, the difference in numbers has different meanings. For instance, when measuring differences on a microscopic scale, a difference of 1 mm is a large difference, whereas on a macroscopic scale 1 mm is almost nothing. A possible way to normalize such a difference is to take the distribution of values in the domain into account (a small sketch of such a normalization follows after this list).
Structural similarity. As a final remark, we point out that none of the similarity measures discussed so far considers the structure of the data; they all focus on content. However, considering the structure may also be relevant, e.g., when comparing trees that correspond to XML data. The most widely known structural similarity measure is the tree edit distance and variations thereof [Shasha et al., 1994]. Essentially, the tree edit distance is an edit-based distance measure (such as the Levenshtein distance), but instead of allowing edit operations on characters of a string, it considers edit operations on nodes of a tree structure. Due to its high computational complexity, it is rarely used for duplicate detection.
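Picking up the numeric similarity item above, the following sketch scales the absolute difference of two numbers by the value range observed in the attribute's domain. This is only one of several possible ways to take the value distribution into account, and the function name is ours.

def numeric_similarity(x, y, domain_values):
    """Similarity of two numbers, normalized by the range of values
    observed for the attribute."""
    value_range = max(domain_values) - min(domain_values)
    if value_range == 0:
        return 1.0 if x == y else 0.0
    return max(0.0, 1 - abs(x - y) / value_range)

# numeric_similarity(1999, 2000, [1950, 1999, 2000, 2010]) is about 0.98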
3.6 RULE-BASED RECORD COMPARISON
Up to this point, we have discussed similarity measures and distance measures that compute a real-valued score. This score is then input to a duplicate classifier as described at the beginning of this chapter (see p. 23). As a reminder, if the similarity score, as returned by a similarity measure, is above a given threshold θ, the compared pair of candidates is classified as a duplicate, and as a non-duplicate otherwise.
In this section, we discuss a method complementary to similarity measures for classifying candidate pairs as duplicates or non-duplicates: Rule-based record comparison approaches build rules on attributes (or combinations of attributes) to make a classification. These rules may use similarity measures for attribute comparisons. An example of such a rule for two Person candidates c_1 and c_2 is:
first name of c_1 is similar to first name of c_2 ∧ last name of c_1 equals last name of c_2 ⇒ c_1 and c_2 are duplicates
We note that in contrast to similarity measures that apply to any string data from any domain, rule-based approaches make use of domain knowledge and are thus domain-specific approaches to duplicate classification. We discuss two variants of rule-based record comparisons, namely an approach that builds on an equational theory [Hernández and Stolfo, 1998] and profile-based comparisons [Doan et al., 2003; Weis et al., 2008].
3.6.1 EQUATIONAL THEORY
In this section, we describe an approach that compares two candidates c_1 and c_2 using implications of the form

P ⇒ c_1 ≡ c_2

where P is a complex predicate over attributes in the candidates' object descriptions, i.e., in OD(c_1) ∪ OD(c_2), and the equivalence relationship between c_1 and c_2 signifies that these candidates are considered duplicates. A set of such rules composes a domain-specific equational theory that allows us to classify candidates based on a set of rules.
Let us consider the predicate P more closely: P is a boolean expression that can be written in conjunctive normal form, i.e., as a conjunction of (disjunctions of) terms:

P = (term_{1,1} ∨ term_{1,2} ∨ ...) ∧ (term_{2,1} ∨ term_{2,2} ∨ ...) ∧ ... ∧ (term_{n,1} ∨ term_{n,2} ∨ ...)
Zooming in on the actual terms, the equational theory allows virtually any comparison between attribute values. However, to make sensible comparisons, these comparisons apply to common attributes of c_1 and c_2. Indeed, it would not make much sense to use gender information for comparison if only c_1 (and not c_2) contained gender information. We introduce an example that clarifies the concept of an equational theory.
Example 3.12 Equational theory. Consider a person database PersDB1 that contains the records depicted in Figure 3.6(a) and a second person database PersDB2 represented in Figure 3.6(b). We observe that not all information is represented in both databases. Indeed, whereas name and social security number (ssn) are represented in both sources (albeit using different representations in the case of person names), age is present only in PersDB1 and salary is stored only in PersDB2.

(a) PersDB1
ssn   fname   mname   lname    age
123   Peter   John    Miller   46
345   Jane            Smith    33
678   John    Jack    Doe      9

(b) PersDB2
ssn   name            salary
123   Peter Miller    80k
345   Jane B. Smith   60k
679   Jack John Doe   100k

Figure 3.6: Two sample Person databases
To identify duplicate persons among PersDB1 and PersDB2, suppose we use the following three rules to compare two candidate persons c_1 and c_2:

Rule 1: c_1.ssn = c_2.ssn ∧ concat(c_1.fname, c_1.mname, c_1.lname) ≈ c_2.name ⇒ c_1 ≡ c_2
Rule 2: c_1.ssn = c_2.ssn ∧ substr(c_1.lname, c_2.name) ∧ substr(c_1.fname, c_2.name) ⇒ c_1 ≡ c_2
Rule 3: substr(c_1.fname, c_2.name) ∧ substr(c_1.mname, c_2.name) ∧ substr(c_1.lname, c_2.name) ⇒ c_1 ≡ c_2
We observe that terms may use complex comparison functions: for instance, in the first rule, we first concatenate the three name components of c_1 and then compare the concatenated result to the name of c_2, using similarity (≈) as the comparison operator. This operator indicates that the two strings being compared need to be similar, which is, for instance, determined using one of the previously defined similarity measures. In the remaining two rules, substr(l, r) denotes that the string l is a substring of r.
Let us now compare all persons in PersDB1 with all persons in PersDB2. We observe that when comparing the two records with ssn = 123, Rule 1 is not satisfied because PersDB2 is missing the middle name. However, the second rule identifies these two representations to be duplicates because the social security numbers are equal and the first name and the last name in PersDB1 occur in the corresponding name field in PersDB2. Finally, Rule 3 fails to identify the duplicate again, because PersDB2 does not contain the middle name of Peter Miller. Applying these rules, we find two more duplicates: Rule 2 classifies both tuples with ssn = 345 as duplicates, and Rule 3 determines the tuples with ssn = 678 and ssn = 679 to be duplicates.
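To make the shape of such rules concrete, here is a Python sketch of the three rules of Example 3.12, representing the records of Figure 3.6 as dictionaries. It is an illustration under our own naming, not the authors' implementation: the similarity predicate (≈) is passed in as a parameter and could, for instance, be a thresholded Jaro-Winkler or Levenshtein similarity, and substr is modeled with Python's in operator.

def rule1(c1, c2, similar):
    # c1.ssn = c2.ssn and concat(fname, mname, lname) similar to c2.name
    full_name = " ".join(p for p in (c1.get("fname"), c1.get("mname"), c1.get("lname")) if p)
    return c1["ssn"] == c2["ssn"] and similar(full_name, c2["name"])

def rule2(c1, c2, similar):
    # c1.ssn = c2.ssn and both last and first name occur in c2.name
    return (c1["ssn"] == c2["ssn"]
            and c1["lname"] in c2["name"]
            and c1["fname"] in c2["name"])

def rule3(c1, c2, similar):
    # first, middle, and last name all occur in c2.name
    return all(c1.get(k) and c1[k] in c2["name"] for k in ("fname", "mname", "lname"))

def is_duplicate(c1, c2, similar):
    """A pair is classified as a duplicate as soon as any positive rule fires."""
    return any(rule(c1, c2, similar) for rule in (rule1, rule2, rule3))

# p1 = {"ssn": 123, "fname": "Peter", "mname": "John", "lname": "Miller", "age": 46}
# p2 = {"ssn": 123, "name": "Peter Miller", "salary": "80k"}
# is_duplicate(p1, p2, lambda a, b: jaro_winkler_similarity(a, b) > 0.85)
# evaluates to True (Rule 2 fires for this pair, cf. Example 3.12).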
3.6.2 DUPLICATE PROFILES
In the equational theory discussed in Section 3.6.1, rules are used exclusively to classify duplicates. If none of the rules qualifies two candidates as duplicates, they are implicitly classified as non-duplicates. Let us refer to the rules of the equational theory as positive rules. In addition to these rules, it is possible to add so-called negative rules to the equational theory [Weis et al., 2008]. These classify pairs of candidates as non-duplicates and have the form:

P ⇒ c_1 ≢ c_2

where P is again a complex predicate as described in Section 3.6.1.
Example 3.13 Negative rule. Reusing the sample tables of Figure 3.6, a possible negative rule is

c_1.ssn ≠ c_2.ssn ⇒ c_1 ≢ c_2

This rule excludes candidates c_1 and c_2 from being duplicates if their social security numbers do not match.
We can combine positive and negative rules as a sequence of classifiers to form a duplicate profile, such that pairs not classified by classifier i are input to the subsequent classifier i + 1. It is interesting to note that the order of rules in a profile that mixes positive and negative rules affects the final output, in contrast to the equational theory, where the output is always the same no matter the order of the rules. As a simple example, assume we had applied the negative rule of Example 3.13 before Rule 3 of Example 3.12. Then, the pair of persons with respective ssn of 678 and 679 would be classified as non-duplicates, and they would not be passed on to the classifier using Rule 3, so they are not classified as duplicates.
Both positive rules and negative rules only use information that is present in both candidates. In the case where both c_1 and c_2 contribute additional, non-shared information, it is possible to further refine a duplicate profile based on this information [Doan et al., 2003]. Intuitively, the idea is to make a plausibility check of a duplicate classification to possibly revoke a previous classification based on the combined information of c_1 and c_2. This plausibility check can be done using hard constraints similar to the negative rules that revoke a preliminary duplicate classification. A second possibility is to further use soft constraints that do not decide by themselves but rather output their confidence in a pair being a duplicate. The individual confidence scores are finally combined to yield a final classification result.
C H A P T E R 4
Duplicate Detection Algorithms
As presented in Chapter 2, the problem of duplicate detection needs two components for its solution. After reviewing similarity measures to decide upon the duplicity of candidate pairs, we now turn to the second component: algorithms that decide which candidates to compare.
With Figure 2.2, we had motivated the quadratic search space as a matrix of candidate pairs. The goal of the algorithms presented in Section 4.1 is to reduce the number of comparisons while not compromising the quality of the result. If not all candidates are compared, there is the danger of missing some duplicates. Chapter 5 elaborates on this tradeoff. In Section 4.2, we discuss algorithms that are suited for data with complex relationships. Those algorithms have in common that they detect pairs of duplicates and form duplicate partitions by simply computing the transitive closure based on pairwise classification results. In Section 4.3, we discuss more sophisticated clustering algorithms to obtain duplicate partitions.
4.1 PAIRWISE COMPARISON ALGORITHMS
To avoid a prohibitively expensive comparison of all pairs of records, a common technique is to
carefully partition the records into smaller subsets. If we can assume that duplicate records appear
only within the same partition, it is sufcient to compare all record-pairs within each partition.
Two competing approaches are often cited: Blocking methods strictly partition records into disjoint
subsets, for instance, using zip codes as partitioning key. Windowing methods, in particular the
Sorted-Neighborhood method, sort the data according to some key, such as zip code, and then
slide a window of fixed size across the sorted data and compare pairs only within the window. Both
methods can be enhanced by running multiple partitioning/windowing passes over the data.
4.1.1 BLOCKING
Blocking methods pursue the simple idea of partitioning the set of records into disjoint partitions
(blocks) and then comparing all pairs of records only within each block [Ananthakrishna et al., 2002;
Baxter et al., 2003; Bilenko et al., 2006]. Thus, the overall number of comparisons is greatly reduced.
Given n records and b partitions, the average size of each partition is n/b. In each partition, each record pair must be compared, which yields a total number of pairwise comparisons of

b · (n/b) · (n/b − 1) / 2 = n · (n/b − 1) / 2 = 1/2 · (n²/b − n)
assuming all partitions are of equal size. Table 4.1 (see page 48) gives an overview of the computational complexity of the different methods compared to the exhaustive approach of comparing all pairs of records.
Figure 4.1 repeats the matrix from Figure 2.2 (p. 17). Assume that records 1 to 20 are sorted by the partitioning key, both horizontally and vertically. The candidate pairs after partitioning are shaded. Clearly, there are much fewer candidates than before, namely only 47 compared to the complete matrix with 190 pairs.
Figure 4.1: Matrix of duplicate candidates with blocking algorithm.
An important decision for the blocking method is the choice of a good partitioning predicate, which determines the number and size of the partitions. They should be chosen in a manner that potential duplicates appear in the same partition. For example, for CRM applications, a typical partitioning is by zip code or by the first few digits of zip codes. The underlying assumption is that duplicates have the same zip code, i.e., there is no typo in the zip code and the customer has not moved from one zip code to another. If two duplicate records have the same zip code, they appear in the same partition and thus can be recognized as duplicates. Other partitionings might be by last name or some fixed-sized prefix of them, by street name, by employer, etc. In general, partitions of roughly the same and predictable size are preferable. For instance, partitioning by the first letter of the last name yields several very small partitions (q, x, ...) and some very large partitions (m, s, ...). Especially comparing all pairs in these large partitions can make the problem infeasible.
To drop the assumption that duplicates have the same partitioning key, a multi-pass method
can be employed. That is, the blocking algorithm is run multiple times, each time with a different
partitioning key. The chance that a duplicate pair does not appear together in at least one of the
partitions is very low. It would have to contain errors in every attribute that is used for partitioning.
After the multiple runs, the transitive closure is formed over all detected duplicates because
duplicity is inherently a transitive relation, and thus more correct duplicate pairs can be reported.
Even within a single run, the transitive closure returns duplicates that were missed by the similarity
measure: With an edit distance threshold of 1, Maine St. is close to Main St., which, in turn, is close
to Moin St. Thus, even though all three strings might appear within the same block, only two pairs
are recognized as duplicates, and the third pair is discovered only through transitivity.
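A minimal sketch of (multi-pass) blocking in Python could look as follows. Records are arbitrary dictionaries, and the partitioning keys are passed as functions, e.g., the zip code or a fixed-size prefix of the last name; the attribute names in the usage comment are hypothetical.

from collections import defaultdict
from itertools import combinations

def blocking_pairs(records, partition_keys):
    """Candidate pairs produced by one or several blocking passes.

    partition_keys is a list of functions, one per pass, each mapping a
    record to its partitioning key; all pairs within a block are candidates."""
    candidates = set()
    for key in partition_keys:
        blocks = defaultdict(list)
        for idx, record in enumerate(records):
            blocks[key(record)].append(idx)
        for block in blocks.values():
            # compare all pairs of records within a block
            candidates.update(combinations(block, 2))
    return candidates

# Example keys for a two-pass run (hypothetical attribute names):
#   blocking_pairs(records, [lambda r: r["zip"], lambda r: r["last_name"][:3]])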
4.1.2 SORTED-NEIGHBORHOOD
Windowing methods, such as the Sorted-Neighborhood method, are slightly more elaborate than blocking methods. Hernández and Stolfo [1995, 1998] describe the Sorted-Neighborhood Method (SNM), which is divided into three phases. In the first phase, a sorting key is assigned to each record. The key does not have to be unique and can be generated by concatenating characters (or substrings of values) from different attributes. It is useful to carefully select values based on the probability of errors. For instance, it is more likely to err in a vowel than in a consonant, and it is less likely to err in the first letter of a name. Thus, a key for a customer record could be defined as:

first 3 consonants of last name | first letter of last name | first 2 digits of zip code
In the second phase, all records are sorted according to that key. As in the blocking method, the assumption is that duplicates have similar keys and are thus close to each other after sorting. The first two phases are comparable to the selection of a partitioning predicate and the actual partitioning in the blocking method.
The final, third phase of SNM slides a window of fixed size w across the sorted list of records. All pairs of records that appear in the same window are compared. Thus, when sliding the window, one record moves out of the window and a new record enters it. Only this new record must be compared with the remaining w − 1 records. The size of the window represents the trade-off between efficiency and effectiveness; larger windows yield longer runtimes but detect more duplicates. In experiments, window sizes between 10 and 30 have been reported to be effective. Figure 4.2 shows the candidate pair matrix with those pairs shaded that are compared by SNM with a window size of 4. Again, the number of comparisons is reduced from 190 to 54.
The sorting key should be chosen distinct enough so that the number of records with the same key is not greater than the window size. Otherwise, not all records with the same key are compared, and duplicates may be overlooked. A more distinct key enables a more fine-tuned sorting. Also, the first few characters of the key are obviously more important than the last few. Thus, one should choose attributes that are likely to contain few errors for the first characters of the sorting key.
Figure 4.2: Matrix of duplicate candidates with the Sorted-Neighborhood algorithm.
As opposed to the blocking method, SNM requires a transitive closure step not only due to the nature of the similarity measure but also because duplicates appear in different partitions/windows: Figure 4.3 shows a situation where two duplicate pairs are found within different windows, and only the transitive closure produces the complete set of duplicate pairs. Records r3, r3′, and r3″ are all duplicates, but r3 and r3″ never appear within the same window of size 4. Only the transitive closure using their mutual duplicity with r3′ reveals that they, too, are duplicates.
r1 r2 r3 r4 r5 r3′ r7 r3″ r9 r10 r11 r12 r13 r14 r15
Figure 4.3: Transitive closure for SNM even after a single pass.
As for the blocking method, there is a chance that the sorting characters contain errors. To avoid mis-sorts, multi-pass variants of SNM produce multiple keys and perform the sorting and windowing multiple times. As with the blocking method, the transitive closure is finally calculated. The Multipass Sorted-Neighborhood method is summarized here:
1. Choose a set of keys K.
2. For each key k ∈ K:
   (a) Sort records according to k.
   (b) Slide a window over the records, comparing all pairs within a window.
3. Determine the transitive closure.
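The three phases can be sketched in a few lines of Python. As in the blocking sketch above, the sorting keys are passed as functions and the names are ours; the transitive closure of step 3 still has to be computed over the duplicates found among the returned candidate pairs.

def sorted_neighborhood_pairs(records, sorting_keys, window=10):
    """Multipass Sorted-Neighborhood: for each key, sort the records and
    compare every record with its w-1 predecessors in the sorted order."""
    candidates = set()
    for key in sorting_keys:
        order = sorted(range(len(records)), key=lambda idx: key(records[idx]))
        for pos, idx in enumerate(order):
            for prev in order[max(0, pos - window + 1):pos]:
                candidates.add((min(prev, idx), max(prev, idx)))
    return candidates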
The implementation of SNM is usually done with the help of a DBMS: Using a surrogate key for the records, key creation generates a temporary table with the surrogate key column along with one column for each sorting key. In each pass, this temporary table is sorted by the corresponding sorting key column and joined with the main table to fetch the records to be compared. Thus, it can be expected that each run needs only three read-passes over the data: one to create the keys, a second to sort the temporary table, and a third to join and compare the records within the window.
Compared to the exhaustive method, the number of comparisons is greatly reduced, namely to w · n. If w < log n, the comparisons are dominated by the sorting phase, which requires O(n log n) comparisons. As opposed to the blocking method, the number of comparisons is accurately predictable.
Research has produced many variants of SNM: Monge and Elkan [1997] adopt the general SNM approach but propose both a domain-independent similarity measure and the union-find data structure to efficiently manage candidate pairs and detected duplicate groups. Their method defines a representative of each already detected duplicate group. Records are first compared only to that representative, and thus many comparisons are avoided. Only if the similarity to the representative is high enough does a more complete comparison with all members of the duplicate group commence. An SNM variant for nested XML data is presented by Puhlmann et al. [2006] and described among other algorithms for non-relational data in Section 4.2.1.
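The union-find data structure mentioned above is also the natural way to implement the transitive closure step of blocking and SNM. The following sketch (our own illustration, not the implementation of Monge and Elkan [1997]) maintains duplicate groups while pairwise duplicate decisions come in.

class UnionFind:
    """Union-find over record indices; the transitive closure of the inserted
    duplicate pairs emerges from the union operations."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def duplicate_groups(num_records, duplicate_pairs):
    """Partition record indices into duplicate groups given pairwise results."""
    uf = UnionFind(num_records)
    for a, b in duplicate_pairs:
        uf.union(a, b)
    groups = {}
    for idx in range(num_records):
        groups.setdefault(uf.find(idx), []).append(idx)
    return list(groups.values())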
4.1.3 COMPARISON
Blocking and Sorted-Neighborhood have much in common. Both aim at reducing the number
of comparisons by making intelligent guesses as to which pairs of records have a chance of being
duplicates. Both rely on some intrinsic ordering of the data and the assumption that records that
are close to each other with respect to that order have a higher chance of being duplicates than
other pairs of records. Their closeness is perhaps best characterized by the work of Yan et al. [2007], in which they present an adaptive sorted neighborhood method, which in fact (and inadvertently?) turns out to be a blocking method. A deeper comparison and a generalization of both methods is
presented by Draisbach and Naumann [2009]. Table 4.1 shows the computational complexities of
the different methods and compares them to the full enumeration of all pairs of records.
Finally, it has been noted that a more effective way of reducing the search space is to prune entire records, not just pairs. If it can be easily shown for a particular record that its similarity to all other records is below the threshold, all candidate pairs that involve this record can be eliminated. However, checking this condition is not trivial and is only applicable in a few cases.
Table 4.1: Complexity analysis with number of partitions b, window size w, and number of records n (from Draisbach and Naumann [2009])

                          Blocking             Windowing            Full enumeration
Number of comparisons     1/2 · (n²/b − n)     (w − 1)(n − w/2)     (n² − n)/2
Key generation            O(n)                 O(n)                 n/a
Sorting                   O(n log n)           O(n log n)           n/a
Detection                 O(n²/b)              O(wn)                O(n²)
Overall                   O(n(n/b + log n))    O(n(w + log n))      O(n²)
4.2 ALGORITHMS FOR DATA WITH COMPLEX RELATIONSHIPS
In the previous section, we discussed algorithms that efficiently determine duplicate pairs. They have in common that they classify a pair of candidates as duplicate or non-duplicate solely by using the object descriptions of the two candidates. That is, pairs of candidates are compared and classified independently of other candidates. In Section 2.3, we discussed that we can use relationship descriptions in addition to object descriptions during comparisons.
In this section, we describe algorithms that consider relationships. In Section 4.2.1, we discuss algorithms specialized to hierarchical relationships, including XML data. These algorithms not only use relationship descriptions to potentially improve the effectiveness of duplicate detection, but they also prune pairwise comparisons based on relationships. In Section 4.2.2, we then show how relationships between candidates that result in a graph structure can potentially improve effectiveness.
4.2.1 HIERARCHICAL RELATIONSHIPS
All algorithms that exploit hierarchical relationships have in common that they assume that a given candidate type can occur at exactly one level of the hierarchy (although several candidate types may appear on the same level), i.e., all addresses of a customer appear in the same position in a hierarchy, but at that same level, there may also be other data, such as phone numbers.
One possibility to exploit the hierarchical structure is to traverse the tree in a top-down fashion [Ananthakrishna et al., 2002; Weis and Naumann, 2004]. That is, candidates at the topmost level l_1 are compared before we proceed to level l_2, and so on. To prune comparisons, algorithms proceeding in top-down order assume that the parent-child relationships reflect 1:N relationships of the real world. For instance, assume that cities are nested under countries. This nesting reflects a 1:N relationship of the real world, as a city is usually not split over several countries, whereas a country in general has more than one city. Based on the 1:N relationship assumption, it holds that two candidates on level l_{i+1} may only be duplicates if their parents on level l_i are duplicates (or are the same parent). Indeed, it does not make sense to compare cities from different countries. As a consequence, when traversing the data in a top-down fashion, we can prune comparisons based on duplicate classifications previously performed on ancestors.
The SXNM algorithm [Puhlmann et al., 2006] does not assume a 1:N relationship between parent and child elements and, therefore, does not apply the same pruning principle. Instead, it traverses the hierarchy from bottom to top to, for instance, also detect duplicate authors that are nested under non-duplicate books. Comparisons are pruned using an extension of the Sorted-Neighborhood method. When proceeding from level l_i to l_{i-1}, SXNM uses the knowledge of duplicates in children, which are part of the relationship description of candidates on level l_{i-1}, to compute similarity, for instance, using one of the specialized XML similarity measures presented in Section 3.4.
4.2.2 RELATIONSHIPS FORMING A GRAPH
As the algorithms on hierarchical data show, relationship descriptions allow us to propagate comparison results from one pairwise comparison to another, related comparison. Indeed, in the top-down traversal, the fact that two candidates are not duplicates is propagated to the child candidates, whereas in SXNM, children similarities are propagated to the parent level.
In the general case, relationships between candidates can form a graph. Therefore, we refer to algorithms performing duplicate detection on such relationship graphs as graph algorithms. In this section, we discuss a general framework for graph algorithms [Weis, 2008] that applies to several algorithms proposed in the literature [Bhattacharya and Getoor, 2007; Dong et al., 2005].
To better illustrate the concepts underlying graph algorithms, we use the following example.
Example 4.1 Relationship graph. Assume we have candidates of type author, paper, and venue.
Relationship descriptions capture the following relationships:
The set of papers appearing in a venue is descriptive of that venue.
The set of authors of a paper is descriptive of that paper.
In Figure 4.4(a), we show data for two papers with the associated author and venue information.
This data may have been extracted from two services that use different representations of authors
and venues. We added labels to uniquely identify publications, authors, and venues. Note that a3 is recognized as a shared author between both publications because the author name is exactly equal. A sample graph that illustrates how the candidates of Figure 4.4(a) relate to each other based on their relationship descriptions is depicted in Figure 4.4(b). In this graph, an edge points from candidate c1 to candidate c2 if c2 is part of the relationship description of c1.
In the above example, the relationship description of a candidate of a given type T consists
of neighboring candidates in the relationship graph that may help identify duplicates of a candidate
of type T . For instance, the relationship description of candidates of type venue consists of papers
appearing in that venue. If papers are detected to be duplicates or if they are very similar, then the
likelihood increases that their respective venues are duplicates as well. As discussed in Section 2.3.1,
defining relationship descriptions requires domain knowledge, and we assume that a domain expert was able to define such dependencies at the schema level. That is, the dependency is defined by knowledge of the form "venue candidates depend on the paper candidates that are published in that venue."
Information extracted for a publication P1:
p1: Duplicate record detection: A survey
a1: A.K. Elmagarmid
a2: P.G. Ipeirotis
a3: V.S. Verykios
v1: Transactions on knowledge and data engineering

Information extracted for a publication P2:
p2: Duplicate Record Detection
a4: Elmagarmid
a3: P.G. Ipeirotis
a5: Verykios
v2: TKDE

Figure 4.4: Relationships between candidates of type publication (p), author (a), & venue (v): (a) sample data, (b) sample relationship graph with edges v1→p1, v2→p2, p1→a1, a2, a3, and p2→a3, a4, a5
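To make the instance-level view concrete, the following sketch encodes the graph of Figure 4.4(b) as a simple adjacency dictionary; this data structure and the helper function are illustrative choices, not part of any particular algorithm.

# Edge c -> c2 means that c2 is part of the relationship description of c:
# venues depend on their papers, papers depend on their authors.
relationship_graph = {
    "v1": ["p1"],
    "v2": ["p2"],
    "p1": ["a1", "a2", "a3"],
    "p2": ["a4", "a3", "a5"],
}

def relationship_description(candidate, graph=relationship_graph):
    # Candidates whose comparison results may be propagated to `candidate`.
    return graph.get(candidate, [])

print(relationship_description("v1"))   # ['p1']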
Based on relationship descriptions defined at the schema level, we can determine relationships at the instance level to obtain a relationship graph similar to the graph depicted in Figure 4.4(b). Based on this graph, graph algorithms for duplicate detection then generally start with an iterative process that compares (all) pairs of candidates. This iterative phase can be divided into three individual steps, namely the retrieval step, the classification step, and the update step. Before we formally describe the process, we illustrate the general idea by continuing our example from the publications domain.
Example 4.2 Iterative phase of graph algorithms. Given all candidates depicted in Figure 4.4,
we need to classify the following candidate pairs as duplicates or non-duplicates:
(v1, v2), (p1, p2), (a1, a2), (a1, a3), (a1, a4), (a1, a5),
(a2, a3), (a2, a4), (a2, a5), (a3, a4), (a3, a5), (a4, a5)
As candidates depend on related candidates in their relationship description, pairs of candidates depend on pairs of related candidates. For instance, the candidate pair that consists of venues v1 and v2 depends on the comparison result of candidate pair (p1, p2). Let the pairs be processed in the order listed above. Then, when (v1, v2) are compared, we do not have enough evidence that the two venue candidates are duplicates: the strings "Transactions on knowledge and data engineering" and "TKDE" are quite different. However, once authors and papers have been classified as duplicates, we have enough evidence to classify the venues as duplicates, too. So, as we compare paper candidates, we propagate their similarities and duplicate classifications to venues, and we need to compare the venues again.
An interesting property of all graph algorithms for duplicate detection is that they may compare a pair of candidates more than once, based on additional knowledge gained during related comparisons.
This propagation of similarities or duplicate classifications allows us to detect hard cases where object descriptions alone do not suffice to detect duplicates. For instance, no similarity measure based on object descriptions alone is capable of identifying that the venue TKDE is the same as Transactions on knowledge and data engineering. However, comparing a pair more than once may compromise efficiency. Therefore, different algorithms devise different heuristics to reduce the number of pairwise comparisons by choosing a smart comparison order. For instance, if we change the order of comparisons described in Example 4.2 to
(a1, a2), (a1, a3), (a1, a4), (a1, a5), (a2, a3), (a2, a4),
(a2, a5), (a3, a4), (a3, a5), (a4, a5), (p1, p2), (v1, v2)
we can verify that no comparison needs to be performed more than once, even though we propagate similarities. To avoid re-comparisons, the order may need to be updated before the next pair is compared, so we essentially maintain a priority queue that needs to be updated every time information is propagated.
Let us now define the general process of pairwise comparisons in graph algorithms more formally. The input to the iterative phase of graph algorithms is a relationship graph and an initial priority queue PQ that contains all candidate pairs we want to compare. The order of PQ is defined by an algorithm-specific heuristic that aims at avoiding re-comparisons of candidate pairs.
Step 1: Retrieval. Retrieve the first pair in PQ, as determined by the heuristic used for ordering.

Step 2: Classification. The retrieved pair of candidates (c, c′) is compared and classified as duplicate or non-duplicate. The result concerns all candidate pairs (d, d′) whose relationship descriptions RD(d) and RD(d′) contain c and c′, respectively. Essentially, this means that we propagate the information "c is a (non-)duplicate of c′" to the parents of c and c′ in our graph representation. Alternatively, we may also propagate the similarity between c and c′, which is taken into account when d and d′ are compared later.

Step 3: Update. When a similarity or a duplicate classification needs to be propagated after the comparison of (c, c′), the affected pairs (d, d′) are (re-)inserted into PQ, and the order of PQ is updated accordingly.
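A minimal sketch of this three-step loop is shown below. The priority, classify, and affected_pairs callbacks are placeholders that an actual algorithm would instantiate with its own ordering heuristic, classifier, and relationship graph; they are assumptions made for the sketch.

import heapq

def iterative_phase(pairs, priority, classify, affected_pairs):
    # priority(pair, duplicates): smaller value = compared earlier (ordering heuristic).
    # classify(pair, duplicates): True if the pair is a duplicate given current knowledge.
    # affected_pairs(pair): pairs whose relationship descriptions contain the compared candidates.
    duplicates, compared = set(), set()
    pq = [(priority(p, duplicates), p) for p in pairs]
    heapq.heapify(pq)
    while pq:
        _, pair = heapq.heappop(pq)                  # Step 1: retrieval
        if pair in compared:
            continue
        compared.add(pair)
        if classify(pair, duplicates):               # Step 2: classification
            duplicates.add(pair)
            for dep in affected_pairs(pair):         # Step 3: update
                compared.discard(dep)                # the pair may be compared again
                heapq.heappush(pq, (priority(dep, duplicates), dep))
    return duplicates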
Once a candidate is assigned a cluster, it is never considered again.
Example 4.6 Consider again Cluster 3 of Figure 4.6(a). When sorting edges in descending order of their similarity, the first similarity score to be considered in the subsequent scan is 0.9. Assume that edge (11, 10) happens to be scanned first, and that we decide that candidate 11 is the new center. Then 10 is obviously part of the cluster centered at 11. The next edge to be considered is (10, 9). Candidate 10 is already part of a cluster, so we just leave it there. Candidate 9, on the other hand, occurs for the first time, so it is set to be the center of the next cluster. The next edge we scan is (11, 9). Both candidates it connects are already part of a cluster. Next, edge (10, 12) is processed, yielding the next center, i.e., candidate 12. We note that all candidates are now part of a cluster with a center, so we can stop the computation. Figure 4.7(a) shows the final result of processing all connected components of Figure 4.6(a) using the center approach. We observe that the initial Cluster 2 is now split into three separate clusters (2.1, 2.2, and 2.3). The same is true for Cluster 3, which is now divided into clusters 3.1, 3.2, and 3.3.
Figure 4.7: Clusters using center (a) and merge-center (b). In (a), the initial Clusters 2 and 3 are split into Clusters 2.1, 2.2, 2.3 and 3.1, 3.2, 3.3, respectively; in (b), Clusters 1, 2, 3, and 4 remain intact.
In the example above, we observe that the order of edges with equal similarities, as well as the candidate selected as the center when processing a pair, potentially affects the selection of center nodes and the final partitioning. Indeed, when processing the first pair (10, 11) of Cluster 3, had we chosen candidate 10 as the center instead of candidate 11, the final result would have been equal to the initial Cluster 3. An extension of the simple center-based algorithm just described that attenuates these effects is to merge two clusters whenever the center of one cluster is similar to the center of the other cluster. A possible final result of this variant, called merge-center, is depicted in Figure 4.7(b).
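One way to realize the merge-center idea is with a union-find structure over cluster centers. In the sketch below, clusters are merged whenever an edge links a candidate to a cluster center; this trigger condition and the bookkeeping are assumptions made for the sketch, not the published algorithm's exact rule.

def merge_center_clustering(edges):
    # Same scan as the center approach, plus merging via union-find on centers.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    cluster_of, centers = {}, set()
    for _, c1, c2 in sorted(edges, key=lambda e: e[0], reverse=True):
        for node, partner in ((c1, c2), (c2, c1)):
            if node not in cluster_of:
                if partner in centers:
                    cluster_of[node] = partner
                else:
                    cluster_of[node] = node
                    centers.add(node)
        if c1 in centers or c2 in centers:
            # One endpoint is a center: merge the clusters of both endpoints.
            parent[find(cluster_of[c1])] = find(cluster_of[c2])
    return {c: find(center) for c, center in cluster_of.items()}

Run on the hypothetical Cluster 3 edges used above, this variant keeps candidates 9, 10, 11, and 12 in a single cluster, which mirrors the behavior shown in Figure 4.7(b).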
In this section, we discussed several algorithms that determine duplicate clusters based on
a previously computed duplicate pair graph. These methods can be used as a post-processing step
after pairwise duplicate classication. Which of these algorithms is best suited highly depends on the
application domain, the distribution of duplicates, and how dirty these duplicates are. One reason
for this is that these approaches are solely based on global thresholds and parameters, and they do not
adjust to changing characteristics of the data or the initial clusters formed by pairwise classications.
For instance, using the merge-center algorithm, we see that both Cluster 2 and Cluster 3 remain
connected, although the shapes of the clusters differ significantly: in Cluster 3, the shortest path from any node to any other node is at most 2, whereas in Cluster 2, it varies between 1 and 4. In the
next section, we describe an algorithm that integrates both pairwise comparisons and clustering but
adjusts to local characteristics of data and clusters to potentially better partition the set of candidates
into duplicate clusters.
4.3.2 CLUSTERING ADJUSTING TO DATA & CLUSTER CHARACTERISTICS
The algorithm we present in this section determines duplicate clusters based on the observation
that duplicates of the same object usually have a small distance to each other, and they have only a
small number of other candidates within a small distance [Chaudhuri et al., 2005]. To capture these
properties, the criteria of compact set and sparse neighborhood are introduced.
We illustrate these two concepts in Figure 4.8. Intuitively, they define two growth spheres: in the inner sphere, which corresponds to the compact set, we find only candidates that are all mutually closest to each other; in the second sphere, we find very few candidates that are close to candidates in the inner sphere, thereby forming a sparse neighborhood. Note that closeness to a candidate is defined by a radius of p times the minimal distance of that candidate to its closest candidate.
We can interpret the radius of each sphere as a threshold that separates duplicates from non-duplicates. However, in contrast to algorithms performing pairwise classifications with a single fixed threshold, we can have different thresholds for each cluster! This makes the algorithm more robust to the chains of pairwise matches that yield two very dissimilar candidates as duplicates, as we observed in Section 4.3.1.
Before we describe the algorithm that determines duplicate clusters based on the compact set and sparse neighborhood properties, we need to define these two concepts more precisely and introduce some notation. Let us first define a compact set, denoted as CS. Essentially, a set of candidates is a compact set iff, for every pair of candidates c, c′ ∈ CS, the distance dist(c, c′) is among the I smallest distances from c to any other candidate, i.e., c′ is among the I nearest neighbors of c (and vice versa). Alternatively, we may also use a similarity threshold to define the minimum similarity that has to exist between any two candidates in a group. In the following, we only consider the first variant, where the cardinality of a group is limited by I.
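The two criteria can be checked directly for a small group of candidates, as in this sketch. The distance function dist, the limit I, and the growth factor p correspond to the symbols used above, while the brute-force scans are merely illustrative.

def is_compact_set(group, candidates, dist, I):
    # Every other member of `group` must be among the I nearest neighbors
    # of each member (checked symmetrically by iterating over all members).
    members = list(group)
    for c in members:
        ranked = sorted((x for x in candidates if x != c), key=lambda x: dist(c, x))
        nearest = set(ranked[:I])
        if any(other not in nearest for other in members if other != c):
            return False
    return True

def neighborhood_growth(c, candidates, dist, p):
    # Number of candidates inside the outer sphere: within p times the
    # distance from c to its closest candidate.
    d_nn = min(dist(c, x) for x in candidates if x != c)
    return sum(1 for x in candidates if x != c and dist(c, x) <= p * d_nn)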
The problem is solved in a two-phase algorithm, which we briefly describe next.
Phase 1: Nearest-neighbor list computation. The goal of this phase is to determine, for each candidate c, the set of nearest neighbors of c as well as its neighborhood growth. For each candidate c, we thus determine a triple (cid, nnList, ng), where cid uniquely identifies candidate c, nnList is a set of nearest neighbors of c, and ng is the neighborhood growth of c. The nearest-neighbor list of c consists of the I closest candidates to c.
The fact that candidates originate from records stored in a relational database allows the use
of indices to process queries of the form: given any candidate (record) c, fetch its nearest
neighbors. If no indices are available, nested loops join methods are used in this phase.
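Without an index, Phase 1 can be emulated by the following brute-force sketch, which produces one (cid, nnList, ng) triple per candidate; the quadratic scan merely stands in for the index lookups or nested-loops join mentioned above, and the distance function is a placeholder.

def nearest_neighbor_lists(candidates, dist, I, p):
    triples = []
    for cid, c in enumerate(candidates):
        # Distances from c to every other candidate.
        dists = sorted((dist(c, other), oid)
                       for oid, other in enumerate(candidates) if oid != cid)
        nn_list = [oid for _, oid in dists[:I]]                 # I closest candidates
        d_nn = dists[0][0]                                      # distance to nearest neighbor
        ng = sum(1 for d, _ in dists if d <= p * d_nn)          # neighborhood growth
        triples.append((cid, nn_list, ng))
    return triples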
Phase 2: Partitioning phase. The second phase uses the output of Phase 1 to partition the original
set of candidates into a minimum number of groups that correspond to compact sets with
sparse neighborhoods. The resulting partitions thus correspond to the nal result of duplicate
detection. The nal result is obtained in two steps, namely the construction step of compact
sets and the partitioning step.
Step 1: Compact pair construction. First, we determine whether the neighbor sets of varying sizes between 1 and I of two candidates c and c′ form compact pairs CS_i(c, c′). Essentially, CS_i signifies that the i closest neighbors of c and c′