Fuzzy Logic Assignment 1
Fuzzy Logic Assignment 1
Fuzzy Logic Assignment 1
Fuzzy
Logic
Assignment 1:
Practical
Tutors: Prof. Francisco Chiclana Parrilla, Dr Jenny Carter
Samuel Keays
1-1-2014
Contents
Background............................................................................................................ 2
Input Variables.................................................................................................... 4
Name delta...................................................................................................... 4
Address delta................................................................................................... 4
NI Number edit distance.................................................................................. 4
Home phone number edit distance................................................................. 5
Geographical location...................................................................................... 5
Gender............................................................................................................ 5
Output Variable................................................................................................... 6
Rules:.................................................................................................................. 6
Defuzzification.................................................................................................... 8
Experiments and Tweaking:................................................................................ 8
Conclusion........................................................................................................ 10
Bibliography......................................................................................................... 10
Background
The aim of this report is to present a fuzzy logic system which enables one to
produce reasonable estimates of how closely two addresses match by defining
rules that allow matching to take place.
Master Data Management (Haselden, 2006) is a type of software solution that
has developed rapidly and has been invested in by many large corporate and
government agencies. In essence it is a process whereby multiple database
sources often from widely different systems such as CRMs and product
catalogues are mapped together to form one single view of a customer or a
product. This can either be a registry system where the master database merely
points to records it is mastering (and sends messages to inform them of the need
to update or change), and physical (or repository) models where an actual
database model is constructed that is used as the underlying database models
for all other systems after the outlying systems have been processed in. (Wolter,
2007) This can either be in the form of a single batch file (initial load) or deltas
which are difference files between times (which is an especially important use
case in registry style MDM).
As an example of how this works: there may be an individual called Joseph
Ethelbert Bloggs. At a particular bank he has a current account and a loan. On
his current account he may be called Joe E Bloggs. On his loan he is simply
Joseph Bloggs. The task of the system is to match such entries to inform the
bank they are one and the same people. Naturally there are many more
datapoints that are taken into account when matching. National insurance
numbers for example may be particularly relevant. Names are often standardised
with the use of tables that map from common nicknames to their fullname.
Auxiliary tasks include linking households of people with the same address and
so forth. Most systems also have a degree of human interaction. So called data
stewards take those entries whose matching is uncertain by the system and
make a decision. Naturally any match can full into three tiers: automatic
matching, review matching and no match.
The matching itself often involves making decisions about who or what will be
matched together into one source. There are two main types of matching engine,
one uses rules based system, as for example Informaticas MDM solution. (Lira,
2013) Others such as IBMs initiate use probabilistic matching which usually uses
the mutual information of data in the system that matches to conclude how
important the overlap is. (Whei-Jen Chen, 2014, pp. 167-185) My concern will
primarily be with the rules matching, because this is where, to my mind there is
some overlap. Obviously this is a huge project potentially with many research
possibilities. For this report I want to demonstrate that some kind of simple fuzzy
logic rule can be used to match potential duplicates together with different
degrees of certainty based on five different correlations between the data in two
entries such as edit distance or number of fields for (say) address that match.
Note this is different to fuzzy string matching, which is already used in such
systems. Instead it will be necessary to create a rule set that defines how close
two names are by virtue of their edit distances, among other things. The system
will largely consist of rules based on these various metrics which will feed to a
fuzzy set which is a score of how closely they are matched.
An interesting question arises: is fuzzy logic necessary? The key benefit of fuzzy
logic is that it is possible to define uncertain concepts in a mathematical model.
In this case, the linguistic variable is close is by its very nature uncertain.
Existing systems use complex rule bases with detailed rules about edit distances
and various compound conditions. This therefore is an attempt to see if a small
subset of that rules base the names is reducible to a fuzzy system whose
membership functions can be defined which replicates these rules. If they can,
then there is the possibility that these very rules can be captured by some kind
of Computation with Words methodology. Instead of a user specifying that a
name must be within 2 edit distance, say, the user could simply state that the
two names should be very close with an appropriate relation being formed as a
consequence from the fuzzy system. This could potentially speed up the time it
takes to develop the rules set.
Input Variables
I have chosen 6 separate input variables:
Name delta
Address delta
NI Number edit distance
Home phone number edit distance
Geographical location
Gender
Note the assumption is that inputs have already been standardised, for examples
nicknames to names. I tried to stick to the principle that where possible the total
membership grade should equal one so as not to produce strange results with
the output function where parameters are needlessly truncated (Lilly, 2010, pp.
15-16). However, for the output function itself this rule had to be broken in order
to produce membership functions that gave useful centroids for the final results
(that is narrow Gaussians at the edge that give numbers close to 1).
Name delta
For this input I am specifying the number of different entries between two
aggregate names, made up of first name, middle name and last name. For
example, the name Samuel John Keays would produce a value of 1 with
Samuel Keays a value of 2 with Joseph Peter Keays and a value of 4 with
Elizabeth Mary Cook.
The initial membership functions for this will be a sigmoidal function with slope
value a = 3.75 and the intercept point c = 1.75. This is because having 1 name
match which in this case will produce a very weak membership grade, is not
particularly insightful. A lot of people share first names, middle names and last
names. When two names match there is a good possibility of a match (especially
if other inputs are fired). 3 names matching is a definite name match. The
sigmoidal function maps this as shown in the appendix. I took some
experimenting to get a slope with the desired degree of curvature but this was
eventually achieved. The second membership function is no match, which
naturally is symmetric around 2 but in the opposite direction.
Address delta
For this input I am specifying the number of different entries between two
aggregate addresses, made up of the address line 1, address line 2, address line
3, city and postcode.
The initial membership functions for this will be a sigmoidal function with slope
value a = 1 and the intercept point c = 2.5 to get the curves to sit flush with 1.
Again experimental evidence needed to produce the appropriate curve, and the
no match membership function was again symmetrical around 3 the mid-. The
curve has been defined as much smaller because even one match between
different addresses is likely to suggest a reasonable chance of a match, although
it may be a town. In a real system there would be more input values for specific
values of the address which would have additional rules, i.e. postcode would give
a high chance of matching, whereas town would give a low chance. For the sake
of this though only the address delta will be of concern.
Geographical location
Many systems have geolocation data. Great circle arc distances of addresses can
be used to match people. There will be three separate membership functions, a
sigmoid for identical from 0km to 3km to account for measure measure issues. I
will create another which is potentially moved based on the assumption that
people move typically within 25km or so, with a standard deviation of 20 km. For
this it seems to make sense to use a Gaussian membership function. Finally
there will be a sigmoid from identical to the end of the range which is different
location. Given I am assuming UK data a maximum distance of 1000km seems
reasonable.
Gender
This is categorical data which either has the value -1 (not match) or 1 (match),
with 0 as uncertain in cases where the data is unknown. This is a sigmoid with its
match in the centre with membership grade 0 for both at 0.5 for not sure, and 1
for sure on either side.
Output Variable
The output is a match decision. This is either no match, that it should be sent to
a human data steward or match. Initially these are set as three evenly spaced
Gaussians, on the assumption that the application of various rules should
produce nice evenly spaced output sets that will defuzzify in a relatively
straightforward manner due to their balanced nature, especially under centroid
systems where the height of the various Gaussian should linearly move the
defuzzified value around.
Rules:
The number of potential rules are:
R=l n
Where l is the number of linguistic inputs for each label (Ross, 2004, p. 275). This
produces:
2 x 2 x 2 x 2 x 3 x 2 = 96 rules, which whilst tractable can be reduced.
In order to reduce the number of rules necessary to calculate this it is necessary
instead to relate expert judgement of when matches take place. In other words
the specific scenarios that generate the three types of outcome are deduced
from my domain knowledge, translated into logical rules and processed as such.
Then it should be checked that every combination has some kind of effect, even
if this is deliberately to ignore inputs that is are not useful unless in conjunction
with another input.
Due to the relatively small number of membership functions in each (2 mainly)
the number of combinations is lower and it is not necessary to use and so much
so methods such as the Comb method or SVD decomposition are not necessary
and apply to Sugeno type inference systems typically in any case. However,
once the two main scenarios have been defined, the use of the Or operator
helps reduce to a smaller size the number of conditions that fail to match, which
would otherwise take up a large bulk of the defined rules. Primarily the rules
have been minimised by considering the conditions under which expert opinion
would categorise certain data, and then excluding the negative cases later on.
The following scenarios are regarded as being appropriate for an automatic
match, based on my own experience with configuring MDM systems:
There are four situations in which automatic matching should take place:
1. When name delta is match and address delta is match and NI edit
distance is match then Matching is Automatch
When name, address and NI number all match then it is almost certainly the
same person.
2. When name delta is match and address delta is match and
Geographical distance is identical then Matching is Automatch
When name, address and the geolocation data all match then it is almost
certainly the same person the geolocation data corroborates a likely match.
3. When name delta is match and address delta is match and Gender
is match then Matching is Automatch
When name, address and the gender data all match then it is almost certainly
the same person the gender data corroborates a likely match.
4. When name delta is match and NI edit distance is match and
Gender is match then Matching is Automatch
Both NI edit distance and gender suggest a match, because this combination
of three entries is picks up identical individuals who have moved addressed
using the NI number as a unique identifier
There are also five situations where it should be sent to the data steward:
5. When name delta is match then matching is datasteward
Any names that match fully should be investigated as they are highly
likely to be duplicates.
6. If address delta is match and Home Phone Number is match then
matching is data steward
7. If address delta is match and Geographical Distance is not
different location then matching is data steward
8. If Home Phone Number and Geographical distance match then
matching is data steward
These are all cases of someone possibly else living at the same address. But
it will need to be investigated by a human data steward to determine if they
are different people and that it is not due to data errors
9. NI number matches then matching is data steward
NI number is a unique identifier. Any match should be investigated by a
human user if everything else doesnt match.
12.
Geographical distance is not identical and home phone
number then Matching is no match
13.
NI Edit distance is no match and home phone number then
Matching is no match
These all demand that the person in question is identified uniquely or has the
same address before a penalty is incurred for the home phone number.
14.
Address delta is no match and geographical distance is not
identical then Matching is no match
This demands that the incorrect geographical distance only gives a penalty if
the address is not the same. Otherwise no rule is fired, which therefore acts
as a filter against otherwise useless information.
Defuzzification
I initially went with a centroid method, as ultimately I want the various peaks of
the Gaussians to be picked up and utilised.
The testing of the defuzzificaiton was relatively easy in the end. Because I have
membership functions that reach maximum at the beginning and the end and a
balancing middle membership function that rarely reaches maximum, I found
that smallest of maximums, mean of maximums and largest of maximums
produces sharp shifts between 0 and 1 as one or the other membership function
on the edges of the output range jumped to one, which in some sense rendered
the defuzzification far too coarse to the point of being meaningless. Because of
my triangular or Gaussian curves there was little to choose from between
centroid and bisector, so I used the centroid for its simplicity and because the
definition of a bisector negates some of what I was trying to achieve with the
narrow Gaussians by ignoring their moment contribution at their tips.
Conclusion
This was designed as a proof of concept on a small set of data. I feel that it has
been sufficiently useful that potentially it could be the basis for further
development and it offers a useful means to score matching in rules based
MDM systems without necessarily having to resort to complex and
computationally expensive probabilistic calculations and matching. Whether or
not one classifies two items as matching is essentially a linguistic variable that
will display a good deal of fuzzy uncertainty between people and seems ideally
suited to fuzzy modelling.
To implement it in reality it would be critical to have a way of mapping from user
requests or rules to fuzzy rules. This would not be a simple task, but basic
models could be mapped into member functions - these, for what it is worth,
could also be enhanced by linguistic hedges which would give the user finer
grained control over the matching rules than they have now - and simple user
requests into user rules. As I found calibrating the rules themselves is difficult
and this would be a major challenge for automating the system. Nonetheless it is
a least conceivably possible, especially considering there are libararies in Java
such as jFuzzyLogic whose basic features are not so different to MATLAB - which
could integrate into the necessary softwares API and be used as a plug-in.
Bibliography
Black, P. E. (2013). Levenshtein distance (Vol. Dictionary of Algorithms and Data
Structures ). U.S. National Institute of Standards and Technology.
Haselden, R. W. (2006, November). The What, Why, and How of Master Data
Management. Retrieved from Microsoft Developer Network:
http://msdn.microsoft.com/en-us/library/bb190163.aspx
Lilly, J. H. (2010). Fuzzy Control and Identification. Wiley.
Lira, J. (2013). External Match. Informatica University.
Ross, T. J. (2004). Fuzzy Logic: with Engineering Applications (2nd ed.).
Whei-Jen Chen, B. A. (2014). Building 360-Degree Information Applications. IBM
Redbooks.
Wolter, R. (2007, April). Master Data Management (MDM) Hub Architecture.
Retrieved from Microsoft Development Network:
http://msdn.microsoft.com/en-us/library/bb410798.aspx
match
1
nomatch
Degree of membership
0.8
0.6
0.4
0.2
0
0
0.5
1.5
name delta
2.5
Address Delta:
match
1
nomatch
Degree of membership
0.8
0.6
0.4
0.2
0
0
0.5
1.5
NI Edit distance
2
2.5
3
address delta
3.5
4.5
nomatch
1
match
Degree of membership
0.8
0.6
0.4
0.2
0
0
10
20
30
40
50
60
NI Edit distance
70
80
90
100
nomatch
1
match
Degree of membership
0.8
0.6
0.4
0.2
0
0
4
5
NI Edit distance
Geographical distance
differntlocation
identical
identical
Degree of membership
0.8
0.6
0.4
0.2
100
200
300
400
500
Geographical distance
600
700
800
900
1000
Gender
nomatch
match
Degree of membership
0.8
0.6
0.4
0.2
-1
-0.8
-0.6
-0.4
-0.2
0
Gender
0.2
0.4
Match decision:
nomatch
1
datasteward
automatch
Degree of membership
0.8
0.6
0.4
0.2
0
0
0.1
0.2
0.3
0.4
0.5
0.6
Matching
0.7
0.8
0.9
0.6
0.8
match
1
nomatch
1
nomatch
0.8
Degree of membership
0.8
Degree of membership
match
0.6
0.4
0.2
0.6
0.4
0.2
0
0
0.5
1.5
name delta
2.5
match
1
nomatch
3
4
5
6
7
8
Home Phone Number Edit distance
10
nomatch
11
match
0.8
0.6
Degree of membership
Degree of membership
0.8
0.4
0.2
0
0
0.5
1.5
2
2.5
3
address delta
3.5
4.5
0.6
0.4
0.2
0
-1
nomatch
1
-0.6
-0.4
-0.2
0
Gender
identical
moved
match
0.2
0.4
0.6
0.8
differntlocation
0.8
Degree of membership
0.8
Degree of membership
-0.8
0.6
0.6
0.4
0.4
0.2
0.2
0
4
5
NI Edit distance
100
200
300
400
500
Geographical distance
600
700
800
900
1000
nomatch
1
datasteward
automatch
Degree of membership
0.8
0.6
0.4
0.2
0
0
0.1
0.2
0.3
0.4
0.5
0.6
Matching
0.7
0.8
0.9
=
0
0
0
0
0
0
2
0
0
2
0
1
1
1
0
0
0
0
0
[
0 1 3 1 1
3 1 3 1 1
0 1 3 1 1
0 1 3 1 1
0 0 2 1 1
0 1 2 1 1
0 1 2 1 1
3 0 2 1 1
0 1 2 1 1
-1 0 2 1 1
0 2 1 1 2
0 0 1 1 1
-3 2 2 1 1
0 2 1 1 1
-3 0 1 1 2
3 0 2 1 1
3 0 2 1 1
0 2 2 1 1
2 1 2 1 1];
Appendix F Testing:
Sample
Matching
fields
Result in
system?
Matched with
fuzzy
inference?
Yes
James
Frederick
Abrhams
James
Frederick
Abrhams
Name 3
characters, NI
Number,
Address,
Gender
Automatch
James
Madrigalson
Fairbuck
James
Modrigalson
Fairbuck
Elizabeth
Podrigal
Ginzer
James Toons
Ginzer
Name 2
characters,
Address, ,
Gender
Data steward
referral
Yes
Name 1
character
Fail
No
Successful?
No
This example
led to the
rejigging of
the
membership
function to
peak at the
edges
Yes
Yes. Although
it did trigger
one of the
rules for data
stewardship
the
membership
value when
only 1 name
is different is
too low to
trigger which
is the desired
result