A Programmer's Guide to Data Mining: The Ancient Art of the Numerati
Thanks to ...
my son Adam
my wife Cheryl
Roper
also a huge thanks to all the photographers who put their work in the Creative Commons
INTRO
Preface
If you continue this simple practice every day, you
will obtain some wonderful power. Before you
attain it, it is something wonderful, but after you
attain it, it is nothing special.
Shunryu Suzuki
Zen Mind, Beginner's Mind.
Before you work through this book you might think that systems like Pandora, Amazon's
recommendations, and automatic data mining for terrorists, must be very complex and the
math behind the algorithms must be extremely complex requiring a PhD to understand. You
might think the people who work on developing these systems are like rocket scientists. One
goal I have for this book is to pull back this curtain of complexity and show some of the
rudimentary methods involved. Granted there are super-smart people at Google, the
National Security Agency and elsewhere developing amazingly complex algorithms, but for
the most part data mining relies on easy-to-understand principles. Before you start the book
you might think data mining is pretty amazing stuff. By the end of the book, I hope you will
be able to say, "nothing special."
The Japanese characters above, Shoshin, represent the concept of Beginner's Mind: the idea
of having an open mind that is eager to explore possibilities. Most of us have heard some
version of the following story (possibly from Bruce Lee's Enter the Dragon). A professor is
seeking enlightenment and goes to a wise monk for spiritual direction. The professor
dominates the discussion outlining everything he has learned in his life and summarizing
papers he has written. The monk asks, "Tea?" and begins to pour tea into the professor's cup.
And continues to pour, and continues to pour, until the tea overflows the teacup, the table,
and spills onto the floor. "What are you doing?" the professor shouts. "Pouring tea," the monk
says, and continues: "Your mind is like this teacup. It is so filled with ideas that nothing else
will go in. You must empty your mind before we can begin."
To me, the best programmers are empty cups, who constantly explore new technology
(noSQL, node-js, whatever) with open minds. Mediocre programmers have surrounded their
minds with cities of delusion: C++ is good, Java is bad, PHP is the only way to do web
programming, MySQL is the only database to consider. My hope is that you will find some of
the ideas in this book valuable and I ask that you keep a beginner's mind when reading it. As
Shunryu Suzuki says:
In the beginner's mind there are many possibilities,
In the expert's mind there are few.
People knew your likes and dislikes, your health, the state of your marriage. For better or
worse, it was a personalized experience. And this highly personalized life in the community
was true throughout most of the world.
Let's jump ahead one hundred years to the 1960s. Personalized interactions are less likely but
they are still present. A regular coming into a local bookstore might be greeted with "The new
James Michener is in"-- the clerk knowing that the regular loves James Michener books. Or
the clerk might recommend to the regular The Conscience of a Conservative by Barry
Goldwater, because the clerk knows the regular is a staunch conservative. A regular customer
comes into a diner and the waitress says "The usual?"
Even today there are pockets of personalization. I go to my local coffee shop in Mesilla and
the barista says "A venti latte with an extra shot?" knowing that is what I get every morning. I
take my standard poodle to the groomers and the groomer doesn't need to ask what style of
clip I want. She knows I like the no frills sports clip with the German style ears.
But things have changed since the small towns of 100 years ago. Large grocery stores and big
box stores replaced neighborhood grocers and other merchants. At the start of this change,
choices were limited. Henry Ford once said "Any customer can have a car painted any color
that he wants so long as it is black." The record store carried a limited number of records; the
bookstore carried a limited number of books. Want ice cream? The choices were vanilla,
chocolate, and maybe strawberry. Want a washing machine? In 1950 you had two choices at
the local Sears: the standard model for $55 or the deluxe for $95.
over 100,000 titles

Want to buy a laptop? When I type laptop into the Amazon search box I get 3,811 results.
But how to find stuff?
Years ago, in that small town, our friends helped us find
stuff. That bolt of fabric that would be perfect for us; that
new novel at the bookstore; that new 33 1/3 LP at the
record store. Even today we rely on friends to help us
find some relevant stuff.
We used experts to help us find stuff. Years ago Consumer Reports could evaluate all the
washing machines sold (all 20 of them) or all the rice cookers sold (all 10 of them) and make
recommendations. Today there are hundreds of different rice cookers available on Amazon
and it is unlikely that a single expert source can rate all of them. Years ago, Roger Ebert
would review virtually all the movies available. Today about 25,000 movies are made each
year worldwide. Plus, we now have access to video from a variety of sources. Roger Ebert, or
any single expert, cannot review all the movies that are available to us.
We also use the thing itself to help us find stuff. For example, I owned a Sears washing
machine that lasted 30 years, so I am going to buy another Sears washing machine. I liked one
album by the Beatles, so I will buy another, thinking chances are good I will like that too.
These methods of finding relevant stuff (friends, experts, the thing itself) are still present
today, but we need some computational help to transform them into the 21st century, where
we have billions of choices. In this book we will explore methods of aggregating people's likes
and dislikes, their purchasing history, and other data, exploiting the power of social networks
(friends), to help us mine for relevant stuff. We will examine methods that use attributes of
the thing itself. For example, I like the band Phoenix. The system might know attributes of
Phoenix: that it uses electric rock instrumentation, has punk influences, and has a subtle use of
vocal harmony. It might recommend to me a similar band that has similar attributes, for
example, The Strokes.
My father belonged to the United Auto Workers' Union. Around election time I remember
the union representative coming to our house to remind my father what candidates to vote
for:
Hey Syl, how are the wife and kids? Now let me tell you why you
should vote for Frank Zeidler, the Socialist candidate for mayor...
This individualized political message changed to homogeneous ads during the rise of
television. Everyone got the exact same message. A good example of this is the famous Daisy
television ad in support of Lyndon Johnson (a young girl pulling petals off a daisy while a
nuclear bomb goes off in the background). Now, with elections determined by small margins
and the growing use of data mining, individualization has returned. You are interested in a
woman's right to choose? You might get a robo-call directed at that very issue.

Frank Zeidler was the Socialist mayor of Milwaukee from 1948 to 1960.
A friend is visiting from Europe. I know she is a vegetarian and I can use that information to
predict she would not like the local rib joint. People are good at making models and making
predictions. Data mining expands this ability and enables us to handle large quantities of
information (the 150 million people in the Baker quote above). It enables the Pandora Music
Service to tailor a music station to your specific musical preferences. It enables Netflix to
make specific personalized movie recommendations for you.
Today it is not unusual to be doing data mining on terabytes of information. Google has over
5 petabytes (that's 5,000 terabytes) of web data. In 2006 Google released a dataset to the
research community based on one trillion words. The National Security Agency has call
records for trillions of phone calls. Acxiom, a company that collects information (credit card
purchases, telephone records, medical records, car registrations, etc.) on 200 million adults
in the US, has amassed over 1 petabyte of data.
a 1 petabyte server shipping container
Robert O'Harrow, Jr., author of No Place to Hide, in an effort to help us grasp how much
information is 1 petabyte says it is the equivalent of 50,000 miles of stacked King James
Bibles. I frequently drive 2,000 miles between New Mexico and Virginia. When I try to imagine
bibles stacked along the entire way that seems like an unbelievable amount of data.
The Library of Congress has around 20 terabytes of text. You could store the entire collection
of the Library of Congress on a few thousand dollar's worth of hard drives! In contrast,
Walmart has over 570 terabytes of data. All this data just doesn't sit thereit is constantly
being mined, new associations made, patterns identified. Tera-mining.
Throughout this book we will be dealing with small datasets. That's a good thing. We don't want
our algorithm to run for a week only to discover we have some error in our logic. The biggest
dataset we will use is under 100MB; the smallest just tens of lines of data.
I try to strike a balance between hands-on, nuts-and-bolts discussion of Python data mining
code that you can use and modify, and the theory behind the data mining techniques. To try
to prevent the brain freeze associated with reading theory, math, and Python code, I tried to
stimulate a different part of your brain by adding drawings and pictures.
Peter Norvig, Director of Research at Google, had this to say in his great Udacity course,
Design of Computer Programs:
I'll show you and discuss my solution. It's important to note, there is
more than one way to approach a problem. And I don't mean that my
solution is the ONLY way or the BEST way. My solutions are there to
help you learn a style and some techniques for programming. If you
solve problems a different way, that's fine. Good for you.

All the learning that goes on happens inside of your head. Not inside of
my head. So what's important is that you understand the relation
between your code and my code, that you get the right answer by
writing out the solution yourself and then you can examine my code and
maybe pick out some pointers and techniques that you can use later.
Part of the usefulness of this book is the accompanying Python code and the datasets. I think
the inclusion of both these makes it easier for the learner to understand key concepts, but at
the same time, not shoe-horn the learner into a scripted exploration.
comprehensive and provide an in-depth analysis of data mining theory and practice. Again, I
recommend books in this category. I wrote this book to fill a gap. It's a book designed for
people who love to programhackers.
Eeeks!
The book has math formulas but I try to
explain them in a way that is intelligible
to average programmers, who may have
forgotten a hunk of the math they took in
college.
s(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}
If that doesn't convince you, this book is also free (as in no cost) and free as in you can share
it.
What's with the "Ancient Art of the Numerati" part of the title?
In June of 2010 I was trying to come up with a title for this book. I like clever titles, but
unfortunately, I have no talent in the area. I recently published a paper titled Linguistic
Dumpster Diving: Geographical Classification of Arabic Text (yep, a data mining paper). I
like the title and it is clever because it fits with the content of the paper, but I have to confess
my wife came up with the title. I co-wrote a paper Mood and Modality: Out of the theory and
into the fray. My co-author Marjorie McShane came up with the title. Anyway, back to June,
2010. All my clever title ideas were so vague that you wouldn't have a clue what the book was
about. I finally settled on A Programmer's Guide to Data Mining as part of the title. I believe
that bit is a concise description of the content of the bookI intend the book be a guide for
the working programmer. You might wonder what is the meaning of the part after the colon:
The Numerati is a term coined by Stephen Baker. Each one of us generates an amazing
amount of digital data every day: credit card purchases, Twitter posts, Gowalla posts,
Foursquare check-ins, cell phone calls, email messages, text messages, etc.
You get up. The Matrix knows you boarded the subway at the Foggy Bottom Station at 7:10
and departed the Westside Station at 7:32. The Matrix knows you got a venti latte and a
blueberry scone at the Starbucks on 5th and Union at 7:45; you used Gowalla to check-in at
work at 8:05; you made an Amazon purchase for the P90X Extreme Home Fitness Workout
Program 13 DVD set and a chin-up bar at 9:35; you had lunch at the Golden Falafel.
The only folks who can make sense of the data we create are crack
mathematicians, computer scientists, and engineers. What will these Numerati
learn about us as they run us into dizzying combinations of numbers? First they
need to find us. Say you're a potential SUV shopper in the northern suburbs of
New York, or a churchgoing, antiabortion Democrat in Albuquerque. Maybe
you're a Java programmer ready to relocate to Hyderabad, or a jazz-loving,
Chianti-sipping Sagittarius looking for walks in the country and snuggles by the
fireplace in Stockholm, or--heaven help us--maybe you're eager to strap bombs
to your waist and climb onto a bus. Whatever you are--and each of us is a lot of
things--companies and governments want to identify and locate
you.
Baker
As you can probably guess, I like this term Numerati and Stephen Baker's description of it.
In the Amazon example, above, Amazon combines two bits of information to make a
recommendation. The first is that I viewed The Lotus Sutra translated by Gene Reeves; the
second, that customers who viewed that translation of the Lotus Sutra also viewed several
other translations.
The recommendation method we are looking at in this chapter is called collaborative
filtering. It's called collaborative because it makes recommendations based on other people --
in effect, people collaborate to come up with recommendations. It works like this. Suppose
the task is to recommend a book to you. I search among other users of the site to find one
that is similar to you in the books she enjoys. Once I find that similar person I can see what
she likes and recommend those books to you -- perhaps Paolo Bacigalupi's The Windup Girl.
              Amy    Bill    Jim
Snow Crash    5      2       1
I would like to recommend a book to the mysterious Ms. X who rated Snow Crash 4 stars and
The Girl with the Dragon Tattoo 2 stars. The first task is to find the person who is most
similar, or closest, to Ms. X. I do this by computing distance.
Manhattan Distance
The easiest distance measure to compute is what is called Manhattan Distance or cab driver
distance. In the 2D case, each person is represented by an (x, y) point. I will add a subscript
to the x and y to refer to different people. So (x1, y1) might be Amy and (x2, y2) might be the
elusive Ms. X. Manhattan Distance is then calculated by
| x1 - x2 | + | y1 - y2 |

(so the absolute value of the difference between the x values plus the absolute value of the
difference between the y values). So the Manhattan Distance for Amy and Ms. X is
| 5 - 2 | + | 5 - 4 | = 3 + 1 = 4.
Computing the distance between Ms. X and all three people gives us:
Amy
Bill
Jim
Amy is the closest match. We can look in her history and see, for example, that she gave five
stars to Paolo Bacigalupi's The Windup Girl and we would recommend that book to Ms. X.
Euclidean Distance
One benefit of Manhattan Distance is that it is fast to compute. If we are Facebook and are
trying to find who among one million users is most similar to little Danny from Kalamazoo,
fast is good.
Pythagorean Theorem
You may recall the Pythagorean Theorem from your distant educational past. Here, instead
of finding the Manhattan Distance between Amy and Ms. X (which was 4) we are going to
figure out the straight line, as-the-crow-flies, distance
c = \sqrt{a^2 + b^2}
This straight-line, as-the-crow-flies distance we are calling Euclidean Distance. The formula
is
\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
Recall that x1 is how well person 1 liked Dragon Tattoo and x2 is how well person 2 liked it;
y1 is how well person 1 liked Snow Crash and y2 is how well person 2 liked it.
Amy rated both Snow Crash and Dragon Tattoo a 5; The elusive Ms. X rated Dragon Tattoo
a 2 and Snow Crash a 4. So the Euclidean distance between Amy and Ms. X is

\sqrt{(5 - 2)^2 + (5 - 4)^2} = \sqrt{3^2 + 1^2} = \sqrt{10} = 3.16
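As a quick sketch of that calculation in Python (the function name is mine, just for illustration):

from math import sqrt

def euclidean_2d(x1, y1, x2, y2):
    """Straight-line (as-the-crow-flies) distance between two points."""
    return sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# Amy at (5, 5), Ms. X at (2, 4) -> 3.16
print(euclidean_2d(5, 5, 2, 4))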
N-dimensional thinking
Let's branch out slightly from just looking at rating two books (and hence 2D) to looking at
something slightly more complex. Suppose we work for an online streaming music service
and we want to make the experience more compelling by recommending bands. Let's say
users can rate bands on a star system 1-5 stars and they can give half star ratings (for
example, you can give a band 2.5 stars). The following chart shows 8 users and their ratings
of eight bands.
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Bill         2         3.5      4          -        2         3.5        -         3
Chan         5         1        1          3        5         1          -         -
Dan          3         4        4.5        -        3         4.5        4         2
Hailey       -         4        1          4        -         -          4         1
Jordyn       -         4.5      4          5        5         4.5        4         4
Sam          5         2        -          3        5         4          5         -
Veronica     3         -        -          5        4         2.5        3         -
The hyphens in the table indicate that a user didn't rate that particular band. For now we are
going to compute the distance based only on the bands they both reviewed. So, for
example, when computing the distance between Angelica and Bill, we will use the ratings for
Blues Traveler, Broken Bells, Phoenix, Slightly Stoopid, and Vampire Weekend. So the
Manhattan Distance would be:
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Bill         2         3.5      4          -        2         3.5        -         3
Difference   1.5       1.5                          3         2                    1

Manhattan Distance: 9
The Manhattan Distance row, the last row of the table, is simply the sum of the differences:
(1.5 + 1.5 + 3 + 2 + 1).
Computing the Euclidean Distance is similar. We only use the bands they both reviewed:
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Bill         2         3.5      4          -        2         3.5        -         3
Difference   1.5       1.5                          3         2                    1
Difference²  2.25      2.25                         9         4                    1

Sum of squares: 18.5
Euclidean Distance: 4.3
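Here is a sketch of how that computation might look in Python, using only the bands both users rated (the function name euclidean is my own, for illustration):

from math import sqrt

def euclidean(rating1, rating2):
    """Compute the Euclidean distance between two users, where
    rating1 and rating2 are dictionaries of the form {band: rating}.
    Only bands rated by both users count."""
    sumOfSquares = 0
    for key in rating1:
        if key in rating2:
            sumOfSquares += (rating1[key] - rating2[key]) ** 2
    return sqrt(sumOfSquares)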
Got it?
Try an example on your own...
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Bill         2         3.5      4          -        2         3.5        -         3
Chan         5         1        1          3        5         1          -         -
Dan          3         4        4.5        -        3         4.5        4         2
Hailey       -         4        1          4        -         -          4         1
Jordyn       -         4.5      4          5        5         4.5        4         4
Sam          5         2        -          3        5         4          5         -
Veronica     3         -        -          5        4         2.5        3         -
Euclidean distance between Hailey and Veronica:

\sqrt{(4 - 5)^2 + (4 - 3)^2} = \sqrt{1 + 1} = \sqrt{2} = 1.414

Euclidean distance between Hailey and Jordyn:

\sqrt{(4 - 4.5)^2 + (1 - 4)^2 + (4 - 5)^2 + (4 - 4)^2 + (1 - 4)^2} = \sqrt{0.25 + 9 + 1 + 0 + 9} = \sqrt{19.25} = 4.387
A flaw
It looks like we discovered a flaw with using these distance measures. When we computed the
distance between Hailey and Veronica, we noticed they only rated two bands in common
(Norah Jones and The Strokes), whereas when we computed the distance between Hailey
and Jordyn, we noticed they rated five bands in common. This seems to skew our distance
measurement, since the Hailey-Veronica distance is in 2 dimensions while the Hailey-Jordyn
distance is in 5 dimensions. Manhattan Distance and Euclidean Distance work best when
there are no missing values. Dealing with missing values is an active area of scholarly
research. Later in the book we will talk about how to deal with this problem. For now just be
aware of the flaw as we continue our first exploration into building a recommendation
system.
A generalization
We can generalize Manhattan Distance and Euclidean Distance to what is called the
Minkowski Distance Metric:
d(x, y) = \left( \sum_{k=1}^{n} | x_k - y_k |^r \right)^{1/r}

When

r = 1: The formula is Manhattan Distance.
r = 2: The formula is Euclidean Distance.
r = \infty: Supremum Distance
Arghhhh Math!
Many times you'll find the formula quite understandable. Let's dissect it now. When r = 1 the
formula reduces to Manhattan Distance:

d(x, y) = \sum_{k=1}^{n} | x_k - y_k |
So for the music example we have been using throughout the chapter, x and y represent two
people and d(x, y) represents the distance between them. n is the number of bands they both
rated (both x and y rated that band). We've done that calculation a few pages back:
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Bill         2         3.5      4          -        2         3.5        -         3
Difference   1.5       1.5                          3         2                    1

Manhattan Distance: 9
That difference column represents the absolute value of the difference and we sum those up
to get 9.
When r = 2, we get the Euclidean distance:

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
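The generalization fits in one small Python function; this is a sketch (pass r=1 for Manhattan and r=2 for Euclidean):

def minkowski(rating1, rating2, r):
    """Minkowski distance between two rating dictionaries,
    using only the bands both users rated."""
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key]) ** r
    return distance ** (1 / r)   # in Python 2, use 1.0 / r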
Remember,
All the code for the book is available at
www.guidetodatamining.com.
"Chan":
"Dan":
"Hailey":
"Jordyn":
"Sam":
2-15
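The computeNearestNeighbor function below relies on a manhattan function that measures the distance between two of these rating dictionaries. A minimal sketch of such a function:

def manhattan(rating1, rating2):
    """Compute the Manhattan distance between two users, where
    rating1 and rating2 are dictionaries of the form {band: rating}.
    Only bands rated by both users count."""
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
    return distance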
Now a function to find the closest person (actually this returns a sorted list with the closest
person first):
def computeNearestNeighbor(username, users):
    """creates a sorted list of users based on their distance to
    username"""
    distances = []
    for user in users:
        if user != username:
            distance = manhattan(users[user], users[username])
            distances.append((distance, user))
    # sort based on distance -- closest first
    distances.sort()
    return distances
Finally, we are going to put this all together to make recommendations. Let's say I want to
make recommendations for Hailey. I find her nearest neighbor -- Veronica in this case. I will
then find bands that Veronica has rated but Hailey has not. Also, I will assume that Hailey
would have rated the bands the same as (or at least very similar to) Veronica. For example,
Hailey has not rated the great band Phoenix. Veronica has rated Phoenix a '4' so we will
assume Hailey is likely to enjoy the band as well. Here is my function to make
recommendations.
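The book's version of this function is in the downloadable code; a minimal sketch that follows the steps just described (find the nearest neighbor, then return the bands that neighbor rated that our user has not, sorted by the neighbor's rating) looks like this:

def recommend(username, users):
    """Recommend bands based on the single nearest neighbor."""
    # first find the nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]
    recommendations = []
    # now find bands the neighbor rated that username did not
    neighborRatings = users[nearest]
    userRatings = users[username]
    for artist in neighborRatings:
        if artist not in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    # sort by the neighbor's rating, highest first
    return sorted(recommendations,
                  key=lambda artistTuple: artistTuple[1],
                  reverse=True)

Running it for Hailey with the users dictionary above gives something like [('Phoenix', 4.0), ('Blues Traveler', 3.0), ('Slightly Stoopid', 2.5)].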
That fits with our expectations. As we saw above, Hailey's nearest neighbor was Veronica and
Veronica gave Phoenix a '4'. Let's try a few more:
>>> recommend('Chan', users)
We think Chan will like The Strokes and also predict that Sam will not like Deadmau5.
Hmm. For Angelica we got back an empty set meaning we have no recommendations for her.
Let us see what went wrong:
>>> computeNearestNeighbor('Angelica', users)
[(3.5, 'Veronica'), (4.5, 'Chan'), (5.0, 'Hailey'), (8.0, 'Sam'), (9.0,
'Bill'), (9.0, 'Dan'), (9.5, 'Jordyn')]
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Bill         2         3.5      4          -        2         3.5        -         3
Chan         5         1        1          3        5         1          -         -
Dan          3         4        4.5        -        3         4.5        4         2
Hailey       -         4        1          4        -         -          4         1
Jordyn       -         4.5      4          5        5         4.5        4         4
Sam          5         2        -          3        5         4          5         -
Veronica     3         -        -          5        4         2.5        3         -
We see that Angelica rated every band that Veronica did. We have no new ratings, so no
recommendations.
Shortly, we will see how to improve the system to avoid these cases.
Jordyn seems to like everything. Her ratings range from 4 to 5. Bill seems to avoid the
extremes. His ratings range from 2 to 4.
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Bill         2         3.5      4          -        2         3.5        -         3
Chan         5         1        1          3        5         1          -         -
Dan          3         4        4.5        -        3         4.5        4         2
Hailey       -         4        1          4        -         -          4         1
Jordyn       -         4.5      4          5        5         4.5        4         4
Sam          5         2        -          3        5         4          5         -
Veronica     3         -        -          5        4         2.5        3         -
Hailey is a binary person, giving either 1s or 4s to bands.
So how do we compare, for example, Hailey to Jordyn? Does Hailey's '4' mean the same as
Jordyn's '4' or Jordyn's '5'? I would guess it is more like Jordyn's '5'. This variability can
create problems with a recommendation system.
"I absolutely love Broken Bells! They're tight! I give them a 4."

"Broken Bells is ok. I'd give them a 4."
          Blues Traveler   Norah Jones   Phoenix   The Strokes   Weird Al
Clara     4.75             4.5           5         4.25          4
Robert    4                3             5         2             1

This is an example of what is called 'grade inflation' in the data mining community. Clara's
lowest rating is 4 -- all her ratings are between 4 and 5. If we are to graph this chart it would
look like

(chart: Clara's ratings plotted against Robert's, with all five points -- Phoenix, Blues Traveler,
Norah Jones, The Strokes, and Weird Al -- falling on a straight line)
The fact that this is a straight line indicates a perfect agreement between Clara and Robert.
Both rated Phoenix as the best band, Blues Traveler next, Norah Jones after that, and so on.
The less Clara and Robert agree, the further the data points drift from a straight line:

(chart: a scatter plot of Clara's ratings against Robert's where the points no longer fall on a
single line)

Not So Good Agreement:

(chart: a scatter plot of Clara's ratings against Robert's with the points widely scattered)
The Pearson Correlation Coefficient is

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
that urge and actually look at the formula. Many formulas that on a quick glimpse look
complex are actually understandable by mere mortals.
Other than perhaps looking complex, the problem with the formula above is that the
algorithm to implement it would require multiple passes through the data. Fortunately for us
algorithmic people, there is an alternative formula, which is an approximation of Pearson:
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}{\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \sqrt{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}}
(Remember what I said two paragraphs above about not skipping over formulas.) This
formula, in addition to looking initially horribly complex, is, more importantly, numerically
unstable, meaning that what might be a small error is amplified by this reformulation. The big
plus is that we can implement it using a single-pass algorithm, which we will get to shortly.
First, let's dissect this formula and work through the example we saw a few pages back:
          Blues Traveler   Norah Jones   Phoenix   The Strokes   Weird Al
Clara     4.75             4.5           5         4.25          4
Robert    4                3             5         2             1
\sum_{i=1}^{n} x_i y_i

which is the first expression in the numerator. Here the x and y represent Clara and Robert.
          Blues Traveler   Norah Jones   Phoenix   The Strokes   Weird Al
Clara     4.75             4.5           5         4.25          4
Robert    4                3             5         2             1
For each band we are going to multiply Clara's and Robert's ratings together and sum the
results:

(4.75 x 4) + (4.5 x 3) + (5 x 5) + (4.25 x 2) + (4 x 1) = 19 + 13.5 + 25 + 8.5 + 4 = 70

The next expression in the numerator is

\frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}
So \sum_{i=1}^{n} x_i is the sum of Clara's ratings, which is 22.5. The sum of Robert's is 15 and they rated 5 bands:

\frac{22.5 \times 15}{5} = 67.5

So the numerator in the formula above is 70 - 67.5 = 2.5
Now let's dissect the denominator. First, the Clara half:

\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}

          Blues Traveler   Norah Jones   Phoenix   The Strokes   Weird Al
Clara     4.75             4.5           5         4.25          4
Robert    4                3             5         2             1

\sum_{i=1}^{n} x_i^2 = 4.75^2 + 4.5^2 + 5^2 + 4.25^2 + 4^2 = 101.875
We've already computed the sum of Clara's ratings, which is 22.5. Square that and we get
506.25. We divide that by the number of co-rated bands (5) and we get 101.25.
Putting that together:

\sqrt{101.875 - 101.25} = \sqrt{0.625} = 0.79057

We do the same for Robert's half of the denominator:

\sqrt{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}} = \sqrt{55 - 45} = 3.162277
r = \frac{2.5}{0.79057 \times 3.162277} = \frac{2.5}{2.5} = 1.00
exercise
Before going to the next page, implement the algorithm in Python. You
should get the following results.
>>> pearson(users['Angelica'], users['Bill'])
-0.90405349906826993
>>> pearson(users['Angelica'], users['Hailey'])
0.42008402520840293
>>> pearson(users['Angelica'], users['Jordyn'])
0.76397486054754316
>>>
For this implementation you will need two bits of Python: the sqrt function (square
root) and the power operator **, which raises its left argument to the
power of its right argument:
>>> from math import sqrt
>>> sqrt(9)
3.0
>>> 3**2
9
exercise - solution
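One possible implementation of the single-pass approximation (a sketch; it is not necessarily identical to the book's solution, but it reproduces the numbers above):

from math import sqrt

def pearson(rating1, rating2):
    """Approximate Pearson correlation between two rating
    dictionaries, computed in a single pass over the bands
    both users rated."""
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0
    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += x ** 2
            sum_y2 += y ** 2
    if n == 0:
        return 0
    # the denominator of the approximation formula
    denominator = (sqrt(sum_x2 - (sum_x ** 2) / n)
                   * sqrt(sum_y2 - (sum_y ** 2) / n))
    if denominator == 0:
        return 0
    return (sum_xy - (sum_x * sum_y) / n) / denominator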
Number of plays

        The King is Dead     The King of Limbs   E.T.
        (The Decemberists)   (Radiohead)         (Katy Perry)
Ann     10                   32
Ben     15                   25
Sally   12                   27
Just by eye-balling the above chart (and by using any of the distance formulas mentioned
above) we can see that Sally is more similar in listening habits to Ann than Ben is.
So my top track is Moonlight Sonata by Marcus Miller with 25 plays. Chances are that you
have played that track zero times. In fact, chances are good that you have not played any of
my top tracks. In addition, there are over 15 million tracks in iTunes and I have only four
thousand. So the data for a single person is sparse since it has relatively few non-zero
attributes (plays of a track). When we compare two people by using the number of plays of
the 15 million tracks, mostly they will have shared zeros in common. However, we do not
want to use these shared zeros when we are computing similarity.
A similar case can be made when we are comparing text
documents using words. Suppose we liked a certain
book, say Tom Corbett Space Cadet: The Space Pioneers
by Carey Rockwell and we want to find a similar book.
One possible way is to use word frequency. The
attributes will be individual words and the values of
those attributes will be the frequency of those words in
the book. So 6.13% of the words in The Space Pioneers
are occurrences of the word the, 0.89% are the word
Tom, 0.25% of the words are space. I can compute the
similarity of this book to others by using these word
frequencies. However, the same problem related to
sparseness of data occurs here. There are 6,629
unique words in The Space Pioneers and there are a
bit over one million unique words in English. So if
our attributes are English words, there will be
relatively few non-zero attributes for The Space
Pioneers or any other book. Again, any measure of similarity should not
depend on the shared-zero values.
cos(x, y) = \frac{x \cdot y}{\|x\| \times \|y\|}

where \cdot indicates the dot product and \|x\| indicates the length of the vector x. The length of a
vector is

\sqrt{\sum_{i=1}^{n} x_i^2}
Let's give this a try with the perfect agreement example used above:

          Blues Traveler   Norah Jones   Phoenix   The Strokes   Weird Al
Clara     4.75             4.5           5         4.25          4
Robert    4                3             5         2             1
then \|x\| = \sqrt{101.875} = 10.093 and \|y\| = \sqrt{55} = 7.416, so

cos(x, y) = \frac{70}{10.093 \times 7.416} = \frac{70}{74.85} = 0.935
The cosine similarity rating ranges from 1, indicating perfect similarity, to -1, indicating perfect
negative similarity. So 0.935 represents very good agreement.
Compute the Cosine Similarity between Angelica and Veronica (from our
dataset). (Consider dashes equal to zero)
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Veronica     3         -        -          5        4         2.5        3         -
Compute the Cosine Similarity between Angelica and Veronica (from our
dataset).
             Blues     Broken              Norah              Slightly   The       Vampire
             Traveler  Bells    Deadmau5   Jones    Phoenix   Stoopid    Strokes   Weekend
Angelica     3.5       2        -          4.5      5         1.5        2.5       2
Veronica     3         -        -          5        4         2.5        3         -

x = (3.5, 2, 0, 4.5, 5, 1.5, 2.5, 2)
y = (3, 0, 0, 5, 4, 2.5, 3, 0)

x \cdot y = (3.5 \times 3) + (2 \times 0) + (0 \times 0) + (4.5 \times 5) + (5 \times 4) + (1.5 \times 2.5) + (2.5 \times 3) + (2 \times 0) = 64.25

Cosine Similarity is

cos(x, y) = \frac{64.25}{8.602 \times 8.078} = \frac{64.25}{69.487} = 0.9246
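The same computation as a small Python sketch, treating each person's ratings as a fixed-length list with zeros for the bands they did not rate:

from math import sqrt

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length rating vectors."""
    dot_product = sum(xi * yi for xi, yi in zip(x, y))
    length_x = sqrt(sum(xi ** 2 for xi in x))
    length_y = sqrt(sum(yi ** 2 for yi in y))
    return dot_product / (length_x * length_y)

angelica = [3.5, 2, 0, 4.5, 5, 1.5, 2.5, 2]
veronica = [3, 0, 0, 5, 4, 2.5, 3, 0]
print(cosine_similarity(angelica, veronica))   # roughly 0.9246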
Good job, guys, nailed it!
So, if the data is dense (nearly all attributes have non-zero values) then Manhattan and
Euclidean are reasonable to use. What happens if the data is not dense? Consider an
expanded music rating system and three people, all of whom have rated 100 songs on our
site:

Linda and Eric enjoy the same kind of music. In fact, among their ratings, they have 20 songs
in common and the difference in their ratings of those 20 songs (on a scale of 1 to 5) averages
only 0.5!! The Manhattan Distance between them would be 20 x 0.5 = 10. The Euclidean
Distance would be:

\sqrt{0.5^2 \times 20} = \sqrt{5} = 2.236
Linda and Jake have rated only one song in common: Chris Cagle's What a Beautiful Day.
Linda thought it was okay and rated it a 3, Jake thought it was awesome and gave it a 5. So
the Manhattan Distance between Jake and Linda is 2 and the Euclidean Distance is

d = \sqrt{(3 - 5)^2} = \sqrt{4} = 2

So both the Manhattan and Euclidean Distances show that Jake is a closer match to Linda
than Eric is. So in this case both distance measures produce poor results.
Good idea, but that doesn't work either. To see why, we need to bring in a few more
characters into our little drama: Cooper and Kelsey. Jake, Cooper and Kelsey have amazingly
similar musical tastes. Jake has rated 25 songs on our site.
(charts comparing Jake's ratings with Cooper's and with Kelsey's)

Why?
To answer why, let us look at the following simplified example (again, a 0 means that
person did not rate that song):

(table: ratings of songs 1 through 10 by Jake, Cooper, and Kelsey, with 0 for the songs a
person did not rate)
Again, looking at the songs they mutually rated (songs 4, 5, and 6), Cooper and Kelsey seem
like equally close matches for Jake. However, Manhattan Distance using those zero values
tells a different story:
5 + 4 + 4 + 0.5 + 0 + 0.5 + 5 + 5 + 4 + 4 = 32
The problem is that these zero values tend to dominate any measure of distance. So the
solution of adding zeros is no better than the original distance formulas. One workaround
people have used is to compute -- in some sense -- an average distance: compute the distance
using the songs they rated in common and divide that by the number of songs they rated in
common.
Again, Manhattan and Euclidean work spectacularly well on dense data, but if the data is
sparse it may be better to use Cosine Similarity.
Weirdnesses
Suppose we are trying to make recommendations for Amy who loves Phoenix, Passion Pit
and Vampire Weekend. Our closest match is Bob who also loves Phoenix, Passion Pit, and
Vampire Weekend. His father happens to play accordion for the Walter Ostanek Band, this
year's Grammy winner in the polka category. Because of familial obligations, Bob gives 5
stars to the Walter Ostanek Band. Based on our current recommendation system, we think
Amy will absolutely love the band. But common sense tells us she probably won't.
Or think of Professor Billy Bob Olivera who loves to read data mining books and science
fiction. His closest match happens to be me, who also likes data mining books and science
fiction. However, I like standard poodles and have rated The Secret Lives of Standard
Poodles highly. Our current recommendation system would likely recommend that book to
the professor.
The problem is that we are relying on a single most similar person. Any quirk that person
has is passed on as a recommendation. One way of evening out those quirks is to base our
recommendations on more than one person who is similar to our user. For this we can use
the k-nearest neighbor approach.
K-nearest neighbor
In the k-nearest neighbor approach to collaborative filtering we use the k most similar people to
determine recommendations. The best value for k is application specificyou will need to do
some experimentation. Here's an example to give you the basic idea.
Suppose I would like to make recommendations for Ann and am using k-nearest neighbor
with k=3. The three nearest neighbors and their Pearson scores are shown in the following
table:
Person   Pearson
Sally    0.8
Eric     0.7
Amanda   0.5

0.8 + 0.7 + 0.5 = 2.0
Each of these three people is going to influence the recommendations. The question is how
to determine how much influence each person should have. If there is a Pie of Influence,
how big a slice should I give each person? If I add up the Pearson scores I get 2. Sally's share
is 0.8/2 or 40%. Eric's share is 35% (0.7 / 2) and Amanda's share is 25%.
Suppose Amanda, Eric, and Sally rated the band The Grey Wardens. Each person's rating is
then weighted by her slice of the Pie of Influence:

Person   Influence
Amanda   25.00%
Eric     35.00%
Sally    40.00%
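To make the Pie of Influence idea concrete, here is a minimal Python sketch. The Pearson scores are the ones from the table above; the ratings for The Grey Wardens are made-up placeholder values, since the point is only the weighted-average mechanics:

# Pearson scores of the k=3 nearest neighbors (from the table above)
pearson_scores = {'Sally': 0.8, 'Eric': 0.7, 'Amanda': 0.5}

# hypothetical ratings of The Grey Wardens by those neighbors
grey_wardens = {'Sally': 4.0, 'Eric': 5.0, 'Amanda': 4.5}

total = sum(pearson_scores.values())          # 2.0
projected_rating = sum((pearson_scores[person] / total) * grey_wardens[person]
                       for person in pearson_scores)
# each neighbor contributes in proportion to her share of the total Pearson score
print(projected_rating)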
Suppose I use the same data as above but use a k-nearest neighbor
approach with k=2. What is my projected rating for Grey Wardens?
Person   Pearson
Sally    0.8
Eric     0.7
Amanda   0.5
solution

Person   Pearson
Sally    0.8
Eric     0.7
Amanda   0.5
import codecs
from math import sqrt
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
"Norah Jones": 4.5, "Phoenix": 5.0,
"Slightly Stoopid": 1.5,
"The Strokes": 2.5, "Vampire Weekend": 2.0},
"Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
"Deadmau5": 4.0, "Phoenix": 2.0,
"Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
"Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
"Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
"Slightly Stoopid": 1.0},
"Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
"Deadmau5": 4.5, "Phoenix": 3.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 2.0},
"Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
"Norah Jones": 4.0, "The Strokes": 4.0,
"Vampire Weekend": 1.0},
"Jordyn":
class recommender:
    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data
    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id
        #
        # now load user info into both self.userid2name and
        # self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator

    def recommend(self, user):
        recommendations = {}
        # first get list of users ordered by nearness
        nearest = self.computeNearestNeighbor(user)
        #
        # now get the ratings for the user
        #
        userRatings = self.data[user]
        #
        # determine the total distance
        totalDistance = 0.0
        for i in range(self.k):
            totalDistance += nearest[i][1]
        # now iterate through the k nearest neighbors
        # accumulating their ratings
        for i in range(self.k):
            # compute slice of pie
            weight = nearest[i][1] / totalDistance
            # get the name of the person
            name = nearest[i][0]
            # get the ratings for this person
            neighborRatings = self.data[name]
            # now find bands neighbor rated that user didn't
            for artist in neighborRatings:
                if not artist in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist]
                                                   * weight)
                    else:
                        recommendations[artist] = (recommendations[artist]
                                                   + neighborRatings[artist]
                                                   * weight)
        # now make list from dictionary
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v)
                           for (k, v) in recommendations]
        # finally sort and return
        recommendations.sort(key=lambda artistTuple: artistTuple[1],
                             reverse=True)
        # Return the first n items
        return recommendations[:self.n]
A New Dataset
Ok, it is time to look at a more realistic dataset. Cai-Nicolas Ziegler collected over one million
ratings of books from the Book Crossing website. These ratings are from 278,858 users rating
271,379 books. This anonymized data is available at http://www.informatik.uni-freiburg.de/
~cziegler/BX/ both as an SQL dump and a text file of comma-separated values (CSV). I had
some problems loading this data into Python due to apparent character encoding problems.
My fixed version of the CSV files is available on this book's website.
The CSV files represent three tables:
BX-Users, which, as the name suggests, contains information about the users. There is an
integer user-id field, as well as the location (e.g., Albuquerque, NM) and age. The names
have been removed to anonymize the data.
BX-Books. Books are identified by the ISBN, book title, author, year of publication, and
publisher.
BX-Book-Ratings, which includes a user-id, book ISBN, and a rating from 0-10.
The function loadBookDB in the recommender class loads the data from these files.
Now I am going to load the book dataset. The argument to the loadBookDB function is the
path to the BX book files.
>>> r.loadBookDB('/Users/raz/Downloads/BX-Dump/')
1700018
Note: This is a large dataset and may take a bit of time to load on your computer. On my
Hackintosh (2.8 GHz i7 860 with 8GB RAM) it takes 24 seconds to load the dataset and 30
seconds to run a query.
Now I can get recommendations for user 171118, a person from Toronto:
>>> r.recommend('171118')
[("The Godmother's Web by Elizabeth Ann Scarborough", 10.0), ("The Irrational
Season (The Crosswicks Journal, Book 3) by Madeleine L'Engle", 10.0), ("The
Godmother's Apprentice by Elizabeth Ann Scarborough", 10.0), ("A Swiftly
Tilting Planet by Madeleine L'Engle", 10.0), ('The Girl Who Loved Tom Gordon by
Stephen King', 9.0), ('The Godmother by Elizabeth Ann Scarborough', 8.0)]
>>> r.userRatings('171118', 5)
Ratings for toronto, ontario, canada
2421
The Careful Writer by Theodore M. Bernstein:  10
Wonderful Life: The Burgess Shale and the Nature of History by Stephen Jay Gould:  10
Pride and Prejudice (World's Classics) by Jane Austen:  10
The Wandering Fire (The Fionavar Tapestry, Book 2) by Guy Gavriel Kay:  10
Flowering trees and shrubs: The botanical paintings of Esther Heins by Judith Leet:  10
Projects
You won't really learn this material unless you play
around with the code. Here are some suggestions of
what you might try.
1. Implement Manhattan distance and Euclidean
distance and compare the results of these three
methods (Manhattan, Euclidean, and Pearson).
2. The book website has a file containing movie
ratings for 25 movies. Create a function that loads
the data into your classifier. The recommend method
described above should recommend movies for a
specific person.
Explicit ratings
One way of distinguishing types of user preferences is whether they are explicit or implicit.
Explicit ratings are when the user herself explicitly rates the item. One example of this is the
thumbs up / thumbs down rating on sites such as Pandora and YouTube.
Implicit Ratings
For implicit ratings, we don't ask users to give any ratingswe just observe their behavior.
An example of this is keeping track of what a user clicks on in the online New York Times.
In this example, Amazon keeps track of what people click on. It knows, for example, that
people who viewed the book Jupiter's Travels: Four years around the world on a Triumph
also viewed the DVD Long Way Round, which chronicles the actor Ewan McGregor as he
travels with his mate around the world on motorcycles. As can be seen in the Amazon
screenshot above, this information is used to display the items in the section Customers who
viewed this also viewed.
Another implicit rating is what the customer actually buys. Amazon also keeps track of this
information and uses it for its "Frequently Bought Together" and "Customers Who Viewed
This Item Also Bought" recommendations:
You would think that Frequently Bought Together would lead to some unusual
recommendations but this works surprisingly well.
Imagine what information a program can acquire by monitoring your behavior in iTunes.
First, there's the fact that I added a song to iTunes. That indicates minimally that I was
interested enough in the song to do so. Then there is the Play Count information. In the
image above, I've listened to Zee Avi's Anchor 52 times. That suggests that I like that song
(and in fact I do). If I have a song in my library for a while and have only listened to it once, that
might indicate that I don't like the song.
brain calisthenics
Jim
Explicit Rating:
match.com bio:
I am a vegan. I enjoy a
fine Cabernet Sauvignon,
long walks in the woods,
reading Chekov by the
fire, French Films,
Saturdays at the art
museum, and Schumann
piano works.
Implicit Ratings:
what we found in Jim's pocket

Receipts for:
a 12-pack of Pabst Blue Ribbon beer, Whataburger, Ben and Jerry's ice cream, pizza & donuts

DVD rental receipts: Marvel's The Avengers, Resident Evil: Retribution, Ong Bak 3
Consider Mary, a college student. For some reason, she loves giving Amazon ratings. Ten
years ago she rated her favorite music albums with five stars: Giggling and Laughing: Silly
Songs for Kids, and Sesame Songs: Sing Yourself Silly! Her most recent ratings included 5
stars for Wolfgang Amadeus Phoenix and The Twilight Saga: Eclipse Soundtrack. Based on
these recent ratings she ends up being the closest neighbor to another college student Jen. It
would be odd to recommend Giggling and Laughing: Silly Songs for Kids to Jen. This is a
slightly different type of update problem than the one above, but a problem nonetheless.
brain calisthenics
A few pages ago I gave a list of items I bought at Amazon in the last month. It turns out I
bought two of those items for other people. I bought the anticancer book for my cousin and
the Rework book for my son. To see why this is a problem, let me come up with a more
compelling example by going further back in my purchase history. I bought some kettlebells
and the book Enter the Kettlebell! Secret of the Soviet Supermen as a gift for my son and a
Plush Chase Border Collie stuffed animal for my wife because our 14-year-old border collie
died. Using purchase history as an implicit rating of what a person likes might lead you to
believe that people who like kettlebells also like stuffed animals, microHelicopters, books on
anticancer, and the book Ready Player One. Amazon's purchase history can't distinguish
between purchases for myself and purchases I make as gifts. Stephen Baker describes a
related example:
brain calisthenics
Implicit Data:
Web pages:
Music players:
Keep in mind that the algorithms described in chapter 2 can be used regardless of whether
the data is explicit or implicit.
Lots of iron:
a server farm
User-based filtering.
So far we have been doing user-based collaborative filtering. We are comparing a user with
every other user to find the closest matches. There are two main problems with this
approach:

1. Scalability. As the number of users grows, so does the computation; an approach that
compares a user with every other user works for thousands of users but may not scale to
millions.
2. Sparsity. Most recommendation systems have many users and many products but the
average user rates a small fraction of the total products. For example, Amazon carries
millions of books but the average user rates just a handful of books. Because of this the
algorithms we covered in chapter 2 may not find any nearest neighbors.
Because of these two issues it might be better to do what is called item-based filtering.
Item-based filtering.
Suppose I have an algorithm that identifies products that are most similar to each other. For
example, such an algorithm might find that Phoenix's album Wolfgang Amadeus Phoenix is
similar to Passion Pit's album, Manners. If a user rates Wolfgang Amadeus Phoenix highly
we could recommend the similar album Manners. Note that this is different than what we
did for user-based filtering. In user-based filtering we had a user, found the most similar
person (or users) to that user and used the ratings of that similar person to make
recommendations. In item-based filtering, ahead of time we find the most similar items, and
combine that with a user's rating of items to generate a recommendation.
(figure: a users x items ratings matrix -- rows for users such as Tamera Young, Jasmine Abbey,
Arturo Alvarez, Cecilia De La Cueva, Jessica Nguyen, and Jordyn Zamora; columns for items 1
through m, including Phoenix and Passion Pit -- with the cells of users who rated both Phoenix
and Passion Pit highlighted)
We would like to compute the similarity of Phoenix to Passion Pit. To do this we only use
users who rated both bands as indicated by the blue squares. If we were doing user-based
filtering we would determine the similarity between rows. For item-based filtering we are
determining the similarity between columns -- in this case between the Phoenix and Passion
Pit columns.
User-based filtering is also called memory-based collaborative filtering. Why? Because we
need to store all the ratings in order to make recommendations.

Item-based filtering is also called model-based collaborative filtering. Why? Because we don't
need to store all the ratings. We build a model representing how close every item is to every
other item!
s(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}

U is the set of all users who rated both items i and j.

(R_{u,i} - \bar{R}_u)
means the rating R user u gives to item i minus the average rating that user gave for all items
she rated. This gives us the normalized rating. In the formula above for s(i,j) we are finding
the similarity between items i and j. The numerator says that for every user who rated both
items multiply the normalized rating of those two items and sum the results. In the
denominator we sum the squares of all the normalized ratings for item i and then take the
square root of that result. We do the same for item j. And then we multiply those two
together.
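As a Python sketch (the function name and the {user: {item: rating}} data layout are my assumptions, mirroring the dictionaries used in chapter 2), adjusted cosine similarity between two items could be computed like this:

from math import sqrt

def adjusted_cosine(item_i, item_j, data):
    """Adjusted cosine similarity between two items, where data
    maps each user to a dictionary of {item: rating}."""
    numerator = 0.0
    sum_sq_i = 0.0
    sum_sq_j = 0.0
    for ratings in data.values():
        # only users who rated both items count
        if item_i in ratings and item_j in ratings:
            # subtract this user's average rating over everything she rated
            avg = sum(ratings.values()) / len(ratings)
            numerator += (ratings[item_i] - avg) * (ratings[item_j] - avg)
            sum_sq_i += (ratings[item_i] - avg) ** 2
            sum_sq_j += (ratings[item_j] - avg) ** 2
    if sum_sq_i == 0 or sum_sq_j == 0:
        return 0.0
    return numerator / (sqrt(sum_sq_i) * sqrt(sum_sq_j))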
To illustrate adjusted cosine similarity we will use the following data where five students
rated five musical artists.
(table: ratings by David, Matt, Ben, Chris, and Tori for Kacey Musgraves, Imagine Dragons,
Daft Punk, Lorde, and Fall Out Boy, with a blank column for each user's average rating)

The first thing to do is to compute each user's average rating. That is easy! Go ahead and fill
that in.
(the same table with the average ratings filled in: David 3.25, Matt 3.0, Ben 2.75, Chris 3.2,
Tori 4.25)
Now for each pair of musical artists we are going to compute their similarity. Let's start with
Kacey Musgraves and Imagine Dragons. In the above table, I have circled the cases where a
user rated both bands. So the adjusted cosine similarity formula is
s(Musgraves, Dragons) = \frac{\sum_{u \in U} (R_{u,Musgraves} - \bar{R}_u)(R_{u,Dragons} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,Musgraves} - \bar{R}_u)^2} \sqrt{\sum_{u \in U} (R_{u,Dragons} - \bar{R}_u)^2}}

Using Ben's, Chris's, and Tori's ratings (the users who rated both bands):

= \frac{0.7650}{\sqrt{2.765} \sqrt{0.765}} = \frac{0.7650}{(1.6628)(0.8746)} = \frac{0.7650}{1.4543} = 0.5260
So the similarity between Kacey Musgraves and Imagine Dragons is 0.5260. I have
computed some of the other similarities and entered them in this table:
                   Fall Out Boy   Lorde     Daft Punk   Imagine Dragons
Kacey Musgraves    -0.9549                  1.0000      0.5260
Imagine Dragons    -0.3378                  0.0075
Daft Punk          -0.9570
Lorde              -0.6934
                   Fall Out Boy   Lorde     Daft Punk   Imagine Dragons
Kacey Musgraves    -0.9549        0.3210    1.0000      0.5260
Imagine Dragons    -0.3378        -0.2525   0.0075
Daft Punk          -0.9570        0.7841
Lorde              -0.6934
p(u,i) = \frac{\sum_{N \in similarTo(i)} (S_{i,N} \times R_{u,N})}{\sum_{N \in similarTo(i)} (|S_{i,N}|)}

English, please!

N is each of the items that person u rated that are similar to item i. By similar I mean that
there is a similarity score between N and i in our matrix!
Our current music ratings range from 1 to 5. Let MaxR be the maximum rating (5 in our case)
and MinR be the minimum rating (1). R_{u,N} is the current rating user u gave item N. NR_{u,N} is the
normalized rating -- the new rating on the scale of -1 to 1. The equation to normalize the rating is

NR_{u,N} = \frac{2(R_{u,N} - Min_R) - (Max_R - Min_R)}{Max_R - Min_R}

The equation to denormalize (go from the normalized rating to one in our original scale of 1 to 5)
is:

R_{u,N} = \frac{1}{2}\left((NR_{u,N} + 1)(Max_R - Min_R)\right) + Min_R
Let's say someone rated Fall Out Boy a 2. Our normalized rating would be

NR_{u,N} = \frac{2(2 - 1) - (5 - 1)}{5 - 1} = \frac{2 - 4}{4} = -0.5

and denormalizing it gets us back to the original rating:

R_{u,N} = \frac{1}{2}\left((NR_{u,N} + 1)(Max_R - Min_R)\right) + Min_R = \frac{1}{2}\left((-0.5 + 1) \times 4\right) + 1 = \frac{1}{2}(2) + 1 = 1 + 1 = 2
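As a small Python sketch of the two conversions (assuming our 1-to-5 star scale):

def normalize(rating, min_r=1, max_r=5):
    """Map a rating on the [min_r, max_r] scale to the [-1, 1] scale."""
    return (2 * (rating - min_r) - (max_r - min_r)) / (max_r - min_r)

def denormalize(nr, min_r=1, max_r=5):
    """Map a normalized [-1, 1] rating back to the [min_r, max_r] scale."""
    return 0.5 * ((nr + 1) * (max_r - min_r)) + min_r

print(normalize(2))       # -0.5
print(denormalize(-0.5))  # 2.0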
Okay. We now have that numeric Kung Fu under our belt!
David's Ratings

Artist            NR
Imagine Dragons   0
Daft Punk         1
Lorde             0.5
Fall Out Boy      -1
We will learn more about normalization in the next chapter!
Similarity Matrix
                   Fall Out Boy   Lorde     Daft Punk   Imagine Dragons
Kacey Musgraves    -0.9549        0.3210    1.0000      0.5260
Imagine Dragons    -0.3378        -0.2525   0.0075
Daft Punk          -0.9570        0.7841
Lorde              -0.6934

p(u,i) = \frac{\sum_{N \in similarTo(i)} (S_{i,N} \times NR_{u,N})}{\sum_{N \in similarTo(i)} (|S_{i,N}|)}
Plugging in David's normalized ratings for Imagine Dragons, Daft Punk, Lorde, and Fall Out
Boy, along with their similarities to Kacey Musgraves:

p(David, Musgraves) = \frac{(0.5260 \times 0) + (1.0000 \times 1) + (0.3210 \times 0.5) + (-0.9549 \times -1)}{0.5260 + 1.0000 + 0.3210 + 0.9549} \approx 0.753
So we predict that David will rate Kacey Musgraves a 0.753 on a scale of -1 to 1. To get back to
our scale of 1 to 5 we need to denormalize:
R_{u,N} = \frac{1}{2}\left((NR_{u,N} + 1)(Max_R - Min_R)\right) + Min_R = \frac{1}{2}\left((0.753 + 1) \times 4\right) + 1 = \frac{1}{2}(7.012) + 1 = 3.506 + 1 = 4.506
So we predict that David will rate Kacey Musgraves a 4.506!
Adjusted Cosine Similarity is a model-based collaborative filtering method. As mentioned a
few pages back, one advantage of these methods compared to memory-based ones is that they
tend to scale better. For large datasets, model-based methods tend to be fast and require less
memory.

Often people use rating scales differently. I may rate artists I am not keen on a 3 and artists I
like a 4. You may rate artists you dislike a 1 and artists you like a 5. Adjusted Cosine Similarity
handles this problem by subtracting the corresponding user's average rating from each rating.
Slope One
Another popular algorithm for item-based collaborative filtering is Slope One. A major
advantage of Slope One is that it is simple and hence easy to implement. Slope One was
introduced in the paper Slope One Predictors for Online Rating-Based Collaborative
Filtering by Daniel Lemire and Anna Maclachlan (http://www.daniel-lemire.com/fr/
abstracts/SDM2005.html). This is an awesome paper and well worth the read.
Here's the basic idea in a minimalist nutshell. Suppose Amy gave a rating of 3 to PSY and a
rating of 4 to Whitney Houston. Ben gave a rating of 4 to PSY. We'd like to predict how Ben
would rate Whitney Houston. In table form the problem might look like this:
       PSY   Whitney Houston
Amy    3     4
Ben    4     ?
To guess what Ben might rate Whitney Houston we could reason as follows. Amy rated
Whitney one whole point better than PSY. We can predict then that Ben would rate Whitney
one point higher so we will predict that Ben will give her a '5'.
There are actually several Slope One algorithms. I will present the Weighted Slope One
algorithm. Remember that a major advantage is that the approach is simple. What I present
may look complex, but bear with me and things should become clear. You can consider Slope
One to be in two parts. First, ahead of time (in batch mode, in the middle of the night or
whenever) we are going to compute what is called the deviation between every pair of items.
In the simple example above, this step will determine that Whitney is rated 1 better than PSY.
Now we have a nice database of item deviations. In the second phase we actually make
predictions. A user comes along, Ben for example. He has never heard Whitney Houston and
we want to predict how he would rate her. Using all the bands he did rate along with our
database of deviations we are going to make a prediction.
        Taylor Swift   PSY   Whitney Houston
Amy     4              3     4
Ben     5              2     -
Clara   -              3.5   4
Daisy   5              -     3
The first step is to compute the deviations. The average deviation of an item i with respect to
item j is:

dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i - u_j}{card(S_{i,j}(X))}
where card(S) is how many elements are in S and X is the entire set of all ratings. So
card(Sj,i(X)) is the number of people who have rated both j and i. Let's consider the deviation
of PSY with respect to Taylor Swift. In this case, card(S_{j,i}(X)) is 2 -- there are 2 people (Amy
and Ben) who rated both Taylor Swift and PSY. The u_i - u_j numerator is (that user's rating
for Taylor Swift) minus (that user's rating for PSY). So the deviation is:

dev_{swift,psy} = \frac{(4 - 3)}{2} + \frac{(5 - 2)}{2} = \frac{1}{2} + \frac{3}{2} = 2
So the deviation from PSY to Taylor Swift is 2 meaning that on average users rated Taylor
Swift 2 better than PSY. What is the deviation from Taylor Swift to PSY?
dev_{psy,swift} = \frac{(3 - 4)}{2} + \frac{(2 - 5)}{2} = -\frac{1}{2} - \frac{3}{2} = -2
                   Taylor Swift   PSY   Whitney Houston
Taylor Swift                      2
PSY                -2
Whitney Houston
dev_{swift,houston} = \frac{(4 - 4)}{2} + \frac{(5 - 3)}{2} = \frac{0}{2} + \frac{2}{2} = 1

dev_{psy,houston} = \frac{(3 - 4)}{2} + \frac{(3.5 - 4)}{2} = \frac{-1}{2} + \frac{-0.5}{2} = -0.75
                   Taylor Swift   PSY     Whitney Houston
Taylor Swift                      2       1
PSY                -2                     -0.75
Whitney Houston    -1             0.75
brain calisthenics
Consider our streaming music site with one million users rating 200,000
artists. If we get a new user who rates 10 artists do we need to run the
algorithm again to generate the deviations of all 200k x 200k pairs or is
there an easier way?
(answer on next page)
brain calisthenics
Consider our streaming music site with one million users rating 200,000 artists. If we
get a new user who rates 10 artists do we need to run the algorithm again to generate
the deviations of all 200k x 200k pairs or is there an easier way?
You don't need to run the algorithm on the entire dataset again.
That's the beauty of this method. For a given pair of items we only
need to keep track of the deviation and the total number of people
rating both items.
For example, suppose I have a deviation of Taylor Swift with respect
to PSY of 2 based on 9 people. If a new person comes along who rated
Taylor Swift 5 and PSY 1, the updated deviation would be
((9 * 2) + (5 - 1)) / 10 = 22 / 10 = 2.2
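To make that update concrete, here is a minimal sketch in Python. The function name
updateDeviation is my own for illustration; it is not a method of the book's recommender class.

def updateDeviation(deviation, frequency, newRating1, newRating2):
    """Fold one new co-rating into an existing deviation of item1 with
    respect to item2 that was based on `frequency` people."""
    total = deviation * frequency + (newRating1 - newRating2)
    frequency += 1
    return total / frequency, frequency

# the example above: deviation of 2 based on 9 people, new ratings 5 and 1
print(updateDeviation(2, 9, 5, 1))   # (2.2, 10)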
P^{wS1}(u)_j = \frac{\sum_{i \in S(u)-\{j\}} (dev_{j,i} + u_i)\, c_{j,i}}{\sum_{i \in S(u)-\{j\}} c_{j,i}}

where P^{wS1}(u)_j means our prediction using Weighted Slope One of user u's rating for item j. So, for
example, P^{wS1}(Ben)_{Whitney Houston} means our prediction for what Ben would rate Whitney
Houston.

Let's say I am interested in answering that question: How might Ben rate Whitney Houston?

The sum over i \in S(u)-\{j\}
means for every musician i that Ben has rated (except for Whitney Houston, that is the -\{j\} bit).
The entire numerator means: for every musician i that Ben has rated (except for Whitney
Houston) we will look up the deviation of Whitney Houston with respect to that musician and add
that to Ben's rating for musician i. We multiply that by the cardinality of that pair, the
number of people that rated both musicians (Whitney and musician i).
NUMERATOR

              Taylor Swift    PSY    Whitney Houston
Ben                5           2            ?

                    Taylor Swift    PSY     Whitney Houston
Taylor Swift                         2            1
PSY                     -2                      -0.75
Whitney Houston         -1         0.75

1. Ben has rated Taylor Swift and gave her a 5 (that is the u_i).
2. The deviation of Whitney Houston with respect to Taylor Swift is -1 (this is the dev_{j,i}).
3. dev_{j,i} + u_i then is 4.
4. Looking at page 3-19 we see that there were two people (Amy and Daisy) that rated both
   Taylor Swift and Whitney Houston, so c_{j,i} = 2.
5. So (dev_{j,i} + u_i) c_{j,i} = 4 x 2 = 8.
6. Ben has rated PSY and gave him a 2.
7. The deviation of Whitney Houston with respect to PSY is 0.75.
8. dev_{j,i} + u_i then is 2.75.
9. Two people rated both Whitney Houston and PSY, so (dev_{j,i} + u_i) c_{j,i} = 2.75 x 2 = 5.5.
10. We sum up steps 5 and 9 to get 13.5 for the numerator
DENOMINATOR

11. Dissecting the denominator we get something like: for every musician that Ben has rated,
    sum the cardinalities of those musicians (how many people rated both that musician and
    Whitney Houston). So Ben has rated Taylor Swift, and the cardinality of Taylor Swift and
    Whitney Houston (that is, the total number of people that rated both of them) is 2. Ben
    has rated PSY and that cardinality is also 2. So the denominator is 4.

12. So our prediction of how well Ben will like Whitney Houston is

    \frac{13.5}{4} = 3.375
dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i - u_j}{card(S_{i,j}(X))}
So the input to our computeDeviations function should be data in the format of users2 above.
The output should be a representation of the following data:
                    Taylor Swift    PSY        Whitney Houston
Taylor Swift                        2 (2)          1 (2)
PSY                    -2 (2)                    -0.75 (2)
Whitney Houston        -1 (2)      0.75 (2)
The number in the parentheses is the frequency (that is, the number of people that rated that
pair of musicians). So for each pair of musicians we need to record both the deviation and the
frequency.
That pseudocode looks pretty nice but as you can see, there is a disconnect between the data
format expected by the pseudocode and the format the data is really in (see users2 above as
an example). As code warriors we have two possibilities: either alter the format of the data,
or revise the pseudocode. I am going to opt for the second approach. The revised pseudocode
looks like

def computeDeviations(self):
    for each person in the data:
        get their ratings
        for each item & rating in that set of ratings:
            for each item2 & rating2 in that set of ratings:
                add the difference between the ratings to our computation
Step 1:

def computeDeviations(self):
    # for each person in the data:
    #    get their ratings
    for ratings in self.data.values():
Python dictionaries (aka hash tables) are key-value pairs. self.data is a dictionary. The
values method extracts just the values from the dictionary. Our data looks like
users2 = {"Amy": {"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4},
"Ben": {"Taylor Swift": 5, "PSY": 2},
"Clara": {"PSY": 3.5, "Whitney Houston": 4},
"Daisy": {"Taylor Swift": 5, "Whitney Houston": 3}}
So the first time through the loop ratings = {"Taylor Swift": 4, "PSY": 3,
"Whitney Houston": 4}.
Step 2

def computeDeviations(self):
    # for each person in the data:
    #    get their ratings
    for ratings in self.data.values():
        # for each item & rating in that set of ratings:
        for (item, rating) in ratings.items():
            self.frequencies.setdefault(item, {})
            self.deviations.setdefault(item, {})
The Python dictionary method setdefault takes 2 arguments: a key and an initialValue. This
method does the following. If the key does not exist in the dictionary it is added to the
dictionary with the value initialValue. Otherwise it returns the current value of the key.
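If setdefault is new to you, here is a quick illustration of that behavior (a tiny throwaway
example, not part of the recommender code):

frequencies = {}
frequencies.setdefault("Taylor Swift", {})   # key absent: adds "Taylor Swift" -> {}
frequencies["Taylor Swift"]["PSY"] = 2
frequencies.setdefault("Taylor Swift", {})   # key present: returns the existing dict, unchanged
print(frequencies)                           # {'Taylor Swift': {'PSY': 2}}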
Step 3

def computeDeviations(self):
    # for each person in the data:
    #    get their ratings
    for ratings in self.data.values():
        # for each item & rating in that set of ratings:
        for (item, rating) in ratings.items():
            self.frequencies.setdefault(item, {})
            self.deviations.setdefault(item, {})
            # for each item2 & rating2 in that set of ratings:
            for (item2, rating2) in ratings.items():
                if item != item2:
                    # add the difference between the ratings
                    # to our computation
                    self.frequencies[item].setdefault(item2, 0)
                    self.deviations[item].setdefault(item2, 0.0)
                    self.frequencies[item][item2] += 1
                    self.deviations[item][item2] += rating - rating2
The code added in this step computes the difference between two ratings and adds that to the
self.deviations running sum. Again, using the data:
{"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4}
when we are in the outer loop where item = 'Taylor Swift' and rating = 4, and in the inner
loop where item2 = 'PSY' and rating2 = 3, the last line of the code above adds 1 to
self.deviations['Taylor Swift']['PSY'].
Step 4:
Finally, we need to iterate through self.deviations to divide each deviation by its associated
frequency.
def computeDeviations(self):
    # for each person in the data:
    #    get their ratings
    for ratings in self.data.values():
        # for each item & rating in that set of ratings:
        for (item, rating) in ratings.items():
            self.frequencies.setdefault(item, {})
            self.deviations.setdefault(item, {})
            # for each item2 & rating2 in that set of ratings:
            for (item2, rating2) in ratings.items():
                if item != item2:
                    # add the difference between the ratings
                    # to our computation
                    self.frequencies[item].setdefault(item2, 0)
                    self.deviations[item].setdefault(item2, 0.0)
                    self.frequencies[item][item2] += 1
                    self.deviations[item][item2] += rating - rating2

    for (item, ratings) in self.deviations.items():
        for item2 in ratings:
            ratings[item2] /= self.frequencies[item][item2]

dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i - u_j}{card(S_{i,j}(X))}
I get
>>> r = recommender(users2)
>>> r.computeDeviations()
>>> r.deviations
{'PSY': {'Taylor Swift': -2.0, 'Whitney Houston': -0.75}, 'Taylor
Swift': {'PSY': 2.0, 'Whitney Houston': 1.0}, 'Whitney Houston':
{'PSY': 0.75, 'Taylor Swift': -1.0}}
which is what we obtained when we computed this example by hand:
                    Taylor Swift    PSY     Whitney Houston
Taylor Swift                         2            1
PSY                     -2                      -0.75
Whitney Houston         -1         0.75
P^{wS1}(u)_j = \frac{\sum_{i \in S(u)-\{j\}} (dev_{j,i} + u_i)\, c_{j,i}}{\sum_{i \in S(u)-\{j\}} c_{j,i}}

The big question I have is: can we beat the 18-line implementation of computeDeviations?
First, let's parse that formula and put it into English and/or pseudocode. You try:
[(self.convertProductID2name(k), v / frequencies[k])
 for (k, v) in recommendations.items()]
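The book's full 18-line method is not reproduced here; only the list comprehension above
survives. A sketch of a slopeOneRecommendations method consistent with that fragment and
with the Weighted Slope One formula might look like this (it assumes the deviations and
frequencies dictionaries built by computeDeviations, and the convertProductID2name method
used in the fragment above; it is not necessarily the book's exact code):

def slopeOneRecommendations(self, userRatings):
    """Weighted Slope One predictions for every item the user has not rated.
    userRatings is a dict of the user's ratings, e.g. {"Taylor Swift": 5, "PSY": 2}."""
    recommendations = {}
    frequencies = {}
    # for every item i the user rated
    for (userItem, userRating) in userRatings.items():
        # for every item j with deviations, that the user has not rated
        for (diffItem, diffRatings) in self.deviations.items():
            if diffItem not in userRatings and userItem in self.deviations[diffItem]:
                freq = self.frequencies[diffItem][userItem]
                recommendations.setdefault(diffItem, 0.0)
                frequencies.setdefault(diffItem, 0)
                # numerator term: (dev(j,i) + u_i) * c(j,i)
                recommendations[diffItem] += (diffRatings[userItem] + userRating) * freq
                # denominator term: c(j,i)
                frequencies[diffItem] += freq
    recommendations = [(self.convertProductID2name(k), v / frequencies[k])
                       for (k, v) in recommendations.items()]
    # best predictions first
    recommendations.sort(key=lambda ratingTuple: ratingTuple[1], reverse=True)
    return recommendations

For Ben's ratings this sketch yields 13.5 / 4 = 3.375 for Whitney Houston, matching the hand
computation above.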
This result matches what we calculated by hand. So the recommendation part of the
algorithm weighs in at 18 lines. So in 36 lines of Python code we implemented the Slope One
algorithm. With Python you can write pretty compact code.
First, I will load the data into the Python recommender object:
>>> r = recommender(0)
>>> r.loadMovieLens('/Users/raz/Downloads/ml-100k/')
102625
I will be using the info from User 1. Just to peruse the data, I will look at the top 50 items
that user 1 rated:

>>> r.showUserTopItems('1', 50)
When Harry Met Sally... (1989)                      5
Jean de Florette (1986)                             5
Godfather, The (1972)                               5
Big Night (1996)                                    5
Manon of the Spring (Manon des sources) (1986)      5
Sling Blade (1996)                                  5
Breaking the Waves (1996)                           5
Terminator 2: Judgment Day (1991)                   5
Searching for Bobby Fischer (1993)                  5
...
Now I will do the first step of Slope One: computing the deviations:
>>> r.computeDeviations()
Projects
1. See how well the Slope One recommender recommends movies for
you. Rate 10 movies or so (ones that are in the MovieLens dataset).
Does the recommender suggest movies you might like?
2. Implement Adjusted Cosine Similarity. Compare its performance to
Slope One.
3. (harder) I run out of memory (I have 8GB on my desktop) when I
try to run this on the Book Crossing Dataset. Recall that there are
270,000 books that are rated. So we would need a 270,000 x
270,000 dictionary to store the deviations. That's roughly 73 billion
dictionary entries. How sparse is this dictionary for the MovieLens
dataset? Alter the code so we can handle larger datasets.
These recommenders can create a rich-get-richer effect for popular products and vice-versa
for unpopular ones: Daniel Fleder & Kartik Hosanagar. 2009. "Blockbuster Culture's Next Rise
or Fall: The Impact of Recommender Systems on Sales Diversity." Management Science, vol. 55.
In this chapter we look at a different approach. Consider the streaming music site, Pandora.
In Pandora, as many of you know, you can set up different streaming radio stations. You seed
each station with an artist and Pandora will play music that is similar to that artist. I can
create a station seeded with the band Phoenix. It then plays bands it thinks are similar to
Phoenix; for example, it plays a tune by El Ten Eleven. It doesn't do this with collaborative
filtering (that is, because people who listened to Phoenix also listened to El Ten Eleven). It plays
El Ten Eleven because the algorithm believes that El Ten Eleven is musically similar to
Phoenix. In fact, we can ask Pandora why it played a tune by the group:

It plays El Ten Eleven's tune "My Only Swerving" on the Phoenix station because "Based on
what you told us so far, we're playing this track because it features repetitive melodic
phrasing, mixed acoustic and electric instrumentation, major key tonality, electric guitar riffs
and an instrumental arrangement." On my Hiromi station it plays a tune by E.S.T. because
"it features classic jazz roots, a well-articulated piano solo, light drumming, an interesting
song structure and interesting part writing."
Pandora bases its recommendation on what it calls The Music Genome Project. They hire
professional musicians with a solid background in music theory as analysts who determine
the features (they call them 'genes') of a song. These analysts are given over 150 hours of
training. Once trained they spend an average of 20-30 minutes analyzing a song to determine
its genes/features. Many of these genes are technical.

The analyst provides values for over 400 genes. It's a very labor-intensive process and
approximately 15,000 new songs are added per month.
The genre and mood features might use scales like these:

    genre:  1 = Country   2 = Jazz   3 = Rock   4 = Soul   5 = Rap
    mood:   1 = melancholy   2 = joyful   3 = passion   4 = angry   5 = unknown

So a genre value of 4 means Soul and a mood value of 3 means passion. Suppose I have a
rock song that is melancholy, for example the gag-inducing "You're Beautiful" by James Blunt.
In 2D space, inked quickly on paper, that would look as follows:
FACT: In a Rolling Stone poll on the Most Annoying Songs ever, "You're Beautiful" placed #7!
Let's say Tex just absolutely loves You're Beautiful and we would like to recommend a song to
him.
Let me populate our dataset with more songs. Song 1 is a jazz song that is melancholy; Song 2
is a soul song that is angry and Song 3 is a jazz song that is angry. Which would you
recommend to Tex?
Song 1 looks
closest!
I hope you see that we have a fatal flaw in our scheme. Let's take a look at the possible values
for our variables again:
    mood:   1 = melancholy   2 = joyful   3 = passion   4 = angry   5 = unknown
    genre:  1 = Country   2 = Jazz   3 = Rock   4 = Soul   5 = Rap

If we are trying to use any distance metrics with this scheme we are saying that jazz is closer
to rock than it is to soul (the distance between jazz and rock is 1 and the distance between
jazz and soul is 2). Or melancholy is closer to joyful than it is to angry. Even when we
rearrange values the problem remains.
    mood:   1 = melancholy   2 = angry   3 = passion   4 = joyful   5 = unknown
    genre:  1 = Country   2 = Jazz   3 = Soul   4 = Rap   5 = Rock
Re-ordering does not solve the problem. No matter how we rearrange the values this won't
work. This shows us that we have chosen our features poorly. We want features where the
values fall along a meaningful scale. We can easily fix our genre feature by dividing it into 5
separate features, one for country, another for jazz, etc., as sketched below.
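To make that fix concrete, here is a tiny hypothetical sketch (the feature names and numbers
are invented for illustration, not taken from Pandora): instead of one genre value, each genre
gets its own feature with its own meaningful scale.

# a melancholy rock song, with genre split into separate features
youre_beautiful = {"country": 1, "jazz": 1, "rock": 5, "soul": 2, "rap": 1,
                   "melancholy": 5}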
This is exactly how Pandora constructs its gene set. The values of most genes are on a scale of
1-5 with integer increments. Genes are arranged into categories. For example, there is a
musical qualities category which contains genes for Blues Rock Qualities, Folk Rock
Qualities, and Pop Rock Qualities among others. Another category is instruments with genes
such as Accordion, Dirty Electric Guitar Riffs and Use of Dirty Sounding Organs. Using these
genes, each of which has a well-defined set of values from 1 to 5, Pandora represents each
song as a vector of 400 numeric values (each song is a point in a 400 dimensional space).
Now Pandora can make recommendations (that is, decide to play a song on a user-defined
radio station) based on standard distance functions like those we already have seen.
A simple example
Let us create a simple dataset so we can explore this approach. Suppose we have seven
features, each one ranging from 1-5 in integer increments (I admit this isn't a very rational
or complete selection):
Amount of piano
Amount of vocals
Driving beat
Blues Influence
Presence of dirty electric
guitar
Presence of backup vocals
Rap Influence
Now, using those features I rate ten tunes:
                                 Piano  Vocals  Driving  Blues  Dirty elec.  Backup  Rap
                                                 beat    infl.    guitar     vocals  infl.
Dr. Dog/Fate                      2.5     4      3.5      3         5           4     1
Phoenix/Lisztomania                2      5       5       3         2           1     1
Heartless Bastards/Out at Sea      1      5       4       2         4           1     1
Todd Snider/Don't Tempt Me         4      5       4       4         1           5     1
The Black Keys/Magic Potion        1      4       5      3.5        5           1     1
Glee Cast/Jessie's Girl            1      5      3.5      3         4           5     1
La Roux/Bulletproof                5      5       4       2         1           1     1
Mike Posner/Cooler than Me        2.5     4       4       1         1           1     1
Black Eyed Peas/Rock That Body     2      5       5       1         2           2     4
Lady Gaga/Alejandro                1      5       3       2         1           2     1
Thus, each tune is represented as a list of numbers and we can use any distance function to
compute the distance between tunes. For example, the Manhattan Distance between Dr.
Dog's Fate and Phoenix's Lisztomania is:

                        Piano  Vocals  Driving  Blues  Dirty elec.  Backup  Rap
                                        beat    infl.    guitar     vocals  infl.
Dr. Dog/Fate             2.5     4      3.5      3         5          4      1
Phoenix/Lisztomania       2      5       5       3         2          1      1
difference               0.5     1      1.5      0         3          3      0

Manhattan Distance = 9
                                    Euclidean distance to
                                    Glee's "Jessie's Girl"
Dr. Dog/Fate                              2.291
Phoenix/Lisztomania                       4.822
Heartless Bastards/Out at Sea             4.153
Todd Snider/Don't Tempt Me                4.387
The Black Keys/Magic Potion               4.528
Black Eyed Peas/Rock That Body            5.408
La Roux/Bulletproof                       6.500
Mike Posner/Cooler than Me                5.701
Lady Gaga/Alejandro                       4.387
Recall that the Euclidean Distance between any two objects, x and y,
which have n attributes is:

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
Here is the computation for the distance between Glee's "Jessie's Girl" and Lady Gaga's
"Alejandro":

             piano  vocals  beat  blues  guitar  backup  rap
Glee           1      5     3.5     3      4       5      1
Lady Gaga      1      5      3      2      1       2      1
(x - y)        0      0     0.5     1      3       3      0
(x - y)^2      0      0     0.25    1      9       9      0

SUM  = 19.25
SQRT = 4.387
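The same computation in Python is short. This is a minimal sketch using plain lists of feature
values in the order shown above (the function name euclidean is my own):

from math import sqrt

def euclidean(vector1, vector2):
    """Euclidean distance between two equal-length lists of feature values."""
    return sqrt(sum((v1 - v2) ** 2 for v1, v2 in zip(vector1, vector2)))

# Glee Cast/Jessie's Girl vs. Lady Gaga/Alejandro, features as in the table above
glee = [1, 5, 3.5, 3, 4, 5, 1]
gaga = [1, 5, 3,   2, 1, 2, 1]
print(euclidean(glee, gaga))   # 4.387...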
         ...
                         "blues": 2, "guitar": 1,
                         "backup vocals": 1, "rap": 1},
         "Mike Posner": {"piano": 2.5, "vocals": 4, "beat": 4,
                         "blues": 1, "guitar": 1, "backup vocals": 1,
                         "rap": 1},
         "Black Eyed Peas/Rock That Body": {"piano": 2, "vocals": 5,
                                            "beat": 5, "blues": 1,
                                            "guitar": 2,
                                            "backup vocals": 2,
                                            "rap": 4},
         "Lady Gaga/Alejandro": {"piano": 1, "vocals": 5, "beat": 3,
                                 "blues": 2, "guitar": 1,
                                 "backup vocals": 2, "rap": 1}}
Now suppose I have a friend who says he likes The Black Keys' Magic Potion. I can plug that
into my handy Manhattan distance function:
>>> computeNearestNeighbor('The Black Keys/Magic Potion', music)
[(4.5, 'Heartless Bastards/Out at Sea'), (5.5, 'Phoenix/Lisztomania'),
(6.5, 'Dr Dog/Fate'), (8.0, "Glee Cast/Jessie's Girl"), (9.0, 'Mike
Posner'), (9.5, 'Lady Gaga/Alejandro'), (11.5, 'Black Eyed Peas/Rock
That Body'), (11.5, 'La Roux/Bulletproof'), (13.5, "Todd Snider/Don't
Tempt Me")]
and I can recommend to him Heartless Bastards' Out at Sea. This is actually a pretty good
recommendation.
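The actual functions are in the downloadable code; a minimal sketch of what a
computeNearestNeighbor based on Manhattan distance might look like for the dictionary
format above is (my own sketch, assuming music is the full dictionary whose tail appears
above):

def manhattan(rating1, rating2):
    """Manhattan distance between two dicts mapping feature name -> value."""
    return sum(abs(rating1[key] - rating2[key]) for key in rating1)

def computeNearestNeighbor(itemName, itemDict):
    """Return (distance, name) pairs for every other item, nearest first."""
    distances = [(manhattan(itemDict[itemName], itemDict[other]), other)
                 for other in itemDict if other != itemName]
    distances.sort()
    return distances

print(computeNearestNeighbor('The Black Keys/Magic Potion', music))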
NOTE:
The code for this example, as well as all
examples in this book, is available on the
book website
http://www.guidetodatamining.com
We can do the same. Remember our friend who liked The Black Keys' Magic Potion and we
recommended Heartless Bastards' Out at Sea. What features influenced that
recommendation? We can compare the two feature vectors:

                                 Piano  Vocals  Driving  Blues  Dirty elec.  Backup  Rap
                                                 beat    infl.    guitar     vocals  infl.
The Black Keys/Magic Potion        1      4       5      3.5        5          1      1
Heartless Bastards/Out at Sea      1      5       4       2         4          1      1
difference                         0      1       1      1.5        1          0      0

The features that are closest between the two tunes are piano, presence of backup vocals, and
rap influence: they all have a distance of zero. However, all are on the low end of the scale:
no piano, no presence of backup vocals, and no rap influence, and it probably would not be
helpful to say "We think you would like this tune because it lacks backup vocals." Instead, we
will focus on what the tunes have in common on the high end of the scale.
Because our data set has few features, and is not well-balanced, the other recommendations
are not as compelling:
>>> computeNearestNeighbor("Phoenix/Lisztomania", music)
[(5, 'Heartless Bastards/Out at Sea'), (5.5, 'Mike Posner'), (5.5, 'The
Black Keys/Magic Potion'), (6, 'Black Eyed Peas/Rock That Body'), (6,
'La Roux/Bulletproof'), (6, 'Lady Gaga/Alejandro'), (8.5, "Glee Cast/
Jessie's Girl"), (9.0, 'Dr Dog/Fate'), (9, "Todd Snider/Don't Tempt
Me")]
>>> computeNearestNeighbor("Lady Gaga/Alejandro", music)
[(5, 'Heartless Bastards/Out at Sea'), (5.5, 'Mike Posner'), (6, 'La
Roux/Bulletproof'), (6, 'Phoenix/Lisztomania'), (7.5, "Glee Cast/
Jessie's Girl"), (8, 'Black Eyed Peas/Rock That Body'), (9, "Todd
Snider/Don't Tempt Me"), (9.5, 'The Black Keys/Magic Potion'), (10.0,
'Dr Dog/Fate')]
A problem of scale
Suppose I want to add another feature to my set. This time I will add beats per minute (or
bpm). This makes some senseI might like fast beat songs or slow ballads. Now my data
would look like this:
                                   Piano  Vocals  Driving  Blues  Dirty elec.  Backup  Rap   bpm
                                                   beat    infl.    guitar     vocals  infl.
Dr. Dog/Fate                        2.5     4      3.5      3         5           4     1    140
Phoenix/Lisztomania                  2      5       5       3         2           1     1    110
Heartless Bastards/Out at Sea        1      5       4       2         4           1     1    130
The Black Keys/Magic Potion          1      4       5      3.5        5           1     1     88
Glee Cast/Jessie's Girl              1      5      3.5      3         4           5     1    120
Bad Plus/Smells like Teen Spirit                                                               90

Without using beats per minute, the closest match to The Black Keys' Magic Potion is
Heartless Bastards' Out at Sea, and the tune furthest away is Bad Plus's version of Smells Like
Teen Spirit. However, once we add beats per minute, it wreaks havoc with our distance
function: bpm dominates the calculation. Now Bad Plus is closest to The Black Keys simply
because the bpm of the two tunes are close.
Consider another example. Suppose I have a dating site and I have the weird idea that the
best attributes to match people up by are salary and age.
   gals                                    guys
   name         age    salary             name          age    salary
   Yun L         35    75,000             Brian A        53    70,000
   Allie C       52    55,000             Abdullah K     25    105,000
   Daniela C     27    45,000             David A        35    69,000
   Rita A        37    115,000            Michael W      48    43,000
Here the scale for age ranges from 25 to 53 for a difference of 28 and the salary scale ranges
from 43,000 to 115,000 for a difference of 72,000. Because these scales are so different,
salary dominates any distance calculation. If we just tried to eyeball matches we might
recommend David to Yun since they are the same age and their salaries are fairly close.
However, if we went by any of the distance formulas we have covered, 53-year old Brian
would be the person recommended to Yun. This does not look good for my fledgling dating
site.
Normalization

No need to panic. Relax. The solution is normalization! To remove this bias we need to
standardize or normalize the data. One common method of normalization involves having
the values of each feature range from 0 to 1.
For example, consider the salary attribute in our dating example. The minimum salary was
43,000 and the max was 115,000. That makes the range from minimum to maximum
72,000. To convert each value to a value in the range 0 to 1 we subtract the minimum from
the value and divide by the range.
   gals
   name         salary      normalized salary
   Yun L        75,000           0.444
   Allie C      55,000           0.167
   Daniela C    45,000           0.028
   Rita A       115,000          1.0
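Here is a minimal sketch of that 0-to-1 rescaling in Python (the function name
normalizeToRange is my own, not from the book's classifier code):

def normalizeToRange(values):
    """Rescale a list of numbers so they fall between 0 and 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

salaries = [75000, 55000, 45000, 115000, 70000, 105000, 69000, 43000]
print(normalizeToRange(salaries))   # Yun's 75,000 becomes (75000 - 43000) / 72000 = 0.444...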
If you have taken a statistics course you will be familiar with more accurate methods for
standardizing data. For example, we can use what is called the Standard Score, which can be
computed as follows:

Standard Score = \frac{value - mean}{standard\ deviation}

where the Standard Deviation is

sd = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{card(x)}}

card(x) is the cardinality of x, that is, how many values there are.
Consider the data from the dating site example a few pages back.
   name          salary
   Yun L         75,000
   Allie C       55,000
   Daniela C     45,000
   Rita A        115,000
   Brian A       70,000
   Abdullah K    105,000
   David A       69,000
   Michael W     43,000

The sum of all the salaries is 577,000. Since there are 8 people, the mean is 72,125.

sd = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{card(x)}}

so that would be

sd = \sqrt{\frac{(75,000 - 72,125)^2 + (55,000 - 72,125)^2 + (45,000 - 72,125)^2 + ...}{8}} = 24,543.01

Yun's Standard Score is then

\frac{75,000 - 72,125}{24,543.01} = \frac{2,875}{24,543.01} = 0.117
Can you compute the Standard Scores for the following people?

   name         salary      Standard Score
   Yun L        75,000          0.117
   Allie C      55,000
   Daniela C    45,000
   Rita A       115,000
Can you compute the Standard Scores for the following people?

   name         salary      Standard Score
   Yun L        75,000          0.117
   Allie C      55,000         -0.698
   Daniela C    45,000         -1.105
   Rita A       115,000         1.747

Allie:    (55,000 - 72,125) / 24,543.01 = -0.698
Daniela:  (45,000 - 72,125) / 24,543.01 = -1.105
Rita:     (115,000 - 72,125) / 24,543.01 = 1.747
Not a bad average wage at LargeMart. As you can see, the mean is greatly influenced by
outliers.
Because of this problem with the mean, the standard score formula is often modified.
asd = \frac{1}{card(x)} \sum_i |x_i - \tilde{x}|

where \tilde{x} is the median.
Okay, let's give this a try. In the table below I've arranged our salaries from lowest to
highest. Since there are an equal number of values, the median is the average of the two
middle values:

median = \frac{69,000 + 70,000}{2} = 69,500

   Name          Salary
   Michael W     43,000
   Daniela C     45,000
   Allie C       55,000
   David A       69,000
   Brian A       70,000
   Yun L         75,000
   Abdullah K    105,000
   Rita A        115,000

asd = \frac{1}{card(x)} \sum_i |x_i - \tilde{x}|

asd = \frac{1}{8}(|43,000 - 69,500| + |45,000 - 69,500| + |55,000 - 69,500| + ...)
    = \frac{1}{8}(26,500 + 24,500 + 14,500 + 500 + ...)
    = \frac{1}{8}(153,000) = 19,125
The Modified Standard Score is then

mss = \frac{x_i - \tilde{x}}{asd}

The following table shows the play count of various tracks I played. Can
you standardize the values using the Modified Standard Score?

   track                                         play count    modified standard score
   Power / Marcus Miller                             21
   I Breathe In, I Breathe Out / Chris Cagle         15
   Blessed / Jill Scott                              12
   Europa / Santana                                   3
   Santa Fe / Beirut                                  7
The following table shows the play count of various tracks I played. Can
you standardize the values using the Modified Standard Score?
Step 1. Computing the median.
I put the values in order (3, 7, 12, 15, 21) and select the middle value, 12.
The median is 12.
Step 2. Computing the Absolute Standard Deviation.

asd = \frac{1}{5}(|3 - 12| + |7 - 12| + |12 - 12| + |15 - 12| + |21 - 12|)
    = \frac{1}{5}(9 + 5 + 0 + 3 + 9) = \frac{1}{5}(26) = 5.2
Step 3. Computing the Modified Standard Scores.
Power / Marcus Miller: (21 - 12) / 5.2 = 9/5.2 = 1.7307692
I Breathe In, I Breathe Out / Chris Cagle: (15 - 12) / 5.2 = 3/5.2 = 0.5769231
Blessed / Jill Scott: (12 - 12) / 5.2 = 0
Europa / Santana: (3 - 12) / 5.2 = -9 / 5.2 = -1.7307692
Santa Fe / Beirut: (7 - 12) / 5.2 = - 5 / 5.2 = -0.961538
To normalize or not.
Normalization makes sense when the scale of the features (the scales of the different
dimensions) significantly varies. In the music example earlier in the chapter there were a
number of features that ranged from one to five and then beats-per-minute that could
potentially range from 60 to 180. In the dating example, there was also a mismatch of scale
between the features of age and salary.
Suppose I am dreaming of being rich and looking at homes in the Santa Fe, New Mexico area.
   asking price    bedrooms    bathrooms    sq. ft.
   $1,045,000                     2.0        1,860
   $1,895,000                     4.0        2,907
   $3,300,000                     7.0       10,180
   $6,800,000                     6.0        8,653
   $2,250,000                     2.0        1,030
Consider a person giving thumbs up and thumbs down ratings to news articles on a news
site. Here a list representing a user's ratings consists of binary values (1 = thumbs up; 0 =
thumbs down):
Bill = {0, 0, 0, 1, 1, 1, 1, 0, 1, 0 }
Obviously there is no need to normalize this data. What about the Pandora case: all variables
lie on a scale from 1 to 5 inclusive. Should we normalize or not? It probably wouldn't hurt the
accuracy of the algorithm if we normalized, but keep in mind that there is a computational
cost involved with normalizing. In this case, we might empirically compare results between
using the regular and normalized data and select the best performing approach. Later in this
chapter we will see a case where normalization reduces accuracy.
Back to Pandora
In the Pandora-inspired example, we had each song represented by a number of attributes. If
a user creates a radio station for Green Day we decide what to play based on a nearest
neighbor approach. Pandora allows a user to give a particular tune a thumbs up or thumbs
down rating. How do we use the information that a user gives a thumbs up for a particular
song?

Suppose I use 2 attributes for songs: the amount of dirty guitar and the presence of a driving
beat, both rated on a 1-5 scale. A user has given the thumbs up to 5 songs indicating he liked
the song (indicated on the following chart with an 'L'), and a thumbs down to 5 songs
indicating he disliked the song (indicated by a 'D').

Do you think the user will like or dislike the song indicated by the ? in this chart?
I am guessing you said he would like the song. We base this on the fact that the ? is closer to
the Ls in the chart than the Ds. We will spend the rest of this chapter and the next describing
computational approaches to this idea. The most obvious approach is to find the nearest
neighbor of the ? and predict that it will share the class of the nearest neighbor. The
question mark's nearest neighbor is an L, so we would predict that the ? tune is something
the user would like.
                                 Piano  Vocals  Driving  Blues  Dirty elec.  Backup  Rap
                                                 beat    infl.    guitar     vocals  infl.
Dr. Dog/Fate                      2.5     4      3.5      3         5           4     1
Phoenix/Lisztomania                2      5       5       3         2           1     1
Heartless Bastards/Out at Sea      1      5       4       2         4           1     1
Todd Snider/Don't Tempt Me         4      5       4       4         1           5     1
The Black Keys/Magic Potion        1      4       5      3.5        5           1     1
Glee Cast/Jessie's Girl            1      5      3.5      3         4           5     1
La Roux/Bulletproof                5      5       4       2         1           1     1
Mike Posner/Cooler than Me        2.5     4       4       1         1           1     1
Black Eyed Peas/Rock That Body     2      5       5       1         2           2     4
Lady Gaga/Alejandro                1      5       3       2         1           2     1
In the music dictionary we used earlier, the strings piano, vocals, beat, blues, guitar, backup
vocals, and rap occur multiple times as keys; if I have 100,000 tunes those strings are
repeated 100,000 times. I'm going to remove those strings from the representation of our
data and simply use vectors:
#
# the item vector represents the attributes: piano, vocals,
# beat, blues, guitar, backup vocals, rap
#
items = {"Dr Dog/Fate": [2.5, 4, 3.5, 3, 5, 4, 1],
"Phoenix/Lisztomania": [2, 5, 5, 3, 2, 1, 1],
"Heartless Bastards/Out at Sea": [1, 5, 4, 2, 4, 1, 1],
"Todd Snider/Don't Tempt Me": [4, 5, 4, 4, 1, 5, 1],
"The Black Keys/Magic Potion": [1, 4, 5, 3.5, 5, 1, 1],
"Glee Cast/Jessie's Girl": [1, 5, 3.5, 3, 4, 5, 1],
"La Roux/Bulletproof": [5, 5, 4, 2, 1, 1, 1],
"Mike Posner": [2.5, 4, 4, 1, 1, 1, 1],
"Black Eyed Peas/Rock That Body": [2, 5, 5, 1, 2, 2, 4],
"Lady Gaga/Alejandro": [1, 5, 3, 2, 1, 2, 1]}
My way of representing thumbs up as L for like and thumbs down as D is arbitrary. You
could use 0 and 1, like and dislike.
In order to use the new vector format for songs I need to revise the Manhattan Distance and
the computeNearestNeighbor functions.
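The revised functions are in the downloadable code; a minimal sketch for the vector format
above might look like this (my own sketch, not necessarily the book's exact code):

def manhattan(vector1, vector2):
    """Manhattan distance between two equal-length lists of feature values."""
    return sum(abs(v1 - v2) for v1, v2 in zip(vector1, vector2))

def computeNearestNeighbor(itemName, itemVector, items):
    """Return (distance, itemName) pairs for every item in items, nearest first."""
    distances = [(manhattan(itemVector, items[other]), other)
                 for other in items if other != itemName]
    distances.sort()
    return distances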
Finally, I need to create a classify function. I want to predict how a particular user would rate
an item represented by itemName and itemVector. For example:
"Chris Cagle/ I Breathe In. I Breathe Out"
[1, 5, 2.5, 1, 1, 5, 1]
(NOTE: To better format the Python example below, I will use the string Cagle to represent
that singer and song pair.)
The first thing the function needs to do is find the nearest neighbor of this Chris Cagle tune.
Then it needs to see how the user rated that nearest neighbor and predict that the user will
rate Chris Cagle the same. Here's my rudimentary classify function:
def classify(user, itemName, itemVector):
    """Classify the itemName based on user ratings
    Should really have items and users as parameters"""
    # first find nearest neighbor
    nearest = computeNearestNeighbor(itemName, itemVector, items)[0][1]
    rating = users[user][nearest]
    return rating
Ok. Let's give this a try. I wonder if Angelica will like Chris Cagle's I Breathe In, I Breathe
Out?
We are predicting she will like it! Why are we predicting that?
We are predicting that Angelica will like Chris Cagle's I Breathe In, I Breathe Out because
that tune's nearest neighbor is Lady Gaga's Alejandro and Angelica liked that tune.

What we have done here is build a classifier. In this case, our task was to classify tunes as
belonging to one of two groups: the like group and the dislike group.
Attention, Attention.
We just built a classifier!!
Next, we checked whether Angelica liked or disliked Alejandro. She liked it. So we predict
that Angelica will also like the Chris Cagle tune, I Breathe In, I Breathe Out.
Targeted Marketing
Similar to political microtargeting.
Instead of a broad advertising campaign
to sell my expensive Vegas time share
luxury condos, can I identify likely
buyers and market just to them? Even
better if I can identify subgroups of likely
buyers and I can really tailor my ads to
specific groups.
What sport?
To give you a preview of what we will be working on in the next few chapters let us work with
an easier example than those given on the previous pageclassifying what sport various
world-class women athletes play based solely on their height and weight. In the following
table I have a small sample dataset drawn from a variety of web sources.
   Name                Sport         Age    Height    Weight
   Asuka Teramoto      Gymnastics     16      54        66
   Brittainey Raven    Basketball     22      72       162
   Chen Nan            Basketball     30      78       204
   Gabby Douglas       Gymnastics     16      49        90
   Helalia Johannes    Track          32      65        99
   Irina Miketenko     Track          40      63       106
   Jennifer Lacy       Basketball     27      75       175
   Kara Goucher        Track          34      67       123
   Linlin Deng         Gymnastics     16      54        68
   Nakia Sanford       Basketball     34      76       200
   Nikki Blue          Basketball     26      68       163
   Qiushuang Huang     Gymnastics     20      61        95
   Rebecca Tunney      Gymnastics     16      58        77
   Rene Kalmer         Track          32      70       108
   Shanna Crossley     Basketball     26      70       155
   Shavonte Zellous    Basketball     24      70       155
   Tatyana Petrova     Track          29      63       108
   Tiki Gelana         Track          25      65       106
   Valeria Straneo     Track          36      66        97
   Viktoria Komova     Gymnastics     17      61        76
The gymnastic data lists some of the top participants in the 2012 and 2008 Olympics. The
basketball players play for teams in the WNBA. The women track stars were finishers in the
2012 Olympic marathon. Granted this is a trivial example but it will allow us to apply some
of the techniques we have learned.
As you can see, I've included age in the table. Just scanning the data you can see that age
alone is a moderately good predictor. Try to guess the sports of these athletes.
The answers
Candace Parker plays basketball for the WNBA's Los Angeles Sparks and Russia's UMMC
Ekaterinburg. McKayla Maroney was a member of the U.S. Women's Gymnastics Team and
won a Gold and a Silver. Olivera Jevtić is a Serbian long-distance runner who competed in
the 2008 and 2012 Olympics. Lisa Jane Weightman is an Australian long-distance runner
who also competed in the 2008 and 2012 Olympics.

You just performed classification: you predicted the class of objects based on their
attributes. (In this case, predicting the sport of athletes based on a single attribute, age.)
brain calisthenics
Suppose I want to guess what sport a person plays based on their height
and weight. My database is small: only two people. Nakia
Sanford, the center for the Women's National Basketball
Association team Phoenix Mercury, is 6'4" and weighs
200 pounds. Sarah Beale, a forward on England's
National Rugby Team, is 5'10" and weighs 190.
Based on that database, I want to classify Catherine
Spencer as either a basketball player or rugby player.
She is 5'10" and weighs 200 pounds. What sport do you
think she plays?
Test Data.
Let us remove age from the picture. Here is a group of individuals I would like to classify:
   Name                 Height    Weight    Sport
   Crystal Langhorne      74        190
   Li Shanshan            64        101
   Kerri Strug            57         87
   Jaycie Phelps          60         97
   Kelly Miller           70        140
   Zhu Xiaolin            67        123
   Lindsay Whalen         69        169
   Koko Tsurumi           55         75
   Paula Radcliffe        68        120
   Erin Thorn             69        144

Let's build a classifier!
Python Coding
Instead of hard-coding the data in the Python code, I decided to put the data for this example
into two files: athletesTrainingSet.txt and athletesTestSet.txt.
I am going to use the data in the
athletesTrainingSet.txt file to build the classifier.
The data in the athletesTestSet.txt file will be used
to evaluate this classifier. In other words, each entry
in the test set will be classified by using all the
entries in the training set.
Asuka Teramoto      Gymnastics    54    66
Brittainey Raven    Basketball    72    162
Chen Nan            Basketball    78    204
Gabby Douglas       Gymnastics    49    90
Each line of the text represents an object described as a tab-separated list of values. I want
my classifier to use a persons height and weight to predict what sport that person plays. So
the last two columns are the numerical attributes I will use in the classifier and the second
column represents the class that object is in. The athlete's name is not used by the classifier. I
don't try to predict what sport a person plays based on their name and I am not trying to
predict the name from some attributes.
Hey, you look what...
maybe five foot eleven
and 150? I bet your
name is Clara Coleman.
However, keeping the name might be useful as a means of explaining the classifier's decision
to users: "We think Amelia Pond is a gymnast because she is closest in height and weight to
Gabby Douglas, who is a gymnast."
As I said, I am going to write my Python code to not be so hard coded to a particular example
(for example, to only work for the athlete example). To help meet this goal I am going to add
an initial header line to the athlete training set file that will indicate the function of each
column. Here are the first few lines of that file:
comment             class         num   num
Asuka Teramoto      Gymnastics    54    66
Brittainey Raven    Basketball    72    162
Any column labeled comment will be ignored by the classifier; a column labeled class
represents the class of the object, and columns labeled num indicate numerical attributes of
that object.
brain calisthenics: How do you think we should represent this data in Python? Here are some
possibilities (or come up with your own representation).

a dictionary of the form:

{'Asuka Teramoto': ('Gymnastics', [54, 66]),
 'Brittainey Raven': ('Basketball', [72, 162]), ...}
This is not a very good representation of our data. The key for the dictionary is
the athlete's name, which we do not even use in the calculations.

a list of lists of the form:

[['Asuka Teramoto', 'Gymnastics', 54, 66],
 ['Brittainey Raven', 'Basketball', 72, 162], ...]

This is not a bad representation. It mirrors the input file and since the nearest
neighbor algorithm requires us to iterate through the list of objects, a list makes
sense.

a list of tuples of the form:

[('Gymnastics', [54, 66], ['Asuka Teramoto']),
 ('Basketball', [72, 162], ['Brittainey Raven']), ...]

I like this representation better than the above since it separates the attributes
into their own list and makes the division between class, attributes, and comments
precise. I made the comment (the name in this case) a list since there could be
multiple columns that are comments.
code it

Before we can standardize the data using the Modified Standard Score we need methods that
will compute the median and absolute standard deviation of numbers in a list:

>>> heights = [54, 72, 78, 49, 65, 63, 75, 67, 54]
>>> median = classifier.getMedian(heights)
>>> median
65
>>> asd = classifier.getAbsoluteStandardDeviation(heights, median)
>>> asd
8.0

AssertionError? See next page.
The getMedian function you are to complete initially looks like this:
def getMedian(self, alist):
    """return median of alist"""
    """TO BE DONE"""
    return 0
So initially, getMedian returns 0 as the median for any list. You are to complete getMedian
so it returns the correct value. In the unitTest procedure, I call getMedian with the list
[54, 72, 78, 49, 65, 63, 75, 67, 54]
The assert statement in unitTest says the value returned by getMedian should equal 65. If
it does, execution continues to the next line and
getMedian and getAbsoluteStandardDeviation work correctly
is printed. If they are not equal the program terminates with an error:
If you download the code from the book's website and run it without making any changes,
you will get this error. Once you have correctly implemented getMedian and
getAbsoluteStandardDeviation this error will disappear.

This use of assert as a means of testing software components is a common technique among
software developers.

"It is important that each part of the specification be turned into a piece of code that
implements it and a test that tests it. If you don't have tests like these then you don't know
when you are done, you don't know if you got it right, and you don't know that any future
changes might be breaking something." - Peter Norvig
Solution
Here is one way of writing these algorithms:
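The solution code appears as an image in the original; one possible implementation along the
lines described below (both would live inside the classifier class) is:

def getMedian(self, alist):
    """return the median of alist"""
    if alist == []:
        return []
    blist = sorted(alist)
    length = len(blist)
    if length % 2 == 1:
        # odd number of values: take the middle one
        return blist[int(((length + 1) / 2) - 1)]
    else:
        # even number of values: average the two middle ones
        v1 = blist[int(length / 2)]
        v2 = blist[(int(length / 2) - 1)]
        return (v1 + v2) / 2.0

def getAbsoluteStandardDeviation(self, alist, median):
    """given alist and median, return the absolute standard deviation"""
    total = 0
    for item in alist:
        total += abs(item - median)
    return total / len(alist)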
As you can see my getMedian method first sorts the list before finding the median. Because I
am not working with huge data sets I think this is a fine solution. If I wanted to optimize my
code, I might replace this with a selection algorithm.
Right now, the data is read from the file athletesTrainingSet.txt and stored in the list data in
the classifier with the following format:
[('Gymnastics', [54, 66], ['Asuka Teramoto']),
 ('Basketball', [72, 162], ['Brittainey Raven']),
 ('Basketball', [78, 204], ['Chen Nan']),
 ('Gymnastics', [49, 90], ['Gabby Douglas']),
 ...
Now I would like to normalize the vector so the list data in the classifier contains normalized
values. For example,
In the for loop we want to normalize the data, column by column. So the first time through
the loop we will normalize the height column, and the next time through, the weight column.
code it
Solution
Here is an implementation of the normalizeColumn method:
def normalizeColumn(self, columnNumber):
    """given a column number, normalize that column in self.data"""
    # first extract values to list
    col = [v[1][columnNumber] for v in self.data]
    median = self.getMedian(col)
    asd = self.getAbsoluteStandardDeviation(col, median)
    #print("Median: %f  ASD = %f" % (median, asd))
    self.medianAndDeviation.append((median, asd))
    for v in self.data:
        v[1][columnNumber] = (v[1][columnNumber] - median) / asd
You can see I also store the median and absolute standard deviation of each column in the
list medianAndDeviation. I use this information when I want to use the classifier to
predict the class of a new instance. For example, suppose I want to predict what sport is
played by Kelly Miller, who is 5 feet 10 inches and weighs 140. The first step is to convert her
height and weight to Modified Standard Scores. That is, her original attribute vector is
[70, 140].

After processing the training data, the value of medianAndDeviation is

[(65.5, 5.95), (107.0, 33.65)]

meaning the data in the first column of the vector has a median of 65.5 and an absolute
standard deviation of 5.95; the second column has a median of 107 and a deviation of 33.65.

I use this info to convert the original vector [70, 140] to one containing Modified Standard
Scores. The computation for the first attribute is
mss = \frac{x_i - \tilde{x}}{asd} = \frac{70 - 65.5}{5.95} = \frac{4.5}{5.95} = 0.7563
and for the second attribute

mss = \frac{x_i - \tilde{x}}{asd} = \frac{140 - 107}{33.65} = \frac{33}{33.65} = 0.98068
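The method that performs this conversion is not shown above; a sketch (assuming the
medianAndDeviation list built in normalizeColumn) might be:

def normalizeVector(self, v):
    """Convert a new instance vector to Modified Standard Scores using the
    (median, asd) pairs stored in self.medianAndDeviation."""
    vector = list(v)
    for i in range(len(vector)):
        (median, asd) = self.medianAndDeviation[i]
        vector[i] = (vector[i] - median) / asd
    return vector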
The final bit of code to write is the part that predicts the class of a new instancein our
current example, the sport a person plays. To determine the sport played by Kelly Miller,
who is 5 feet 10 inches (70 inches) and weighs 140, we would call

classifier.classify([70, 140])
code it
Solution
The implementation of the nearestNeighbor methods turns out to be very short.
def manhattan(self, vector1, vector2):
    """Computes the Manhattan distance."""
    return sum(map(lambda v1, v2: abs(v1 - v2), vector1, vector2))
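Only the manhattan method is reproduced above; the companion methods might look
something like this sketch (assuming a normalizeVector method like the one sketched
earlier; this is not necessarily the book's exact code):

def nearestNeighbor(self, itemVector):
    """return the (distance, instance) pair of the training instance
    nearest to itemVector"""
    return min([(self.manhattan(itemVector, item[1]), item)
                for item in self.data])

def classify(self, itemVector):
    """predict the class of itemVector: normalize it, then return the
    class of its nearest neighbor"""
    return self.nearestNeighbor(self.normalizeVector(itemVector))[1][0]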
That's it!!!
We have written a nearest neighbor classifier in roughly 200 lines of Python.
In the complete code which you can download from our website, I have included a function,
test, which takes as arguments a training set file and a test set file and prints out how well
the classifier performed. Here is how well the classifier did on our athlete data:
>>> test("athletesTrainingSet.txt", "athletesTestSet.txt")
-   Track         Aly Raisman      Gymnastics    62    115
+   Basketball    Diana Taurasi    Basketball    72    163
<snip>
-   Track         Hannah Whelan    Gymnastics    63    117
+   Gymnastics    Jaycie Phelps    Gymnastics    60    97

80.00% correct
As you can see, the classifier was 80% accurate. It performed perfectly on predicting
basketball players but made four errors between track and gymnastics.
All the data sets described in the book are available on the books website:
guidetodatamining.com. This allows you to download the data and
experiment with the algorithm. Does normalizing the data improve or
worsen the accuracy? Does having more data in the training set improve
results? What effect does switching to Euclidean Distance have?
REMEMBER: Any learning that takes place happens in your brain, not mine.
The more you interact with the material in the book, the more you will
learn.
The Iris data set looks like this (species is what the classifier is trying to predict):
   Sepal length    Sepal width    Petal length    Petal width    Species
       5.1             3.5            1.4             0.2        I. setosa
       4.9             3.0            1.4             0.2        I. setosa
There were 120 instances in the training set and 30 in the test set (none of the test set
instances were in the training set).
Again, a fairly impressive result considering how simple our classifier is. Interestingly,
without normalizing the data the classifier is 100% accurate. We will explore this
normalization problem in more detail in a later chapter.
   mpg    cylinders    c.i.    HP    weight    secs. 0-60    make/model
    30                  68     49     1867        19.5       fiat 128
    45                  90     48     2085        21.7       vw rabbit (diesel)
    20                 307    130     3504        12
In the modified version of the data, we are trying to predict mpg, which is a discrete category
(with values 10, 15, 20, 25, 30, 35, 40, and 45) using the attributes cylinders, displacement,
horsepower, weight, and acceleration.
Heads Up on Normalization

In this chapter we talked about the importance of normalizing data. This is critical when
attributes have drastically different scales (for example, income and age). In order to get
accurate distance measurements, we should rescale the attributes so they all have the same
scale.

While most data miners use the term normalization to refer to this rescaling, others make a
distinction between normalization and standardization. For them, normalization means
scaling values so they lie on a scale from 0 to 1, using

\frac{value - min}{max - min}

Standardization, on the other hand, refers to scaling an attribute so the average (mean or
median) is 0, and other values are deviations from this average (standard deviation or
absolute standard deviation). So for these data miners, Standard Score and Modified
Standard Score are examples of standardization.
You say normalize, I say standardize.
code it
Can you modify our classifier code so that it normalizes the attributes
using the formula on our previous page?
You can test its accuracy with our three data sets:
   data set    classifier built using    classifier built using
               no normalization          Modified Standard Score
   Athletes          80.00%                     80.00%
   Iris             100.00%                     93.33%
   MPG               32.00%                     56.00%
my results

   data set    using no         using 0-1         using Modified
               normalization    normalization     Standard Score
   Athletes        80.00%           60.00%             80.00%
   Iris           100.00%           83.33%             93.33%
   MPG             32.00%           36.00%             56.00%
Hmm. These are disappointing results compared with using Modified Standard
Score.
Once we build a classifier, we might be interested in answering some questions about it such
as:
People in data mining never test with the data they used to train the system.
You can see why we don't use the training data for testing if we consider the nearest neighbor
algorithm. If Marissa Coleman, the basketball player from the above example, was in our
training data, she, at 6 foot 1 and 160 pounds, would be the nearest neighbor of herself. So
when evaluating a nearest neighbor algorithm, if our test set is a subset of our training data
we would always be close to 100% accurate. More generally, in evaluating any data mining
algorithm, if our test set is a subset of our training data the results will be optimistic and
often overly optimistic. So that doesn't seem like a great idea.
How about the idea we used in the last chapter? We divide our data into two parts. The larger
part we use for training and the smaller part we use for evaluation. As it turns out that has
its problems too. We could be extremely unlucky in how we divide up our data. For example,
all the basketball players in our test set might be short (like Debbie Black who is only 5 foot 3
and weighs 124 pounds) and get classified as marathoners. And all the track people in the
test set might be short and lightweight for that sport like Tatyana Petrova (5 foot 3 and 108
pounds) and get classified as gymnasts. With a test set like this, our accuracy will be poor. On
the other hand, we could be very lucky in our selection of a test set. Every person in the test
set is the prototypical height and weight for their respective sports and our accuracy is near
100%. In either case, the accuracy based on a single test set may not reflect the true accuracy
when our classifier is used with new data.
A solution to this problem might be to repeat the process a number of times and average the
results. For example, we might divide the data in half. Let's call the parts Part 1 and Part 2:
We can use the data in Part 1 to train our classifier and the data in Part 2 to test it. Then we
will repeat the process, this time training with Part 2 and testing with Part 1. Finally we
average the results. One problem with this though, is that we are only using 1/2 the data for
training during each iteration. But we can fix this by increasing the number of parts. For
example, we can have three parts and for each iteration we will train on 2/3 of the data and
test on 1/3. So it might look like this
So we will put 50 basketball players in each bucket and 50 non-players. Each bucket holds
information on 100 individuals.

During each iteration we hold back one of the buckets. For iteration 1, we hold back
bucket 1; for iteration 2, bucket 2; and so on.

We train the classifier with data from the other buckets (during the first iteration we train
with the data in buckets 2 through 10).

We test the classifier we just built using data from the bucket we held back and save the
results. In our case these results might be:

    35 of the basketball players were classified correctly
    29 of the non-basketball players were classified correctly
Often we will put the final results in a table that looks like this:

                              classified as a      classified as not a
                              basketball player    basketball player
   basketball players               372                  128
   not basketball players           220                  280

So of the 500 basketball players 372 of them were classified correctly. One thing we could do
is add things up and say that of the 1,000 people we classified 652 (372 + 280) of them
correctly. So our accuracy is 65.2%. The measures we obtain using ten-fold cross-validation
are more likely to be truly representative of the classifier's performance compared with
two-fold or three-fold cross-validation. This is so because each time we train the classifier we
are using 90% of our data compared with using only 50% for two-fold cross-validation.
Leave-One-Out
In the machine learning literature, n-fold cross validation (where n is the number of samples
in our data set) is called leave-one-out. We already mentioned one benefit of leave-one-out:
at every iteration we are using the largest possible amount of our data for training. The other
benefit is that it is deterministic.
I am happy to report
that the classifier was 73.69%
accurate!!

The classifier was only
71.27% accurate.
Hmm. They did not get the same results. Did Emily or Li make a mistake? Not necessarily. In
10-fold cross validation we place the data randomly into 10 buckets. Since there is this
random element, it is likely that Emily and Li did not divide the data into buckets in exactly
the same way. In fact, it is highly unlikely that they did. So when they train the classifier, they
are not using exactly the same data and when they test this classifier they are using different
test sets. So it is quite logical that they would get different results. This result has nothing to
do with the fact that two different people were performing the evaluation. If Lucy herself ran
10-fold cross validation twice, she too would get slightly different results. The reason we get
different results is that there is a random component to placing the data into buckets. So
10-fold cross validation is called non-deterministic because when we run the test again we are
not guaranteed to get the same result. In contrast, the leave-one-out method is deterministic.
Every time we use leave-one-out on the same classifier and the same data we will get the
same result. That is a good thing!
Stratification.

Let us return to an example from the previous chapter: building a classifier that predicts
what sport a woman plays (basketball, gymnastics, or track). When training the classifier we
want the training data to be representative and contain data from all three classes. Suppose
we assign data to the training set in a completely random way. It is possible that no
basketball players would be included in the training set and because of this, the resulting
classifier would not be very good at classifying basketball players. Or consider creating a data
set of 100 athletes. First we go to the Women's NBA website and write down the info on 33
basketball players; next we go to Wikipedia and get 33 women who competed in gymnastics
at the 2012 Olympics and write that down; finally, we go again to Wikipedia to get
information on women who competed in track at the Olympics and record data for 34 people.
So our dataset looks like this:
[The figure shows the resulting data file: the 33 women basketball players listed first, then
the 33 women gymnasts, then the 34 women marathoners.]
Let's say we are doing 10-fold cross validation. We start at the beginning of the list and put
every ten people in a different bucket. In this case we have 10 basketball players in both the
first and second buckets. The third bucket has both basketball players and gymnasts. The
fourth and fifth buckets solely contain gymnasts, and so on. None of our buckets are
representative of the dataset as a whole and you would be correct in thinking this would skew
our results. The preferred method of assigning instances to buckets is to make sure that the
classes (basketball players, gymnasts, marathoners) are represented in the same proportions
as they are in the complete dataset. Since one-third of the complete dataset consists of
basketball players, one-third of the entries in each bucket should also be basketball players.
And one-third of the entries should be gymnasts and one-third marathoners. This is called
stratification and this is a good thing. The problem with the leave-one-out evaluation
method is that necessarily all the test sets are non-stratified since they contain only one
instance. In sum, while leave-one-out may be appropriate for very small datasets, 10-fold
cross validation is by far the most popular choice.
Confusion Matrices

So far, we have been evaluating our classifier by computing the percent accuracy, that is, the
number of instances classified correctly divided by the total number of instances.
The name confusion matrix comes from the observation that it is easy for us to see where our
algorithm gets confused. Let's look at an example using our women athlete domain. Suppose
we have a dataset that consists of attributes for 100 women gymnasts, 100 players in the
Women's National Basketball Association, and 100 women marathoners. We evaluate the
classifier using 10-fold cross-validation. In 10-fold cross-validation we use each instance of
our dataset exactly once for testing. The results of this test might be the following confusion
matrix:

                        gymnast    basketball player    marathoner
   gymnast                 83              0                17
   basketball player        0             92                 8
   marathoner               9             16                75
Again, the real class of each instance is represented by the rows; the class predicted by our
classifier is represented by the columns. So, for example, 83 instances of gymnasts were
classified correctly as gymnasts but 17 were misclassified as marathoners. 92 basketball
players were classified correctly as basketball players but 8 were misclassified as
marathoners. 75 marathoners were classified correctly but 9 were misclassified as gymnasts
and 16 misclassified as basketball players.
The diagonal of the confusion matrix represents instances that were classified correctly.

                        gymnast    basketball player    marathoner
   gymnast                 83              0                17
   basketball player        0             92                 8
   marathoner               9             16                75

accuracy = \frac{83 + 92 + 75}{300} = \frac{250}{300} = 83.33\%
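That accuracy computation is just the sum of the diagonal divided by the total. A tiny sketch
using the matrix above stored as a dictionary of dictionaries (my own representation for
illustration):

confusion = {"gymnast":           {"gymnast": 83, "basketball player": 0,  "marathoner": 17},
             "basketball player": {"gymnast": 0,  "basketball player": 92, "marathoner": 8},
             "marathoner":        {"gymnast": 9,  "basketball player": 16, "marathoner": 75}}

correct = sum(confusion[c][c] for c in confusion)               # the diagonal: 250
total = sum(sum(row.values()) for row in confusion.values())    # all instances: 300
print("%.2f%% accurate" % (100 * correct / total))              # 83.33% accurate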
It is easy to inspect the matrix to get an idea of what type of errors our classifier is making. In
this example, it seems our algorithm is pretty good at distinguishing between gymnasts and
basketball players. Sometimes gymnasts and basketball players get misclassified as
marathoners and marathoners occasionally get misclassified as gymnasts or basketball
players.
Confusion matrices
are not that
confusing!
A programming example
Let us go back to a dataset we used in the last chapter, the Auto Miles Per Gallon data set
from Carnegie Mellon University. The format of the data looked like:
   mpg    cylinders    c.i.    HP    weight    secs. 0-60    make/model
    30                  68     49     1867        19.5       fiat 128
    45                  90     48     2085        21.7       vw rabbit (diesel)
    20                 307    130     3504        12
I am trying to predict the miles per gallon of a vehicle based on number of cylinders,
displacement (cubic inches), horsepower, weight, and acceleration. I put all 392 instances in
a file named mpgData.txt and wrote the following short Python program that divided the
data into ten buckets using a stratified method. (Both the data file and Python code are
available on the website guidetodatamining.com.)
import random

def buckets(filename, bucketName, separator, classColumn):
    """the original data is in the file named filename
    bucketName is the prefix for all the bucket names
    separator is the character that divides the columns
    (for ex., a tab or comma) and classColumn is the column
    that indicates the class"""

    # put the data in 10 buckets
    numberOfBuckets = 10
    data = {}
    # first read in the data and divide by category
    with open(filename) as f:
        lines = f.readlines()
    for line in lines:
        if separator != '\t':
            line = line.replace(separator, '\t')
        # first get the category
        category = line.split()[classColumn]
        data.setdefault(category, [])
        data[category].append(line)
    # initialize the buckets
    buckets = []
    for i in range(numberOfBuckets):
        buckets.append([])
    # now for each category put the data into the buckets
    for k in data.keys():
        # randomize order of instances for each class
        random.shuffle(data[k])
        bNum = 0
        # divide into buckets
        for item in data[k]:
            buckets[bNum].append(item)
            bNum = (bNum + 1) % numberOfBuckets
    # write to file
    for bNum in range(numberOfBuckets):
        f = open("%s-%02i" % (bucketName, bNum + 1), 'w')
        for item in buckets[bNum]:
            f.write(item)
        f.close()

buckets("mpgData.txt", 'mpgData', '\t', 0)
5-13
Executing this code will produce ten files labelled mpgData-01, mpgData-02, etc.
code it
Can you revise the nearest neighbor code from the last chapter so the
function test performs 10-fold cross validation on the 10 data files we
just created (you can download them at guidetodatamining.com)?
Your program should output a confusion matrix that looks something like:
[a confusion matrix with columns = predicted MPG and rows = actual MPG, using the bins
10, 15, 20, 25, 30, 35, 40, 45]

53.316% accurate
total of 392 instances
The filenames of the buckets will be something like mpgData-01, mpgData-02, etc. In this
case, bucketPrefix will be mpgData. testBucketNumber is the bucket containing the
test data. If testBucketNumber is 3, the classifier will be trained on buckets 1, 2, 4, 5, 6, 7,
8, 9, and 10. dataFormat is a string specifying how to interpret the columns in the data. For
example,
"class!
num!
num!
num!
num!
num!
comment"
specifies that the first column represents the class of the instance. The next 5 columns
represent numerical attributes of the instance and the final column should be interpreted as
a comment.
The complete, new initializer method is as follows:
5-15
import copy

class Classifier:
    def __init__(self, bucketPrefix, testBucketNumber, dataFormat):
        """ a classifier will be built from files with the bucketPrefix
        excluding the file with testBucketNumber. dataFormat is a
        string that describes how to interpret each line of the data
        files. For example, for the mpg data the format is:
        "class\tnum\tnum\tnum\tnum\tnum\tcomment"
        """
        self.medianAndDeviation = []
        # reading the data in from the file
        self.format = dataFormat.strip().split('\t')
        self.data = []
        # for each of the buckets numbered 1 through 10:
        for i in range(1, 11):
            # if it is not the bucket we should ignore, read the data
            if i != testBucketNumber:
                filename = "%s-%02i" % (bucketPrefix, i)
                f = open(filename)
                lines = f.readlines()
                f.close()
                for line in lines:
                    fields = line.strip().split('\t')
                    ignore = []
                    vector = []
                    for j in range(len(fields)):
                        if self.format[j] == 'num':
                            vector.append(float(fields[j]))
                        elif self.format[j] == 'comment':
                            ignore.append(fields[j])
                        elif self.format[j] == 'class':
                            classification = fields[j]
                    self.data.append((classification, vector, ignore))
        self.rawData = copy.deepcopy(self.data)
        # get length of instance vector
        self.vlen = len(self.data[0][1])
        # now normalize the data
        for i in range(self.vlen):
            self.normalizeColumn(i)
5-16
testBucket method
Next, we write a new method that will test the data in one bucket:
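A minimal sketch of such a method (assuming the Classifier class above has the classify
method from the previous chapter; the details here are my own, not necessarily the code on
the website) might look like:

    def testBucket(self, bucketPrefix, bucketNumber):
        """Evaluate the classifier on the bucket file bucketPrefix-bucketNumber and
        return a dictionary of the form {realClass: {predictedClass: count}}"""
        filename = "%s-%02i" % (bucketPrefix, bucketNumber)
        totals = {}
        with open(filename) as f:
            lines = f.readlines()
        for line in lines:
            fields = line.strip().split('\t')
            vector = []
            classColumn = -1
            for i in range(len(self.format)):
                if self.format[i] == 'num':
                    vector.append(float(fields[i]))
                elif self.format[i] == 'class':
                    classColumn = i
            realClass = fields[classColumn]
            classifiedAs = self.classify(vector)
            totals.setdefault(realClass, {})
            totals[realClass].setdefault(classifiedAs, 0)
            totals[realClass][classifiedAs] += 1
        return totals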
This takes as input a bucketPrefix and a bucketNumber. If the prefix is "mpgData" and the
number is 3, the test data will be read from the file mpgData-03. testBucket will return a
dictionary in the following format:
{'35': {'35': 1, '20': 1, '30': 1},
 '40': {'30': 1},
 '30': {'35': 3, '30': 1, '45': 1, '25': 1},
 '15': {'20': 3, '15': 4, '10': 1},
 '10': {'15': 1},
 '20': {'15': 2, '20': 4, '30': 2, '25': 1},
 '25': {'30': 5, '25': 3}}
5-17
The key of this dictionary represents the true class of the instances. For example, the first line
represents results for instances whose true classification is 35 mpg. The value for each key is
another dictionary that represents how our classifier classified the instances. For example,
the line
'15': {'20': 3, '15': 4, '10': 1},
represents a test where 3 of the instances that were really 15mpg were misclassified as
20mpg, 4 were classified correctly as 15mpg, and 1 was classified incorrectly as 10mpg.
5-18
Calling the ten-fold cross-validation routine with the format string
"class\tnum\tnum\tnum\tnum\tnum\tcomment" produces:

                          Classified as:
               10   15   20   25   30   35   40   45
             +----+----+----+----+----+----+----+----+
 actual  10  |  5 |  8 |  0 |  0 |  0 |  0 |  0 |  0 |
         15  |  8 | 63 | 14 |  1 |  0 |  0 |  0 |  0 |
         20  |  0 | 14 | 67 |  8 |  5 |  1 |  1 |  0 |
         25  |  0 |  1 | 13 | 35 | 22 |  6 |  1 |  1 |
         30  |  0 |  1 |  3 | 17 | 21 | 14 |  5 |  2 |
         35  |  0 |  0 |  2 |  7 | 10 | 13 |  5 |  1 |
         40  |  0 |  0 |  1 |  0 |  5 |  5 |  0 |  0 |
         45  |  0 |  0 |  0 |  2 |  1 |  1 |  0 |  2 |
             +----+----+----+----+----+----+----+----+
5-19
Kappa Statistic
At the start of this chapter we mentioned some of the questions we might be interested in
answering about a classifier, including How good is this classifier? We also have been refining
our evaluation methods and looked at 10-fold cross-validation and confusion matrices. In the
example on the previous pages we determined that our classifier for predicting miles per
gallon of selected car models was 53.316% accurate. But does 53.316% mean our classifier is
good or not so good? To answer that question we are going to look at one more statistic, the
Kappa Statistic.
5-20
The Kappa Statistic compares the performance of a classifier to that of a classifier that makes
predictions based solely on chance. To show you how this works I will start with a simpler
example than the mpg one and again return to the women athlete domain. Here are the
results of a classifier in that domain:
                     gymnast   basketball player   marathoner   TOTALS
gymnast                   35                   5           20       60
basketball player          0                  88           12      100
marathoner                 5                   7           28       40
TOTALS                    40                 100           60      200
I also show the totals for the rows and columns. To determine the accuracy we sum the
numbers on the diagonal (35 + 88 + 28 = 151) and divide by the total number of instances
(200) to get 151 / 200 = .755
Now I am going to generate another confusion matrix that will represent the results of a
random classifier (a classifier that makes random predictions). First, we are going to make a
copy of the above table only containing the totals:
                     gymnast   basketball player   marathoner   TOTALS
gymnast                                                              60
basketball player                                                   100
marathoner                                                           40
TOTALS                    40                 100           60      200
Looking at the bottom row, we see that 50% of the time (100 instances out of 200) our
classifier classifies an instance as Basketball Player, 20% of the time (40 instances out of
200) it classifies an instance as gymnast and 30% as marathoner.
5-21
Classifier: gymnast 20%, basketball player 50%, marathoner 30%

So of the 60 real gymnasts, the random classifier will classify 20% of them (12) as gymnasts,
50% (30) as basketball players, and 30% (18) as marathoners:

                     gymnast   basketball player   marathoner   TOTALS
gymnast                   12                  30           18       60
basketball player                                                  100
marathoner                                                          40
TOTALS                    40                 100           60      200
And we will continue in this way. There are 100 real basketball players. The random classifier
will classify 20% of them (or 20) as gymnasts, 50% as basketball players and 30% as
marathoners. And so on:
                     gymnast   basketball player   marathoner   TOTALS
gymnast                   12                  30           18       60
basketball player         20                  50           30      100
marathoner                 8                  20           12       40
TOTALS                    40                 100           60      200
To determine the accuracy of the random method we sum the numbers on the diagonal and
divide by the total number of instances:
P(r) = (12 + 50 + 12) / 200 = 74 / 200 = 0.37
The Kappa Statistic will tell us how much better the real classifier is compared to this random
one. The formula is

    κ = (P(c) - P(r)) / (1 - P(r))

where P(c) is the accuracy of the real classifier and P(r) is the accuracy of the random one. In
this case the accuracy of the real classifier was .755 and that of the random one was .37, so

    κ = (0.755 - 0.37) / (1 - 0.37) = 0.385 / 0.63 ≈ 0.61
How do we interpret that .61? Does that mean our classifier is poor, good, or great? Here is a
chart that will help us interpret that number:
0.01 - 0.20    slightly good performance
0.21 - 0.40    fair performance
0.41 - 0.60    moderate performance
0.61 - 0.80    substantially good performance
0.81 - 1.00    near perfect performance

* after Landis and Koch (1977)
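Here is a small Python sketch of the Kappa computation (the function and matrix layout are
my own choices, not the book's code):

    def kappa(confusion):
        """Kappa Statistic for a confusion matrix given as a list of lists
        with rows = real class and columns = predicted class"""
        n = len(confusion)
        total = sum(sum(row) for row in confusion)
        # accuracy of the real classifier, P(c)
        pc = sum(confusion[i][i] for i in range(n)) / total
        # accuracy of a classifier that guesses according to the column totals, P(r)
        colTotals = [sum(row[j] for row in confusion) for j in range(n)]
        pr = sum((sum(confusion[i]) / total) * (colTotals[i] / total)
                 for i in range(n))
        return (pc - pr) / (1 - pr)

    # the women athlete example above
    print(kappa([[35, 5, 20], [0, 88, 12], [5, 7, 28]]))   # about 0.61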
5-23
Here are the results of another classifier (rows = real class, columns = predicted class):

            cs    ed   eng   psych
cs          50     8    15       7
ed           0    75    12      33
eng          5    12   123      30
psych        5    25    30     170

accuracy = 0.697
5-24
solution
How good is our classifier? Can you compute the Kappa Statistic and
interpret what that statistic means?
First, we sum all the columns:
        cs    ed    eng   psych   TOTAL
SUM     60   120    180     240     600
%      10%   20%    30%     40%    100%
Next, as before, we build the table a random classifier would produce, assigning 10% of each
row to cs, 20% to ed, 30% to eng, and 40% to psych:

            cs    ed   eng   psych   Total
cs           8    16    24      32      80
ed          12    24    36      48     120
eng         17    34    51      68     170
psych       23    46    69      92     230
Total       60   120   180     240     600
5-25
solution continued

P(r) = (8 + 24 + 51 + 92) / 600 = 175 / 600 = 0.292

κ = (P(c) - P(r)) / (1 - P(r)) = (0.697 - 0.292) / (1 - 0.292) = 0.405 / 0.708 ≈ 0.57

A Kappa of about 0.57 indicates moderate performance.
5-26
One problem with the nearest neighbor algorithm occurs when we have outliers. Let me
explain what I mean by that. And let us return, yet again, to the women athlete domain; this
time only looking at gymnasts and marathoners. Suppose we have a particularly short and
lightweight marathoner. In diagram form, this data might be represented as on the next
page, where m indicates marathoner and g, gymnast.
[scatter plot of normalized weight (x-axis) against normalized height (y-axis): a cluster of
gymnasts (g) containing one outlier marathoner (m), the other marathoners off to the side,
and the instance to classify (x) sitting inside the gymnast cluster]
We can see that short lightweight marathoner as the sole m in the group of gs. Suppose x is
an instance we would like to classify. Its nearest neighbor is that outlier m, so it would get
classified as a marathoner. If we just eyeballed the diagram we would say that x is most likely
a gymnast since it appears to be in the group of gymnasts.
kNN
One way to improve our current nearest neighbor approach is instead of looking at one
nearest neighbor we look at a number of nearest neighborsk nearest neighbors or kNN.
Each neighbor will get a vote and the algorithm will predict that the instance will belong to
the class with the highest number of votes. For example, suppose we are using three nearest
neighbors (k = 3). In that case we have 2 votes for gymnast and one for marathoner, so we
would predict x is a gymnast:
[the same scatter plot with a circle drawn around the three nearest neighbors of x: two
gymnasts and the outlier marathoner, so the votes are gymnast, marathoner, gymnast]
5-29
User     Distance   Rating
Sally        5         4
Tara        10
Jade        15
So Sally was closest to Ben and she gave Funky Meters a 4. Because I want the rating of the
closest person to be weighted more heavily in the final value than the other neighbors, the
first step we will do is to convert the distance measure to make it so that the larger the
number the closer that person is. We can do this by computing the inverse of the distance
(that is, 1 over the distance). So the inverse of Sally's distance of 5 is
1 / 5 = 0.2
User     Inverse Distance   Rating
Sally         0.2
Tara          0.1
Jade          0.067
Now I am going to divide each of those inverse distances by the sum of all the inverse
distances. The sum of the inverse distances is 0.2 + 0.1 + 0.067 = 0.367.
User     Influence   Rating
Sally      0.545
Tara       0.272
Jade       0.183
We should notice two things. First, that the sum of the influence values totals 1. The second
thing to notice is that with the original distance numbers Sally was twice as close to Ben as
Tara was, and that is preserved in the final numbers, where Sally has twice the influence that
Tara does. Finally we are going to multiply each person's influence and rating and sum the
results:
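Here is that calculation as a small Python sketch; Sally's distance and rating come from the
text, but the ratings shown for Tara and Jade are hypothetical placeholders:

    # (distance, rating) pairs for the three nearest neighbors
    neighbors = [(5, 4.0),    # Sally
                 (10, 4.5),   # Tara -- hypothetical rating
                 (15, 3.5)]   # Jade -- hypothetical rating

    inverses = [1 / distance for (distance, rating) in neighbors]
    totalInverse = sum(inverses)                           # 0.2 + 0.1 + 0.067 = 0.367
    influences = [inv / totalInverse for inv in inverses]  # 0.545, 0.272, 0.183
    prediction = sum(influence * rating
                     for influence, (distance, rating)
                     in zip(influences, neighbors))
    print(prediction)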
I am wondering how well Sofia will like the jazz pianist Hiromi. What is
the predicted value given the following data using the k nearest
neighbor algorithm with k = 3.?
person      Distance   Rating
Gabriela        4
Ethan           8
Jayden         10
5-31
User       Inverse Distance   Rating
Gabriela      1/4 = 0.25
Ethan         1/8 = 0.125
Jayden        1/10 = 0.1
User       Influence   Rating
Gabriela     0.526
Ethan        0.263
Jayden       0.211
Finally, I multiply the influence by the rating and sum the results:
5-32
Astonishingly, over 30% of Pima people develop diabetes. In contrast, the diabetes rate in
the United States is 8.3% and in China it is 4.2%.
Each instance in the dataset represents information about a Pima woman over the age of 21
and belongs to one of two classes: a person who developed diabetes within five years, or a
person who did not. There are eight attributes:
5-33
attributes:
1. Number of times pregnant
2. Plasma glucose concentration
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
Here is an example of the data (the last column represents the class0=no diabetes;
1=diabetes):
preg   glucose   bp   skin   insulin   bmi    pedigree   age   class
 2        99     52    15       94     24.6     0.637     21
          83     58    31       18     34.3     0.336     25
         139     80    35      160     31.6     0.361     25
         170     64    37      225     34.5     0.356     30
So, for example, the first woman has had 2 children, has
a plasma glucose concentration of 99, a diastolic blood
pressure of 52 and so on.
5-34
code it - part 1
There are two files on our website. pimaSmall.zip is a zip file containing 100
instances of the data divided into 10 files (buckets). pima.zip is a zip file
containing 393 instances. When I used the pimaSmall data with the nearest
neighbor classifier we built in the previous chapter using 10-fold cross-validation I got these results:
         Classified as:
            0    1
         +----+----+
      0  | 45 | 14 |
      1  | 27 | 14 |
         +----+----+
Hint: The Python function heapq.nsmallest(n, list) will return a list with the n smallest items.
5-35
code it - answer
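One possible sketch of such a method, assuming the Classifier class stores its training
instances as (class, vector, ignore) tuples in self.data and has a manhattan distance method
(as in the previous chapter's code), is:

    import heapq

    def knn(self, itemVector, k=3):
        """return the majority class among the k nearest neighbors of itemVector"""
        # build (distance, class) pairs for every training instance
        neighbors = [(self.manhattan(itemVector, item[1]), item[0])
                     for item in self.data]
        # heapq.nsmallest returns the k pairs with the smallest distances
        nearest = heapq.nsmallest(k, neighbors)
        # count the votes for each class
        votes = {}
        for (distance, theClass) in nearest:
            votes.setdefault(theClass, 0)
            votes[theClass] += 1
        # return the class with the most votes
        return max(votes.items(), key=lambda pair: pair[1])[0]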
5-36
code it - part 2
5-37
code it - results!

           pimaSmall     pima
k = 1       59.00%      71.247%
k = 3       61.00%      72.519%
5-38
Hmm. 72.519% seems like pretty good accuracy but is it? Compute the
Kappa Statistic to find out:
                no diabetes   diabetes
no diabetes          219         44
diabetes              64         66
Performance:
slightly good
fair
moderate
substantially good
near perfect
5-39
                no diabetes   diabetes   TOTAL
no diabetes          219         44        263
diabetes              64         66        130
TOTAL                283        110        393
ratio             0.7201     0.2799
                no diabetes   diabetes
no diabetes        189.39       73.61
diabetes            93.61       36.39

p(r) = (189.39 + 36.39) / 393 = 0.5745
5-40
accuracy: p(c) = (219 + 66) / 393 = 0.725, so κ = (0.725 - 0.5745) / (1 - 0.5745) ≈ 0.354, which is only fair performance.
More data
5-41
This isn't to say that you shouldn't pick the best algorithm for the job. As we have already
seen picking a good algorithm makes a significant difference. However, if you are trying to
solve a practical problem (rather than publish a research paper) it might not be worth your
while to spend a lot of time researching and tweaking algorithms. You will perhaps get more
bang for your buck, or a better return on your time, if you concentrate on getting more data.
With that nod toward the importance of data, I will continue my path of introducing new
algorithms.
Classifiers are used for:
recommending items at Amazon
assessing consumer credit risk
classifying land cover using image analysis
recognizing faces
classifying the gender of people in images
recommending web pages
recommending vacation packages
5-42
Naïve Bayes
Let us return yet again to our women athlete example. Suppose I ask you what sport Brittney
Griner participates in (gymnastics, marathon running, or basketball) and I tell you she is 6
foot 8 inches and weighs 207 pounds. I imagine you would say basketball and if I ask you
how confident you feel about your decision I imagine you would say something along the
lines of pretty darn confident.
Now I ask you what sport Heather Zurich (pictured
on the right) plays. She is 6 foot 1 and weighs 176
pounds. Here I am less certain how you will answer.
You might say basketball and I ask you how
confident you are about your prediction. You
probably are less confident than you were about your
prediction for Brittney Griner. She could be a tall
marathon runner.
Finally, I ask you about what sport Yumiko Hara
participates in; she is 5 foot 4 inches tall and weighs
95 pounds. Let's say you say gymnastics and I ask
how confident you feel about your decision. You will
probably say something along the lines of not too
confident. A number of marathon runners have
similar heights and weights.
With the nearest neighbor algorithms, it is difficult to
quantify confidence about a classification. With
classification methods based on probability, Bayesian
methods, we can not only make a classification but we can make probabilistic
classifications: this athlete is 80% likely to be a basketball player, this patient has a 40%
chance of getting diabetes in the next five years, the probability of rain in Las Cruces in the
next 24 hours is 10%.
The ability to make probabilistic classifications, and the fact that they are eager learners
are two advantages of Bayesian methods.
6-2
Probability
I am assuming you have some basic knowledge of probability. I flip a coin; what is the
probability of it being heads? I roll a six-sided fair die; what is the probability that I roll a '1'?
That sort of thing. I tell you I picked a random 19 year old and have you tell me the probability
of that person being female and, without doing any research, you say 50%. These are
examples of what is called prior probability, denoted P(h), the probability of
hypothesis h.
So for a coin:
P(heads) = 0.5
For a six-sided die, the probability of rolling a 1:
P(1) = 1/6
If I have an equal number of 19 year old males and
females
P(female) = .5
Suppose I give you some additional information about that 19 year old: the person is a student
at the Frank Lloyd Wright School of Architecture in Arizona. You do a quick Google search,
see that the student body is 86% female and revise your estimate of the likelihood of the
person being female to 86%.
This we denote as P(h|D), the probability of the hypothesis h given some data D. For
example:
6-3
The formula is
P(A | B) = P(A ∩ B) / P(B)
An example.
In the following table I list some people and the types of laptops and phones they have:
name       laptop   phone
Kate       PC       Android
Tom        PC       Android
Harry      PC       Android
Annika     Mac      iPhone
Naomi      Mac      Android
Joe        Mac      iPhone
Chakotay   Mac      iPhone
Neelix     Mac      Android
Kes        PC       iPhone
B'Elanna   Mac      iPhone
P(iPhone) = 5 / 10 = 0.5

P(iPhone | mac) = P(mac ∩ iPhone) / P(mac)

P(mac ∩ iPhone) = 4 / 10 = 0.4

P(mac) = 6 / 10 = 0.6
So the probability that some person uses an iPhone given that person uses a Mac is

P(iPhone | mac) = 0.4 / 0.6 = 0.667

That is the formal definition of posterior probability. Sometimes when we implement this we
just use raw counts:

P(iPhone | mac) = 4 / 6 = 0.667
tip
If you feel you need practice with basic probabilities please see the links to
tutorials at guidetodatamining.com.
6-5
P(mac | iPhone) = P(iPhone ∩ mac) / P(iPhone) = 0.4 / 0.5 = 0.8
Some terms:
P(h), the probability that some hypothesis h is true, is called the prior probability of h.
Before we have any evidence, the probability of a person owning a Mac is 0.6 (the evidence
might be knowing that the person also owns an iPhone).
P(h|d) is called the posterior probability of h. After we observe some data d what is the
probability of h? For example, after we observe that a person owns an iPhone, what is the
probability of that same person owning a Mac? It is also called conditional probability.
In our quest to build a Bayesian Classifier we will need two additional probabilities, P(D) and
P(D|h). To explain these consider the following example.
6-6
P(D) is the probability that some training data will be observed. For example, looking on the
next page we see that the probability that the zip code will be 88005 is 5/10 or 0.5.
P(88005) = 0.5
P(D|h) is the probability that some data value holds given the hypothesis. For example, the
probability of the zip code being 88005 given that the person bought Sencha Green Tea, or
P(88005|Sencha Tea).
6-7
Customer ID   Zipcode   bought organic produce?   bought Sencha green tea?
     1         88005            Yes                        Yes
     2         88001            No                         No
     3         88001            Yes                        Yes
     4         88005            No                         No
     5         88003            Yes                        No
     6         88005            No                         Yes
     7         88005            No                         No
     8         88001            No                         No
     9         88005            Yes                        Yes
    10         88003            Yes                        Yes
In this case we are looking at all the instances where the person bought Sencha tea. There
are 5 such instances. Of those, 3 are with the 88005 zip code.

P(88005 | SenchaTea) = 3 / 5 = 0.6
What's the probability of the zip code being 88005 given that the person did not buy Sencha tea?
6-8
What's the probability of the zip code being 88005 given that the person did not buy Sencha tea?
There are 5 occurrences of a person not buying Sencha tea. Of those, 2 lived in
the 88005 zip code. So
P(88005 | ¬SenchaTea) = 2 / 5 = 0.4

That ¬ symbol means not.
This is key to understanding the rest of the chapter so let us practice just a bit
more.
1. What is the probability of a person being in the 88001 zipcode (without
knowing anything else)?
2. What is the probability of a person being in the 88001 zipcode knowing that
they bought Sencha tea?
3. What is the probability of a person being in the 88001 zipcode knowing that
they did not buy Sencha tea?
6-9
This is key to understanding the rest of the chapter so let us practice just a bit
more.
1. What is the probability of a person being in the 88001 zipcode (without
knowing anything else)?
There are 10 total entries in our database and only 3 of them are from
88001 so P(88001) is 0.3
2. What is the probability of a person being in the 88001 zipcode knowing that
they bought Sencha tea?
There are 5 instances of buying Sencha tea and only 1 of them is from the
88001 zipcode so
P(88001 | SenchaTea) = 1 / 5 = 0.2
3. What is the probability of a person being in the 88001 zipcode knowing that
they did not buy Sencha tea?
There are 5 instances of not buying Sencha tea and 2 of them are from the
88001 zipcode:
P(88001 | ¬SenchaTea) = 2 / 5 = 0.4
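As a small Python sketch of this kind of counting (the tuples below transcribe the zip code and
Sencha tea columns of the customer table above; the variable names are mine):

    # (zipcode, bought Sencha tea?) pairs from the customer table
    customers = [('88005', 'Yes'), ('88001', 'No'), ('88001', 'Yes'),
                 ('88005', 'No'),  ('88003', 'No'), ('88005', 'Yes'),
                 ('88005', 'No'),  ('88001', 'No'), ('88005', 'Yes'),
                 ('88003', 'Yes')]

    # P(88001) -- prior probability
    p88001 = sum(1 for (zipc, tea) in customers if zipc == '88001') / len(customers)

    # P(88001 | SenchaTea) -- conditional probability
    teaBuyers = [zipc for (zipc, tea) in customers if tea == 'Yes']
    p88001givenTea = teaBuyers.count('88001') / len(teaBuyers)

    print(p88001, p88001givenTea)    # 0.3 and 0.2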
Bayes Theorem
Bayes Theorem describes the relationship between P(h), P(h|D), P(D), and P(D|h):
P(h | D) = P(D | h)P(h) / P(D)
This theorem is the cornerstone of all Bayesian methods. Usually in data mining we use this
theorem to decide among alternative hypotheses. Given the evidence, is the person a
gymnast, marathoner, or basketball player? Given the evidence, will this person buy Sencha
tea or not? To decide among alternatives we compute the probability for each hypothesis. For
example,
We want to display an ad for Sencha Tea on our smart shopping cart display only
if we think that person is likely to buy the tea. We know that person lives in the
88005 zipcode.
There are two competing hypotheses:

    The person will buy Sencha tea.        We compute P(buySenchaTea|88005)
    The person will not buy Sencha tea.    We compute P(¬buySenchaTea|88005)

We pick the hypothesis with the highest probability! So if P(buySenchaTea|88005) = 0.6 and
P(¬buySenchaTea|88005) = 0.4, it is more likely that the person will buy the tea, so we will
display the ad.
6-11
Suppose we work for an electronics store and we have three sales flyers in email form. One
flyer features a laptop, another features a desktop and the final flyer a tablet. Based on what
we know about each customer we will email that customer the flyer that will most likely
generate a sale. For example, I may know that a customer lives in the 88005 zipcode, that
she has a college age daughter living at home, and that she goes to yoga class. Should I send
her the flyer with the laptop, desktop, or tablet?
P(laptop | D) = P(D | laptop)P(laptop) / P(D)

P(desktop | D) = P(D | desktop)P(desktop) / P(D)

P(tablet | D) = P(D | tablet)P(tablet) / P(D)
6-12
P(h1 | D) = P(D | h1)P(h1) / P(D)

P(h2 | D) = P(D | h2)P(h2) / P(D)

...

P(hn | D) = P(D | hn)P(hn) / P(D)
Once we compute all these probabilities, we will pick the hypothesis with the highest
probability. This is called the maximum a posteriori hypothesis, or hMAP.
6-13
We can translate that English description of calculating the maximum a posteriori hypothesis
into the following formula:

    hMAP = argmax over h in H of P(h | D) = argmax over h in H of P(D | h)P(h) / P(D)
You might notice that for all these calculations, the denominators are identical: P(D). Thus,
they are independent of the hypotheses. If a specific hypothesis has the max probability with
the formula used above, it will still be the largest if we did not divide all the hypotheses by
P(D). If our goal is to find the most likely hypothesis, we can simplify our calculations:
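That is, hMAP = argmax over h in H of P(D | h)P(h). As a tiny Python sketch of that decision
rule (the function is mine; the numbers are the Sencha tea example's, with P(buy) = P(¬buy) = 5/10):

    def hMAP(priors, likelihoods):
        """pick the hypothesis h that maximizes P(D|h) * P(h)"""
        return max(priors, key=lambda h: likelihoods[h] * priors[h])

    print(hMAP({'buy': 0.5, 'notBuy': 0.5},    # P(h)
               {'buy': 0.6, 'notBuy': 0.4}))   # P(88005|h)  -->  'buy'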
6-14
Our hypotheses:
The patient has the particular cancer
The patient does not have that particular
cancer.
Let's translate what I wrote above into probability notation. Please match
up the English statements below with their associated notations and write in the
probabilities. If there is no English statement matching a probability, please write one.

We know that only 0.8% of the people in the U.S. have this form of cancer.
When the disease is present the test returns a correct POS result 98% of the time.

P(POS|cancer) = _______       P(cancer) = _______       P(NEG|cancer) = _______
P(POS|¬cancer) = _______      P(¬cancer) = _______      P(NEG|¬cancer) = _______
6-15
P(POS|cancer) = 0.98       P(cancer) = 0.008       P(NEG|cancer) = 0.02
P(POS|¬cancer) = 0.03      P(¬cancer) = 0.992      P(NEG|¬cancer) = 0.97
6-17
P(cancer | POS) = 0.0078 / (0.0078 + 0.0298) = 0.21
6-18
Here is why the results seem so counterintuitive. Most people see the statistic that 98% of the
people who have this particular cancer will have a positive test result and also conclude that
98% of the people who have a positive test result have this particular cancer. This fails to take
into account that this cancer affects only 0.8% of the population. Let's say we give the test to
everyone in a city of 1 million people. That means that 8,000 people have cancer and 992,000
do not. First, let's consider giving the test to the 8,000 people with cancer. We know that 98%
of the time when we give the test to people with cancer the test correctly returns a positive
result. So 7,840 people have a correct positive result and 160 of those people with cancer have
an incorrect negative result. Now let's turn to the 992,000 people without cancer. When we
give the test to them, 97% of the time we get a correct negative result, so (992,000 * 0.97) or
962,240 of them have a correct negative result and 30,000 have an incorrect positive result. I
have summarized these results on the following page.

Gigerenzer, G. and Hoffrage, U. (1995). "How to improve Bayesian reasoning without
instruction: Frequency formats." Psychological Review 102: 684-704.

Eddy, David M. (1982). "Probabilistic reasoning in clinical medicine: Problems and
opportunities." In D. Kahneman, P. Slovic, and A. Tversky, eds., Judgement under
uncertainty: Heuristics and biases. Cambridge University Press, Cambridge, UK.
6-19
             POS test result   NEG test result
cancer               7,840               160
no cancer           30,000           962,240
Now, consider Ann getting a positive test result and the data in the positive test result
column. 30,000 of the people with a positive test result had no cancer while only 7,840 of
them had cancer. So it seems probable that Ann does not have cancer.
Let's return to the Sencha tea example and the customer table shown earlier. Our two
hypotheses are

P(h1|D) = P(buySenchaTea|88005)

and

P(h2|D) = P(¬buySenchaTea|88005)

where

P(h | D) = P(D | h)P(h) / P(D)
6-20
You may wonder why we would use Bayes Theorem and compute

P(buySenchaTea | 88005) = P(88005 | buySenchaTea)P(buySenchaTea) / P(88005)

when we can just as easily compute P(buySenchaTea|88005) directly from the data in the
table. In this simple case you would be correct, but for many real world problems it is very
difficult to compute P(h|D) directly.
Consider the previous medical example where we were interested in determining whether a
person had cancer or not given that a certain test returned a positive result. Quantities such as
P(POS|cancer), P(cancer), and P(¬cancer) = 1 - P(cancer) are relatively easy to obtain.
However, computing P(cancer|POS) directly would be significantly more challenging. This is
asking us to determine the probability that when we give the test to a random average person
in the entire population and the test result is POS, then that person has cancer. To do this we
want a representative sample of the population, but since only 0.8% of people have cancer a
sample size of 1,000 people would only have 8 people with cancer, far too few to feel that our
counts are representative of the population as a whole. We would need an extremely large
sample size. Bayes Theorem provides a strategy for computing P(h|D) when it is hard to do
so directly.
6-21
Naïve Bayes

Most of the time we have more evidence than just a single piece of data. In the Sencha tea
example we had two types of evidence: zip code and whether the person purchased organic
food. To compute the probability of a hypothesis given all the evidence, we simply multiply
the individual probabilities. In this example, for each hypothesis we would multiply the
probability of the 88005 zip code given that hypothesis, the probability of buying organic food
given that hypothesis, and the prior probability of the hypothesis itself, and then pick the
hypothesis with the larger product.
6-22
Here's how Stephen Baker describes the smart shopping cart technology:
here's what shopping with one of these carts might feel like. You grab a cart on
the way in and swipe your loyalty card. The welcome screen pops up with a
shopping list. It's based on patterns of your last purchases. Milk, eggs, zucchini,
whatever. Smart systems might provide you with the quickest route to each item.
Or perhaps they'll allow you to edit the list, to tell it, for example, never to
promote cauliflower or salted peanuts again. This is simple stuff. But according to
Accenture's studies, shoppers forget an average of 11 percent of the items they
intend to buy. If stores can effectively remind us of what we want, it means fewer
midnight runs to the convenience store for us and more sales for them.
Baker. 2008. P49.
The Numerati
I've mentioned this book by Stephen Baker several times. I highly
encourage you to read this book. The paperback is only $10 and it
is a good late night read.
6-23
i100 i500
Let's say we are trying to help iHealth, a
company that sells wearable exercise
monitors that compete with the Nike Fuel
and the Fitbit Flex. iHealth sells two models
that increase in functionality: the i100 and
the i500:
iHealth100:
heart rate, GPS (to compute miles per
hour, etc), wifi to automatically connect
to iHealth website to upload data.
iHealth500:
i100 features + pulse oximetry (oxygen
in blood) + free 3G connection to
iHealth website
They sell these online and they hired us to come up with a recommendation system for their
customers. To get data to build our system when someone buys a monitor, we ask them to
fill out the questionnaire. Each question in the questionnaire relates to an attribute. First, we
ask them what their main reason is for starting an exercise program and have them select
among three options: health, appearance or both. We ask them what their current exercise
level is: sedentary, moderate, or active. We ask them how motivated they are: moderate or
aggressive. And finally we ask them if they are comfortable with using technological devices.
Our results are as follows.
6-24
Main Interest   Current Exercise Level   How Motivated   Comfortable with tech. devices?   Model #
both            sedentary                moderate        yes                               i100
both            sedentary                moderate        no                                i100
health          sedentary                moderate        yes                               i500
appearance      active                   moderate        yes                               i500
appearance      moderate                 aggressive      yes                               i500
appearance      moderate                 aggressive      no                                i100
health          moderate                 aggressive      no                                i500
both            active                   moderate        yes                               i100
both            moderate                 aggressive      yes                               i500
appearance      active                   aggressive      yes                               i500
both            active                   aggressive      no                                i500
health          active                   moderate        no                                i500
health          sedentary                aggressive      yes                               i500
appearance      active                   moderate        no                                i100
health          sedentary                moderate        no                                i100
Using the naïve Bayes method, which model would you recommend to a person whose
main interest is health
current exercise level is moderate
is moderately motivated
and is comfortable with technological devices
Turn the page if you need a hint!
6-25
First we compute P(i100 | health, moderateExercise, moderateMotivation, techComfortable)
and P(i500 | health, moderateExercise, moderateMotivation, techComfortable). To do that we
need, among other things:

P(moderateExercise|i100) =
P(moderateMotivated|i100) =
P(techComfortable|i100) =
P(i100) = 6 / 15

That was my clue. Now hopefully you can figure out the example.
6-26
First we compute
P(i100 | health, moderateExercise, moderateMotivation, techComfortable)
P(health|i100) = 1/6
P(moderateExercise|i100) = 1/6
P(moderateMotivated|i100) = 5/6
P(techComfortable|i100) = 2/6
P(i100) = 6 / 15
so
P(i100| evidence) = .167 * .167 * .833 * .333 * .4 = .00309
Now we compute
P(i500 | health, moderateExercise, moderateMotivation, techComfortable)
P(health|i500) = 4/9
P(moderateExercise|i500) = 3/9
P(moderateMotivated|i500) = 3/9
P(techComfortable|i500) = 6/9
P(i500) = 9 / 15
P(i500| evidence) = .444 * .333 * .333 * .667 * .6 = .01975
6-27
Doing it in Python
Great! Now that we understand how a Naïve Bayes Classifier works let us consider how to
implement it in Python. The format of the data files will be the same as that in the previous
chapter, a text file where each line consists of tab-separated values. For our iHealth example,
the data file would look like the following:
The columns are: main interest, current exercise level, how motivated, comfortable with tech
devices?, and which model.

both        sedentary   moderate     yes   i100
both        sedentary   moderate     no    i100
health      sedentary   moderate     yes   i500
appearance  active      moderate     yes   i500
appearance  moderate    aggressive   yes   i500
appearance  moderate    aggressive   no    i100
health      moderate    aggressive   no    i500
both        active      moderate     yes   i100
both        moderate    aggressive   yes   i500
appearance  active      aggressive   yes   i500
both        active      aggressive   no    i500
health      active      moderate     no    i500
health      sedentary   aggressive   yes   i500
appearance  active      moderate     no    i100
health      sedentary   moderate     no    i100
Shortly we will be using an example with substantially more data and I would like to keep the
ten-fold cross validation methods we used in code from the previous chapter. Recall that that
method involved dividing the data into ten buckets (files). We would train on nine of them
and test the classifier on the remaining bucket. And we would repeat this ten times; each
time withholding a different bucket for testing. The simple iHealth example, with only 15
instances, was designed so we could work through the Naïve Bayes Classifier method by
hand. With only 15 instances it seems silly to divide them into 10 buckets. The ad hoc, not
very elegant solution we will use, is to have ten buckets but all the 15 instances will be in one
bucket and the rest of the buckets will be empty.
6-28
The Naïve Bayes Classifier code consists of two components, one for training and one for
classifying.
Training
The output of training needs to be:
a set of prior probabilities, for example, P(i100) = 0.4

a set of conditional probabilities, for example, P(health|i100) = 0.167
I am going to represent the set of prior probabilities as a Python dictionary (hash table):
self.prior = {'i500': 0.6, 'i100': 0.4}
The conditional probabilities are a bit more complex. My way of doing this (and there are
probably better methods) is to associate a set of conditional probabilities with each class.
{'i500': {1: {'appearance': 0.3333333333333, 'health': 0.4444444444444,
'both': 0.2222222222222},
2: {'sedentary': 0.2222222222222, 'moderate': 0.333333333333,
'active': 0.4444444444444444},
3: {'moderate': 0.333333333333, 'aggressive': 0.66666666666},
4: {'no': 0.3333333333333333, 'yes': 0.6666666666666666}},
'i100': {1: {'appearance': 0.333333333333, 'health': 0.1666666666666,
'both': 0.5},
2: {'sedentary': 0.5, 'moderate': 0.16666666666666,
'active': 0.3333333333333},
3: {'moderate': 0.83333333334, 'aggressive': 0.166666666666},
4: {'no': 0.6666666666666, 'yes': 0.3333333333333}}}
The 1, 2, 3, 4 represent column numbers. So the first line above says that the probability that
the value of the first column is appearance, given that the device is an i500, is 0.333.
6-29
The first step in computing these probabilities is simply to count things. Here are the first
few lines of the input file:
both        sedentary   moderate   yes   i100
both        sedentary   moderate   no    i100
health      sedentary   moderate   yes   i500
appearance  active      moderate   yes   i500
Counting things
After processing the first line, the classes dictionary is {'i100': 1}; at the end of processing all
of the data it is {'i500': 9, 'i100': 6}.

Prior probability
To obtain the prior probabilities I simply divide those numbers by the total number of
instances.
Conditional probability
To determine the conditional probabilities I am going to count the occurrences of attribute
values in the different columns in a dictionary called counts, and I am going to maintain
separate counts for each class. So, in processing the string both in the first line, counts will be:

{'i100': {1: {'both': 1}}}

and at the end of processing the data, the value of counts will be
{'i100': {1: {'appearance': 2, 'health': 1, 'both': 3},
          2: {'sedentary': 3, 'moderate': 1, 'active': 2},
          3: {'moderate': 5, 'aggressive': 1},
          4: {'no': 4, 'yes': 2}},
 'i500': {1: {'appearance': 3, 'health': 4, 'both': 2},
          2: {'sedentary': 2, 'moderate': 3, 'active': 4},
          3: {'moderate': 3, 'aggressive': 6},
          4: {'no': 3, 'yes': 6}}}
So, in the first column of the i100 instances there were 2 occurrences of appearance, 1 of
health and 3 of both. To obtain the conditional probabilities we divide those numbers by
the total number of instances of that class. For example, there are 6 instances of i100 and 2 of
them had a value of appearance for the first column, so
P(appearance|i100) = 2/6 = .333
With that background here is the Python code for training the classifier (remember, you can
download this code at guidetodatamining.com).
# _____________________________________________________________________
class BayesClassifier:
    def __init__(self, bucketPrefix, testBucketNumber, dataFormat):
        """ a classifier will be built from files with the bucketPrefix
        excluding the file with testBucketNumber. dataFormat is a
        string that describes how to interpret each line of the data
        files. For example, for the iHealth data the format is:
        "attr\tattr\tattr\tattr\tclass"
        """
        total = 0
        classes = {}
        counts = {}
        # reading the data in from the file
        self.format = dataFormat.strip().split('\t')
        self.prior = {}
        self.conditional = {}
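        # A minimal sketch of the counting step (my own reconstruction, modeled on the
        # numeric-attribute version of __init__ shown later in this chapter):
        # for each of the buckets numbered 1 through 10:
        for i in range(1, 11):
            # if it is not the bucket we should ignore, read in the data
            if i != testBucketNumber:
                filename = "%s-%02i" % (bucketPrefix, i)
                f = open(filename)
                lines = f.readlines()
                f.close()
                for line in lines:
                    fields = line.strip().split('\t')
                    vector = []
                    for j in range(len(fields)):
                        if self.format[j] == 'attr':
                            vector.append(fields[j])
                        elif self.format[j] == 'class':
                            category = fields[j]
                    # count this instance
                    total += 1
                    classes.setdefault(category, 0)
                    classes[category] += 1
                    counts.setdefault(category, {})
                    # count each attribute value, column by column, for this class
                    col = 0
                    for columnValue in vector:
                        col += 1
                        counts[category].setdefault(col, {})
                        counts[category][col].setdefault(columnValue, 0)
                        counts[category][col][columnValue] += 1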
        #
        # ok done counting. now compute probabilities
        #
        # first prior probabilities p(h)
        #
        for (category, count) in classes.items():
            self.prior[category] = count / total
        #
        # now compute conditional probabilities p(D|h)
        #
        for (category, columns) in counts.items():
            self.conditional.setdefault(category, {})
            for (col, valueCounts) in columns.items():
                self.conditional[category].setdefault(col, {})
                for (attrValue, count) in valueCounts.items():
                    self.conditional[category][col][attrValue] = (
                        count / classes[category])
Classifying
Okay, we have trained the classifier. Now we want to classify various instances. For example,
which model should we recommend for someone whose primary interest is health, and who
is moderately active, moderately motivated, and is comfortable with technology:
6-33
When we did this by hand we computed the probability of each hypothesis given the
evidence, and we simply translate that method to code:
    def classify(self, itemVector):
        """Return class we think item Vector is in"""
        results = []
        for (category, prior) in self.prior.items():
            prob = prior
            col = 1
            for attrValue in itemVector:
                if not attrValue in self.conditional[category][col]:
                    # we did not find any instances of this attribute value
                    # occurring with this category so prob = 0
                    prob = 0
                else:
                    prob = prob * self.conditional[category][col][attrValue]
                col += 1
            results.append((prob, category))
        # return the category with the highest probability
        return(max(results)[1])
And when I try the code I get the same results we received when we did this by hand:
>>> c = BayesClassifier("iHealth/i", 10, "attr\tattr\tattr\tattr\tclass")
>>> print(c.classify(['health', 'moderate', 'moderate', 'yes']))
i500
6-34
Attribute Information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. el-salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-nicaraguan-contras: 2 (y,n)
10. mx-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)
[sample lines from the data file: each instance is the class, democrat or republican, followed
by y/n values for the 16 votes]
6-35
Our Naïve Bayes Classifier works fine with this example (the format string says that the first
column is to be interpreted as the class of the instance and the rest of the columns are to be
interpreted as attributes):

format = "class\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr"

tenfold("house-votes/hv", format)
             Classified as:
             democrat   republican
            +---------+-----------+
democrat    |    111  |       13  |
            |---------+-----------|
republican  |      9  |       99  |
            +---------+-----------+

90.517 percent correct
total of 232 instances
Thats great!
6-36
% voting yes:

              CISPA   Reader Privacy Act   Internet Sales Tax   Internet Snooping Bill
Republican     0.99          0.01                  0.99                   0.5
Democrat       0.01          0.99                  0.01                   1.0
This table shows that 99% of Republicans in the sample voted for the CISPA (Cyber
Intelligence Sharing and Protection Act), only 1% voted for the Reader Privacy Act, 99%
voted for Internet Sales Tax and 50% voted for the Internet Snooping Bill. (I made up these
numbers and they do not reflect reality.) We pick a U.S. Representative who wasn't in our
sample, Representative X, who we would like to classify as either a Democrat or Republican.
I added how that representative voted to our table:
              CISPA   Reader Privacy Act   Internet Sales Tax   Internet Snooping Bill
Republican     0.99          0.01                  0.99                   0.5
Democrat       0.01          0.99                  0.01                   1.0
Rep. X          no            yes                   no                     no

Do you think the person is a Democrat or Republican?
6-37
I would guess Democrat. Let us work through the example step-by-step using Naïve Bayes.
The prior probabilities of P(Democrat) and P(Republican) are both 0.5 since there are 100
Republicans and 100 Democrats in the sample. We know that Representative X voted no to
CISPA and we also know
P(Republican|C=no) = 0.01
where C = CISPA. And with that bit of evidence our current P(h|D) probabilities are
h=
p(h)
P(C=no|h)
P(h|D)
Republican
0.5
0.01
0.005
Democrat
0.5
0.99
0.495
Factoring in Representative X's yes vote to the Reader Privacy Act and X's no to the sales
tax bill we get:
h=
p(h)
P(C=no|h)
P(R=yes|h)
P(T=no|h)
P(h|D)
Republican
0.5
0.01
0.01
0.01
0.0000005
Democrat
0.5
0.99
0.99
0.99
0.485
P(Democrat | D) = 0.485 / (0.485 + 0.0000005) = 0.485 / 0.4850005 = 0.99999
6-38
h=
p(h)
P(C=no|h)
P(R=yes|h)
P(T=no|h)
P(S=no|h)
P(h|D)
Republican
0.5
0.01
0.01
0.01
0.50
2.5E-07
Democrat
0.5
0.99
0.99
0.99
0.00
Whoops. We went from 99.99% likelihood that X was a Democrat to 0%. This is so because
we had 0 occurrences of a Democrat voting no for the snooping bill. Based on these
probabilities we predict that person X is a Republican. This goes against our intuition!
Estimating Probabilities
The probabilities in Naïve Bayes are really estimates of the true probabilities. True
probabilities are those obtained from the entire population. For example, if we could give a
cancer test to everyone in the entire population, we could, for example, get the true
probability of the test returning a negative result given that the person does not have cancer.
However, giving the test to everyone is near impossible. We can estimate that probability by
selecting a random representative sample of the population, say 1,000 people, giving the test
to them, and computing the probabilities. Most of the time this gives us a very good estimate
of the true probabilities, but when the true probabilities are very small, these estimates are
likely to be poor. Here is an example. Suppose the true probability of a Democrat voting no to
the Internet Snooping Bill is 0.03; that is, P(S=no|Democrat) = 0.03.
Brain Calisthenics
6-39
Brain Calisthenics (answer)

So based on the sample, P(S=no|Democrat) = 0.
As we just saw in the previous example, when a probability is 0 it dominates the Naïve Bayes
calculation; it doesn't matter what the other values are. Another problem is that
probabilities based on a sample produce a biased underestimate of the true probability.
Fixing this.
If we are trying to calculate something like P(S=no|Democrat), our calculation has been

P(S=no|Democrat) = (number who are both Democrats and voted no on the snooping bill) / (total number of Democrats)

or, more generally,

P(x | y) = nc / n
6-40
The problem we have is when nc equals zero. We can eliminate this problem by changing the
formula to:
P(x | y) = (nc + mp) / (n + m)
This formula is from p. 179 of the book Machine Learning by Tom Mitchell.
Republican Vote
         CISPA   Reader Privacy Act   Internet Sales Tax   Internet Snooping Bill
Yes        99             1                   99                    50
No          1            99                    1                    50

Democratic Vote
         CISPA   Reader Privacy Act   Internet Sales Tax   Internet Snooping Bill
Yes         1            99                    1                   100
No         99             1                   99                     0
6-41
The person we are trying to classify voted no to CISPA. First we compute the probability that
hes a Republican given that vote. Our new formula is
P(x | y) = (nc + mp) / (n + m)
n is the number of Republicans, which is 100, and nc is the number of Republicans who voted
no to CISPA, which is 1. m is the number of values for the attribute how they voted on
CISPA, which is 2 (yes or no), and p is the prior estimate of that probability, 1/2 = 0.5. So
plugging those numbers into our formula:

P(cispa = no | republican) = (1 + 2(0.5)) / (100 + 2) = 2 / 102 = 0.01961
We follow the same procedure for a person voting no to CISPA given they are a Democrat.
P(cispa = no | democrat) = (99 + 2(0.5)) / (100 + 2) = 100 / 102 = 0.9804
h=
p(h)
P(C=no|h)
P(h|D)
Republican
0.5
0.01961
0.0098
Democrat
0.5
0.9804
0.4902
Factoring in Representative X's yes vote to the Reader Privacy Act and X's no to the sales
tax bill works the same way.

Finish this problem and classify the individual as either a Republican or Democrat.
Recall, he voted no to CISPA, yes to the Reader Privacy Act, and no both to the sales tax
and snooping bills.
6-42
Finish this problem and classify the individual as either a Republican or Democrat.
Recall, he voted no to CISPA, yes to the Reader Privacy act, and no both to the Internet
sales tax and snooping bills.
The calculations for the next 2 columns mirror those we did for the CISPA vote. The
probability that this person voted no to the snooping bill given that he is a Republican is
P(s = no | republican) = (50 + 2(0.5)) / (100 + 2) = 51 / 102 = 0.5

P(s = no | democrat) = (0 + 2(0.5)) / (100 + 2) = 1 / 102 = 0.0098
h =           p(h)    P(C=no|h)   P(R=yes|h)   P(I=no|h)   P(S=no|h)   P(h|D)
Republican     0.5     0.01961      0.01961      0.01961      0.5      0.000002
Democrat       0.5     0.9804       0.9804       0.9804       0.0098   0.004617
So unlike the previous approach we would classify this individual as a Democrat. This
matches our intuitions.
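As a one-function Python sketch of this smoothed estimate (the function name is mine):

    def mEstimate(nc, n, m, p):
        """probability estimate (nc + m*p) / (n + m)"""
        return (nc + m * p) / (n + m)

    print(mEstimate(99, 100, 2, 0.5))   # P(cispa = no | democrat) = 0.9804
    print(mEstimate(0, 100, 2, 0.5))    # P(s = no | democrat)     = 0.0098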
6-43
A clarification
For this example, the value of m was 2 for all calculations. However, it is not the case that m
remains necessarily constant across attributes. Consider the health monitor example
discussed earlier in the chapter. The attributes for that example included:
survey
What is your main interest in getting a monitor?  health / appearance / both
What is your current exercise level?  sedentary / moderate / active
Are you comfortable with technological devices?  yes / no
Let us say the number of the people surveyed who own the i500 monitor is 100 (this is n).
The number of people who own an i500 and are sedentary is 0 (nc). So, the probability of
someone being sedentary given they own an i500 is
P(sedentary | i500) =
6-44
P(sedentary | i500) = (nc + mp) / (n + m) = (0 + 3(0.333)) / (100 + 3) = 1 / 103 = 0.0097
Numbers
You probably noticed that I changed from the numerical data I used in all the nearest
neighbor approaches to categorical data for the naïve Bayes formula. By
categorical data we mean that the data is put into discrete categories. For example, we
divide people up in how they voted for a particular bill and the people who voted yes go in
one category and the people who voted no go in another. Or we might categorize musicians
by the instrument they play. So all saxophonists go in one bucket, all drummers in another,
all pianists in another and so on. And these categories do not form a scale. So, for example,
saxophonists are not closer to pianists than they are to drummers. Numerical data is on a
scale. An annual salary of $105,000 is closer to a salary of $110,000 than it is to one of
$40,000.
For Bayesian approaches we count things (how many occurrences are there of people who
are sedentary?) and it may not be initially obvious how to count things that are on a scale, for
example, something like grade point average. There are two approaches.
The first approach is to make categories by discretizing the numeric values into bins, for example:

Age           Annual Salary
< 18          > $200,000
18 - 22       $150,000 - $200,000
23 - 30       $100,000 - $150,000
30 - 40       $60,000 - $100,000
> 40          $40,000 - $60,000

The second approach, which we will use for the rest of this chapter, is to assume the values
follow a Gaussian (normal) distribution and use a probability density function.
6-45
6-46
The terms Gaussian Distribution and Probability Density Function sound cool, but they
are more than just good phrases to know so you can impress your friends at dinner parties. So
what do they mean and how can they be used with the Naïve Bayes method? Consider the
example we have been using with an added attribute of income:
Main Interest   Current Exercise Level   How Motivated   Comfortable with tech. devices?   Income (in $1,000)   Model #
both            sedentary                moderate        yes                                       60            i100
both            sedentary                moderate        no                                        75            i100
health          sedentary                moderate        yes                                       90            i500
appearance      active                   moderate        yes                                      125            i500
appearance      moderate                 aggressive      yes                                      100            i500
appearance      moderate                 aggressive      no                                        90            i100
health          moderate                 aggressive      no                                       150            i500
both            active                   moderate        yes                                       85            i100
both            moderate                 aggressive      yes                                      100            i500
appearance      active                   aggressive      yes                                      120            i500
both            active                   aggressive      no                                        95            i500
health          active                   moderate        no                                        90            i500
health          sedentary                aggressive      yes                                       85            i500
appearance      active                   moderate        no                                        70            i100
health          sedentary                moderate        no                                        45            i100
Let's think of the typical purchaser of an i500, our awesome, premiere device. If I were to ask
you to describe this person you might give me the average income:

mean = 106.111 (in $1,000)
And perhaps after reading chapter 4 you might give the standard deviation:
6-47
sd = sqrt( Σ (x - mean)² / card(x) )
Recall that the standard deviation describes the range of scattering. If all the values are
bunched up around the mean, the standard deviation is small; if the values are scattered the
standard deviation is large.
What is the income standard deviation of the people who bought the i500? (those
values are shown in the column below)
Income
(in $1,000)
90
125
100
150
100
120
95
90
85
6-48
What is the standard deviation of the income of the people who bought the i500?
(those values are shown in the column above)
Income (in $1,000)    (x - 106.111)    (x - 106.111)²
        90               -16.111           259.564
       125                18.889           356.794
       100                -6.111            37.344
       150                43.889          1926.244
       100                -6.111            37.344
       120                13.889           192.904
        95               -11.111           123.454
        90               -16.111           259.564
        85               -21.111           445.674
                                   sum = 3638.889

sd = sqrt(3638.889 / 9) = sqrt(404.321) = 20.108
6-49
For this sample, I can use the standard deviation formula I used above, but there is another
formula that has been shown to be a better estimate of the entire population standard
deviation. This formula is called the sample standard deviation and it is just a slight
modification of the previous formula:
sd = sqrt( Σ (x - mean)² / (card(x) - 1) )
6-50
sd = sqrt( 3638.889 / (9 - 1) ) = sqrt( 3638.889 / 8 ) = sqrt(454.861) = 21.327
For the rest of this chapter we will be using sample standard deviation.
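As a quick check of those numbers in Python (using the standard library's statistics module):

    import statistics

    incomes = [90, 125, 100, 150, 100, 120, 95, 90, 85]   # i500 buyers, in $1,000
    print(statistics.mean(incomes))    # 106.111...
    print(statistics.stdev(incomes))   # sample standard deviation: 21.327...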
You probably have heard terms such as normal distribution, bell curve, and Gaussian
distribution. Gaussian distribution is just a high falutin term for normal distribution. The
function that describes this distribution is called the Gaussian function or bell curve. Most of
the time the Numerati (aka data miners) assume attributes follow a Gaussian distribution.
What it means is that about 68% of the instances in a Gaussian distribution fall within 1
standard deviation of the mean and 95% of the instances fall within 2 standard deviations of
the mean:
In our case, the mean was 106.111 and the sample standard deviation was 21.327. So 95% of
the people who purchase an i500 earn within about $42,660 of the mean, that is, roughly
between $63,450 and $148,770. If I asked you what you thought P(100k|i500), the likelihood
that an i500 purchaser earns $100,000, was, you might think that's pretty likely. If I asked you
what you thought the likelihood of P(20k|i500), the likelihood that an i500 purchaser earns
$20,000, was, you might think it was pretty unlikely.
6-51
To formalize this, we are going to use the mean and standard deviation to compute this
probability as follows:

P(xi | yj) = (1 / (sqrt(2π) · σij)) · e^( -(xi - μij)² / (2σij²) )

Maybe putting the formula in a bigger font makes it look simpler!

Let's jump right into dissecting this formula so we can see how simple it really is. Let us say
we are interested in computing P(100k|i500), the probability that a person earns $100,000
(or 100k) given they purchased an i500. A few pages ago we computed the average income
(mean) of people who bought the i500. We also computed the sample standard deviation.
These values are shown on the following page. In Numerati speak, we represent the mean
with the Greek letter μ (mu) and the standard deviation with σ (sigma).
6-52
μij = 106.111       σij = 21.327       xi = 100

P(xi | yj) = (1 / (sqrt(2π) · σij)) · e^( -(xi - μij)² / (2σij²) )

P(xi | yj) = (1 / (sqrt(2π) · 21.327)) · e^( -(100 - 106.111)² / (2 · 21.327²) )
           = (1 / 53.458) · e^(-0.0411)

The e is a mathematical constant that is the base of the natural logarithm. Its value is
approximately 2.718.

P(xi | yj) = (1 / 53.458) · (2.718)^(-0.0411) = (0.0187)(0.960) = 0.0180
So the probability that the income of a person who bought the i500 is $100,000 is 0.0180.
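As a small Python sketch of that calculation (my own helper function, not the book's
implementation):

    import math

    def pdf(mean, ssd, x):
        """probability density for a Gaussian with the given mean and
        sample standard deviation, evaluated at x"""
        ePart = math.pow(math.e, -(x - mean) ** 2 / (2 * ssd ** 2))
        return (1.0 / (math.sqrt(2 * math.pi) * ssd)) * ePart

    print(pdf(106.111, 21.327, 100))   # about 0.0180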
6-53
In the table below I have the horsepower ratings for cars that get 35 miles per gallon.
I would like to know the probability of a Datsun 280z having 132 horsepower given it
gets 35 miles per gallon.
car                 HP
Datsun 210          65
Ford Fiesta         66
VW Jetta            74
Nissan Stanza       88
Ford Escort         65
                    88
Plymouth Horizon    70
Suburu DL           67
6-54
μij = _____       σij = _____       xi = _____
I would like to know the probability of a Datsun 280z having 132 horsepower given it
gets 35 miles per gallon. Using the horsepower ratings in the table above:

μij = 72.875
σij = sqrt(672.875 / 7) = sqrt(96.126) = 9.804
xi = 132
6-55
μij = 72.875       σij = 9.804       xi = 132

P(xi | yj) = (1 / (sqrt(2π) · σij)) · e^( -(xi - μij)² / (2σij²) )

P(132hp | 35mpg) = (1 / (sqrt(2π) · 9.804)) · e^( -(132 - 72.875)² / (2 · 9.804²) )
                 = (1 / 24.575) · e^(-3495.766 / 192.237)
                 = 0.0407 · e^(-18.185)
                 = 0.0407 · 0.00000001266
                 = 0.0000000005152

Ok. It is extremely unlikely that a Datsun 280z, given that it gets 35 miles to the gallon,
has 132 horsepower (but it does!).
6-56
6-57
Python Implementation
Training Phase
The Naïve Bayes method relies on prior and conditional probabilities. Let's go back to our
Democrat/Republican example. Prior probabilities are the probabilities that hold before we
have observed any evidence. For example, if I know there are 233 Republicans and 200
Democrats, then the prior probability of some arbitrary member of the U.S. House of
Representatives being a Republican is

P(republican) = 233 / 433 = 0.54
This is denoted P(h). Conditional probability P(h|D) is the probability that h holds given that
we know D, for example, P(democrat|bill1Vote=yes). In Naïve Bayes, we flip that probability
and compute P(D|h), for example, P(bill1Vote=yes|democrat).
In our existing Python program we store these conditional probabilities in a dictionary of the
following form:
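For example (the 0.333 here matches the value mentioned below; the remaining numbers are
illustrative placeholders):

    {'democrat':   {1: {'yes': 0.333, 'no': 0.667}, ...},
     'republican': {1: {'yes': 0.811, 'no': 0.189}, ...}}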
So the probability that someone voted yes to bill 1 given that they are a Democrat
(P(bill 1=yes|democrat)) is 0.333.
We will keep this data structure for attributes whose values are discrete values (for example,
yes, no, sex=male, sex=female). However, when attributes are numeric we will be using
the probability density function and we need to store the mean and sample standard
deviation for that attribute. For these numeric attributes I will use the following structures:
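Mirroring the c.means and c.ssd output shown at the end of this section, those structures are
dictionaries keyed first by class and then by column number, roughly:

    self.means = {'0': {1: 2.887, 2: 111.906, ...},
                  '1': {1: 5.25,  2: 146.056, ...}}

    self.ssd   = {'0': {1: 2.547, 2: 23.455, ...},
                  '1': {1: 4.211, 2: 29.523, ...}}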
6-58
As before, each instance is represented by a line in a data file. The attributes of each
instances are separated by tabs. For example, the first few lines of a data file for the Pima
Indians Diabetes Data set might be:
3    78    50    32    88    31.0    0.248    26    1
4    111   72    47    207   37.1    1.390    56    1
1    189   60    23    846   30.1    0.398    59    1
1    117   88    24    145   34.5    0.403    40    1
3    107   62    13    48    22.9    0.678    23    1
7    81    78    40    48    46.7    0.261    42    0
2    99    70    16    44    20.4    0.235    27    0
5    105   72    29    325   36.9    0.159    28    0
2    142   82    18    64    24.7    0.761    21    0
1    81    72    18    40    26.6    0.283    24    0
0    100   88    60    110   46.8    0.962    31    0
Also as before, we are going to represent how the program should interpret each column by
use of a format string, which uses the terms
attr identifies columns that should be interpreted as non-numeric attributes, and which
will use the Bayes methods shown earlier in this chapter.
num identifies columns that should be interpreted as numeric attributes, and which will
use the Probability Density Function (so we will need to compute the mean and standard
deviation during training).
class identifies the column representing the class of the instance (what we are trying to
learn)
6-59
In the Pima Indian Diabetes data set the format string will be
"num
num
num
num
num
num
num
num
class"
To compute the mean and sample standard deviation we will need some temporary data
structures during the training phase. Again, let us look at a small sample of the Pima data set.
3    78    50    32    88    31.0    0.248    26    1
4    111   72    47    207   37.1    1.390    56    1
1    189   60    23    846   30.1    0.398    59    1
2    142   82    18    64    24.7    0.761    21    0
1    81    72    18    40    26.6    0.283    24    0
0    100   88    60    110   46.8    0.962    31    0
The last column represents the class of each instance. So the first three individuals developed
diabetes and the last three did not. All the other columns represent numeric attributes, for
which we need to compute the mean and standard deviation for each of the two classes. To
compute the mean for each class and attribute I will need to keep track of the running totals.
In our existing code we already keep track of the total number of instances. I will implement
this using a dictionary:
totals

{'1': {1: 8, 2: 378, ...},
 '0': {1: 3, 2: 323, ...}}

So for class 1, the column 1 total is 8 (3 + 4 + 1), the column 2 total is 378, etc.
For class 0, the column 1 total is 3 (2 + 1 + 0), the column 2 total is 323 and so on.
For standard deviation, we will also need to keep the original data, and for that we will use a
dictionary in the following format:
6-60
numericValues

{'1': {1: [3, 4, 1], 2: [78, 111, 189], ...},
 '0': {1: [2, 1, 0], 2: [142, 81, 100], ...}}
I have added the code to create these temporary data structures to the __init__() method
of our Classifier class as shown below:
import math

class Classifier:

    def __init__(self, bucketPrefix, testBucketNumber, dataFormat):
        """ a classifier will be built from files with the bucketPrefix
        excluding the file with testBucketNumber. dataFormat is a string that
        describes how to interpret each line of the data files. For example,
        for the iHealth data the format is:
        "attr\tattr\tattr\tattr\tclass"
        """
        total = 0
        classes = {}
        # counts used for attributes that are not numeric
        counts = {}
        # totals used for attributes that are numeric
        # we will use these to compute the mean and sample standard deviation
        # for each attribute - class pair.
        totals = {}
        numericValues = {}

        # reading the data in from the file
        self.format = dataFormat.strip().split('\t')
        self.prior = {}
        self.conditional = {}
        # for each of the buckets numbered 1 through 10:
        for i in range(1, 11):
            # if it is not the bucket we should ignore, read in the data
            if i != testBucketNumber:
                filename = "%s-%02i" % (bucketPrefix, i)
                f = open(filename)
                lines = f.readlines()
                f.close()
                for line in lines:
                    fields = line.strip().split('\t')
                    ignore = []
                    vector = []
                    nums = []
                    for i in range(len(fields)):
                        if self.format[i] == 'num':
                            nums.append(float(fields[i]))
                        elif self.format[i] == 'attr':
                            vector.append(fields[i])
                        elif self.format[i] == 'comment':
                            ignore.append(fields[i])
                        elif self.format[i] == 'class':
                            category = fields[i]
                    # now process this instance
                    total += 1
                    classes.setdefault(category, 0)
                    counts.setdefault(category, {})
                    totals.setdefault(category, {})
                    numericValues.setdefault(category, {})
                    classes[category] += 1
                    # now process each non-numeric attribute of the instance
                    col = 0
                    for columnValue in vector:
                        col += 1
                        counts[category].setdefault(col, {})
                        counts[category][col].setdefault(columnValue, 0)
                        counts[category][col][columnValue] += 1
                    # process numeric attributes
                    col = 0
                    for columnValue in nums:
                        col += 1
                        totals[category].setdefault(col, 0)
                        totals[category][col] += columnValue
                        numericValues[category].setdefault(col, [])
                        numericValues[category][col].append(columnValue)
6-62
code it
Can you add the code to compute the means and standard deviations? Download the
file naiveBayesDensityFunctionTraining.py from guidetodatamining.com.
Your program should produce the data structures ssd and means:
>>> c = Classifier("pimaSmall/pimaSmall", 1,
                   "num\tnum\tnum\tnum\tnum\tnum\tnum\tnum\tclass")
>>> c.ssd
{'0': {1: 2.54694671925252, 2: 23.454755259159146, ...},
 '1': {1: 4.21137914295475, 2: 29.52281872377408, ...}}
>>> c.means
{'0': {1: 2.8867924528301887, 2: 111.90566037735849, ...},
 '1': {1: 5.25, 2: 146.05555555555554, ...}}
6-63
code it solution
Here is my solution:
        #
        # now compute mean and sample standard deviation
        #
        self.means = {}
        self.ssd = {}
        self.totals = totals
        for (category, columns) in totals.items():
            self.means.setdefault(category, {})
            for (col, cTotal) in columns.items():
                self.means[category][col] = cTotal / classes[category]
        # standard deviation
        for (category, columns) in numericValues.items():
            self.ssd.setdefault(category, {})
            for (col, values) in columns.items():
                SumOfSquareDifferences = 0
                theMean = self.means[category][col]
                for value in values:
                    SumOfSquareDifferences += (value - theMean)**2
                columns[col] = 0
                self.ssd[category][col] = math.sqrt(SumOfSquareDifferences
                                                    / (classes[category] - 1))
6-64
code it 2
Can you revise the classify method so it uses the probability density function for
numeric attributes? The file to modify is naiveBayesDensityFunctionTemplate.py. Here
is the original classify method:
    def classify(self, itemVector, numVector):
        """Return class we think item Vector is in"""
        results = []
        sqrt2pi = math.sqrt(2 * math.pi)
        for (category, prior) in self.prior.items():
            prob = prior
            col = 1
            for attrValue in itemVector:
                if not attrValue in self.conditional[category][col]:
                    # we did not find any instances of this attribute value
                    # occurring with this category so prob = 0
                    prob = 0
                else:
                    prob = prob * self.conditional[category][col][attrValue]
                col += 1
            results.append((prob, category))
        # return the category with the highest probability
        #print(results)
        return(max(results)[1])
6-65
code it 2 - solution
Can you revise the classify method so it uses the probability density function for
numeric attributes? The file to modify is naiveBayesDensityFunctionTemplate.py.
Solution:
    def classify(self, itemVector, numVector):
        """Return class we think item Vector is in"""
        results = []
        sqrt2pi = math.sqrt(2 * math.pi)
        for (category, prior) in self.prior.items():
            prob = prior
            col = 1
            for attrValue in itemVector:
                if not attrValue in self.conditional[category][col]:
                    # we did not find any instances of this attribute value
                    # occurring with this category so prob = 0
                    prob = 0
                else:
                    prob = prob * self.conditional[category][col][attrValue]
                col += 1
            col = 1
            for x in numVector:
                mean = self.means[category][col]
                ssd = self.ssd[category][col]
                ePart = math.pow(math.e, -(x - mean)**2/(2*ssd**2))
                prob = prob * ((1.0 / (sqrt2pi*ssd)) * ePart)
                col += 1
            results.append((prob, category))
        # return the category with the highest probability
        #print(results)
        return(max(results)[1])
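For reference, the ePart and prob lines in the numeric loop compute the normal probability density function. Written out, the quantity multiplied into prob for each numeric attribute x is

P(x \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

where μ and σ are the mean and sample standard deviation stored for that attribute-class pair.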
6-66
Here are the kNN results from earlier on these two data sets:

          pimaSmall   pima
k=1       59.00%      71.247%
k=3       61.00%      72.519%

Here are the results when we use Naïve Bayes with these two data sets:

          pimaSmall   pima
Bayes     72.000%     77.354%

Wow! So it looks like Naïve Bayes performs better than kNN!
6-67
6-68
What enables us to multiply probabilities together is the fact that the events these
probabilities represent are independent. For example, consider a game where we flip a coin
and roll a die. These events are independent, meaning what we roll on the die does not
depend on whether we flip a heads or tails on the coin. And, as I just said, if events are
independent we can determine their joint probability (the probability that they both
occurred) by multiplying the individual probabilities together. So the probability of getting a
heads and rolling a 6 is
1/2 × 1/6 = 1/12 ≈ 0.0833
Let's say I alter a deck of cards keeping all the black cards (26 of them) but only retaining the
face cards for the red suits (6 of them). That makes a 32 card deck. What is the probability
of selecting a face card?
P(face card) = 12/32 = 0.375

What is the probability of selecting a red card?

P(red) = 6/32 = 0.1875
What is the probability of selecting a single card that is both red and a face card? Here we do
not multiply probabilities. We do not do P(red) × P(face card) = 0.1875 × 0.375 ≈ 0.07,
because the correct answer is 6/32 = 0.1875: every red card in this deck is a face card, so the
two events are not independent.
6-70
Think about the music attributes: things like amount of distorted guitar (on a 1-5 scale), or
amount of classical violin sound. Many of these attributes are not independent. If I have a
lot of distorted guitar sound, the probability of having a classical violin sound decreases.
Think about cases yourself. For example, consider attributes of cars. Are they independent?
Attributes of a movie? Amazon purchases?
So, for Bayes to work we need to use attributes that are independent, but most real-world
problems violate that condition. What we are going to do is just assume that they are
independent! We are using the magic wand of sweeping things under the rug and
ignoring this problem. We call it naïve Bayes because we are naïvely assuming
independence even though we know the attributes are not. It turns out that naïve Bayes works
really, really well even with this naïve assumption.
code it
Can you run the naïve Bayes code on our other data sets? For example, our kNN
algorithm was 53% accurate on the auto MPG data set. Does a Bayes approach
produce better results?

tenfold("mpgData/mpgData", "class\tattr\tnum\tnum\tnum\tnum\tcomment")

?????
6-71
Classifying
unstructured text
In previous chapters we've looked at recommendation systems that have people explicitly
rate things with star systems (5 stars for Phoenix), thumbs-up/thumbs-down (Inception:
thumbs-up!), and numerical scales. We've looked at implicit things like the behavior of
people: did they buy the item, did they click on a link. We have also looked at classification
systems that use attributes like height, weight, and how people voted on a particular bill. In all
these cases the information in the datasets can easily be represented in a table.
age    glucose level    blood pressure    diabetes?
26     78               50                1
56     111              72                1
23     81               78                0

mpg    cylinders    HP     sec. 0-60
30     4            68     19.5
45     4            48     21.7
20     8            130    12
This type of data is called structured data: data where instances (rows in the tables above)
are described by a set of attributes (for example, a row in a table might describe a car by a set
of attributes including miles per gallon, the number of cylinders and so on). Unstructured
data includes things like email messages, Twitter messages, blog posts, and newspaper
articles. These types of things (at least at first glance) do not seem to be neatly represented in
a table.

For example, suppose we are interested in determining whether various movies are good or
not good and we want to analyze Twitter messages:

We, as speakers of English, can see that Andy Gavin likes Gravity, since he said it "puts the
thrill back in thriller" and has "good acting". We know that Debra Murphy seems not so excited
about the movie since she said "save your $$$". And if someone writes "I wanna see Gravity
sooo bad, we should all go see it!!!" that person probably likes the movie even though they
used the word bad.
Suppose I am at my local food co-op and see something called Chobani Greek Yogurt. It looks
interesting but is it any good? I get out my iPhone, do a Google search, and find the following
from the blog Woman Does Not Live on Bread Alone:

Is that a positive or negative review for Chobani? Even based on the second sentence, "If not,
stop reading, gather your keys and get to your local grocery store," it seems that this will
be a positive review. She describes the flavor as fantastic and calls the yogurt delicious. It
seems that I should buy it and check it out. I will be right back...
7- 3
Let's imagine an automatic system that can read some text and decide whether it is a positive
or negative report about a product. Why would we want such a system? Suppose there is a
company that sells health monitors; they might want to know what people are saying
about their products. Is what people say mostly positive or negative? They release an ad
campaign for a new product. Are people favorable about the product (Man, I sooo want this!)
or negative (looks like crap)? Apple has a press conference to talk about the iPhone problems.
Is the resulting press coverage positive? A Senate candidate delivers a major policy speech;
do the political bloggers view it favorably? So an automatic system does sound useful.
7-4
Let's say I want to create a system that can tell whether a person likes or dislikes various food
products. We might come up with an idea of having a list of words that would provide
evidence that a person likes the product and another list of words that provides evidence that
the person doesn't like the product.
Like words:
delicious
tasty
good
love
smooth
Dislike words:
awful
bland
bad
hate
gritty
If we are trying to determine if a particular reviewer likes Chobani yogurt or not, we can just
count the number of like words and the number of dislike words in their text. We will
classify the text based on which number is higher. We can do this for other classification
tasks. For example, if we want to decide whether someone is pro-choice or pro-life, we can
base it on the words and phrases they use. If they use the phrase 'unborn child' then chances
are they are pro-life; if they use fetus they are more likely to be pro-choice. It's not surprising
that we can use the occurrence of words to classify text.
7- 5
I am going to go
through all the hypotheses
and pick the one with the
maximum probability
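The formula the next paragraph refers to is the MAP (maximum a posteriori) version of Bayes' rule used in the previous chapter; roughly:

h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)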
We will use the naïve Bayes methods that were introduced in the previous chapter. We start
with a training data set and, since we are now interested in unstructured text, this data set is
called the training corpus. Each entry in the corpus we will call a document, even if it is a
140 character Twitter post. Each document is labeled with its class. So, for example, we
might have a corpus of Twitter posts that rated movies. Each post is labeled in some way as a
favorable review or unfavorable, and we are going to train our classifier using this corpus of
labeled documents. The P(h) in the formula above is the probability of these labels. If we
have 1,000 documents in our training corpus and 500 of them are favorable reviews and 500
unfavorable then

P(favorable) = 0.5
P(unfavorable) = 0.5
Okay, back to the formula.
Now let's examine the P(D|h) part of the formula: the probability of seeing some evidence,
some data D, given the hypothesis h. The data D we are going to use is the words in the text.
One approach would be to start with the first sentence of a document, for example, Puts the
Thrill back in Thriller, and compute things like the probability that a 'like' document starts
with the word Puts; what's the probability of a 'like' document having a second word of the;
and the probability of the third word of a 'like' document being Thrill, and so on. And then
compute the probability of a 'dislike' document starting with the word Puts, the probability of
the second word of a 'dislike' document being the, and so on.
7- 7
If a Twitter message
has about 14 words, we
would need to compute...
Hmm. yeah. That is a huge number of probabilities which makes this approach unworkable.
And, fortunately, there is a better approach. We are going to simplify things a bit by treating
the documents as bags of unordered words. Instead of asking things like What's the
probability that the third word is thrill given it is a 'like' document we will ask What's the
probability that the word thrill occurs in a 'like' document. Here is how we are going to
compute those probabilities.
Training Phase
First, we are going to determine the vocabulary (the unique words) of all the documents
(both 'like' and 'dislike' documents). So, for example, even though the may occur thousands of
times in our training corpus it only occurs once in our vocabulary. Let |Vocabulary|
denote the number of words in the vocabulary. Next, for each word wk in the vocabulary we
are going to compute the probability of that word occurring given each hypothesis: P(wk | hi).
7-8
We are going to compute this as follows. For each hypothesis (in this case 'like' and 'dislike'):

1. Combine the documents tagged with that hypothesis into one text file.
2. Count how many word occurrences there are in the file. This time, if there are 500
   occurrences of the we are going to count it 500 times. Let's call this n.
3. For each word in the vocabulary wk, count how many times that word occurred in the
   text. Call this nk.
4. For each word in the vocabulary wk, compute

   P(wk | hi) = (nk + 1) / (n + |Vocabulary|)
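As a minimal sketch of that training computation (the function and variable names here are my own, not the ones used in the book's code):

def train_category(documents, vocabulary):
    """documents: list of token lists for one hypothesis (e.g. all 'like' posts).
    Returns a dict of P(word | hypothesis) using the add-one formula above."""
    counts = {}
    n = 0                       # total word occurrences in this category
    for doc in documents:
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
            n += 1
    denominator = n + len(vocabulary)
    # every vocabulary word gets (nk + 1) / (n + |Vocabulary|),
    # so unseen words still get a small non-zero probability
    return {word: (counts.get(word, 0) + 1) / denominator
            for word in vocabulary}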
7- 9
Let's say our training corpus consisted of 500 Twitter messages with positive reviews of
movies and 500 negative. So

P(like) = 0.5
P(dislike) = 0.5

And suppose the training phase produced word probabilities that include:

word       P(word|like)    P(word|dislike)
am         0.007           0.009
by         0.012           0.012
good       0.002           0.0005
gravity    0.00001         0.00001
great      0.003           0.0007
hype       0.0007          0.002
I          0.01            0.01
over       0.005           0.0047
stunned    0.0009          0.002
the        0.047           0.0465
7-10
Now let's classify the Twitter message "I am stunned by the hype over gravity":

word       P(word|like)    P(word|dislike)
I          0.01            0.01
am         0.007           0.009
stunned    0.0009          0.002
by         0.012           0.012
the        0.047           0.0465
hype       0.0007          0.002
over       0.005           0.0047
gravity    0.00001         0.00001

P(like) = 0.5      P(like) × P(I|like) × ... × P(gravity|like)            = 6.22E-22
P(dislike) = 0.5   P(dislike) × P(I|dislike) × ... × P(gravity|dislike)   = 4.72E-21

That is, 0.000000000000000000000622 versus 0.000000000000000000004720. The dislike
number is larger, so we classify the message as dislike.
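You can check those two numbers by multiplying the probabilities from the table; a quick sketch:

like    = [0.01, 0.007, 0.0009, 0.012, 0.047, 0.0007, 0.005, 0.00001]
dislike = [0.01, 0.009, 0.002, 0.012, 0.0465, 0.002, 0.0047, 0.00001]

p_like, p_dislike = 0.5, 0.5
for l, d in zip(like, dislike):
    p_like *= l
    p_dislike *= d

print(p_like)      # roughly 6.22e-22
print(p_dislike)   # roughly 4.72e-21 -- larger, so we classify as 'dislike'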
7- 11
Yes. If we multiply the word probabilities for even a short document of 100 words we are
going to get a very, very, very small number.

Here's an illustration of the problem. Let's say we have a 100 word document and the average
probability of each word is 0.0001 (a word like reported or morning might have a probability
in this neighborhood). If I multiply those probabilities in Python we get zero:

>>> 0.0001**100
0.0
However, if we add the log of the probabilities we do get a non-zero value:
>>> import math
>>> p = 0
>>> for i in range(100):
...     p += math.log(0.0001)
...
>>> p
-921.034037197617
7-12
b^n = x

The logarithm (or log) of a number (the x above) is the exponent (the n above)
that you need to raise a base (b) to in order to equal that number. For example, suppose
the base is 10:

log10(1000) = 3 since 1000 equals 10^3

The base of the Python log function is the mathematical constant e. We don't
really need to know about e. What is of interest to us is:

1. logs compress the scale of a number (with logs we can represent smaller
   numbers in Python). For example,

   .0000001 x .00005 = .000000000005

   the logs of those numbers are:

   -16.11809 + -9.90348 = -26.02157

2. instead of multiplying the probabilities we are going to add the logs of the
   probabilities (as shown above).
Newsgroup Corpus
We will first investigate how this algorithm works by using a standard reference corpus of
usenet newsgroup posts. The data consists of posts from 20 different newsgroups:
comp.graphics
misc.forsale
soc.religion.christian
alt.atheism
comp.os.ms-windows.misc
rec.autos
talk.politics.guns
sci.space
comp.sys.ibm.pc.hardware
rec.motorcycles
talk.politics.mideast
sci.crypt
comp.sys.mac.hardware
rec.sport.baseball
talk.politics.misc
sci.electronics
comp.windows.x
rec.sport.hockey
talk.religion.misc
sci.med
7- 13
We would like to build a classifier that can correctly determine what group the post came
from. For example, we would like to classify this post
From: essbaum@rchland.vnet.ibm.com (Alexander Essbaum)
Subject: Re: Mail order response time
Disclaimer: This posting represents the poster's views, not necessarily those of IBM
Nntp-Posting-Host: relva.rchland.ibm.com
Organization: IBM Rochester
Lines: 18

> I have ordered many times from Competition
> accesories and ussually get 2-3 day delivery.

ordered 2 fork seals and 2 guide bushings from
CA for my FZR. two weeks later get 2 fork seals
and 1 guide bushing. call CA and ask for
remaining *guide* bushing and order 2 *slide*
bushings (explain on the phone which bushings
are which; the guy seemed to understand). two
weeks later get 2 guide bushings.

*sigh*

how much you wanna bet that once i get ALL the
parts and take the fork apart that some parts
won't fit?
7-14
Ladies and Gentlemen. On the main stage ... Just based on the words in the text, we are going
to attempt to tell which newsgroup the post is from.
For example, we would like to build a system that would classify the following post as being
from rec.motorcycles:
I am looking at buying a Dual Sport type motorcycle. This is my first
cycle as well. I am interested in any experiences people have with
the following motorcycles, good or bad.
Honda XR250L
Suzuki DR350S
Suzuki DR250ES
Yamaha XT350
Most XXX vs. YYY articles I have seen in magazines pit the Honda XR650L
against another cycle, and the 650 always comes out shining. Is it safe
to assume that the 250 would be of equal quality ?
7- 15
I ...           not helpful
am ...          not helpful
looking ...     not helpful
at ...          not helpful
buying ...
a ...           not helpful
dual ...        definitely helpful
sport ...       definitely
type ...        probably not
motorcycle ...  definitely!!!!
If we throw out the 200 most frequent words in English our document looks like this:
7-16
Removing these words cuts down the size of our text by about half. Plus, it doesn't look like
removing these words will have any impact on our ability to categorize texts. Indeed, data
miners have called such words 'words without any content' and 'fluff words'. H.P. Luhn, in his
seminal paper 'The automatic creation of literature abstracts', says of these words that they
are "too common to have the type of significance being sought" and would constitute 'noise' in
the system. That noise argument is interesting as it implies that removing these words will
improve performance. The words that we remove are called 'stop words'. We have a list of
such words, the 'stop word list', and remove these words from the text in a preprocessing
step. We remove these words because 1) it cuts down on the amount of processing we need to
do, and 2) it does not negatively impact the performance of our system; as the noise
argument suggests, removing them might even improve performance.
While removing stop words may be useful in some situations, you should not just
automatically remove them without thinking. For example, it turns out just using the most
frequent words and throwing out the rest (the reverse technique of the above) provides
7- 17
sufficient information to identify where Arabic documents were written. (If you are curious
about this check out the paper Linguistic Dumpster Diving: Geographical Classification of
Arabic Text I co-wrote with some of my colleagues at New Mexico State University. It is
available on my website http://zacharski.org). In looking at online chats, sexual predators
use words like I, me, and you, much more frequently than non-predators. If your task is to
identify sexual predators, removing frequent words would actually hurt your performance.
Don't blindly remove stop words. Think first.
(figure: the training data is organized as one subdirectory per newsgroup - alt.atheism,
comp.graphics, and so on - and each subdirectory contains the text files, numbered 1 through
n, for that newsgroup)
class BayesText

1. the init method:
   a. read in the words from the stoplist
   b. read the training directory to get the names of the subdirectories (in addition to
      being the names of the subdirectories, these are the names of the categories).
   c. for each of those subdirectories, call a method train that will count the occurrences
      of words in all the files of that subdirectory.
   d. compute the probabilities using

      P(wk | hi) = (nk + 1) / (n + |Vocabulary|)
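The author's code for step d appears below. For steps b and c, here is a minimal sketch of what such a train method might look like; the directory layout and the tokenizing details are my assumptions, not the book's exact code:

import os

def train(trainingdir, category, stopwords):
    """count word occurrences in every file of one category's subdirectory.
    Returns (counts, total) where counts maps word -> number of occurrences."""
    counts = {}
    total = 0
    currentdir = os.path.join(trainingdir, category)
    for filename in os.listdir(currentdir):
        with open(os.path.join(currentdir, filename), errors='ignore') as f:
            for line in f:
                for token in line.split():
                    token = token.strip('\'".,?:-').lower()
                    if token != '' and token not in stopwords:
                        counts[token] = counts.get(token, 0) + 1
                        total += 1
    return (counts, total)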
7- 19
7-20
        # now delete
        for word in toDelete:
            del self.vocabulary[word]
        # now compute probabilities
        vocabLength = len(self.vocabulary)
        print("Computing probabilities:")
        for category in self.categories:
            print('    ' + category)
            denominator = self.totals[category] + vocabLength
            for word in self.vocabulary:
                if word in self.prob[category]:
                    count = self.prob[category][word]
                else:
                    count = 1
                self.prob[category][word] = (float(count + 1)
                                             / denominator)
        print ("DONE TRAINING\n\n")
7- 21
The results of the training phase are stored in a dictionary (hash table) called prob. The keys
of the dictionary are the different classifications (comp.graphics, rec.motorcycles,
soc.religion.christian, etc); the values are dictionaries. The keys of these subdictionaries are
the words and the values are the probabilities of those words. Here is an example:
bT = BayesText(trainingDir, stoplistfile)
>>>bT.prob["rec.motorcycles"]["god"]
0.00013035445075435553
>>>bT.prob["soc.religion.christian"]["god"]
0.004258192391884386
>>>bT.prob["rec.motorcycles"]["the"]
0.028422937849264914
>>>bT.prob["soc.religion.christian"]["the"]
0.039953678998362795
So, for example, the probability of the word god occurring in a text in the rec.motorcycles
newsgroup is 0.00013 (roughly one occurrence of god in every 10,000 words). The probability
of the word god occurring in a text in soc.religion.christian is 0.00426 (about one occurrence
in every 250 words).
Training also results in a list called categories, which, as you might predict, is simply a list of
all the categories:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware', ...]
7-22
code it
Can you code a method called classify that will predict the classification of a
document? For example:
>>> bT.classify("20news-bydate-test/rec.motorcycles/104673")
'rec.motorcycles'
>>> bT.classify("20news-bydate-test/sci.med/59246")
'sci.med'
>>> bT.classify("20news-bydate-test/soc.religion.christian/21424")
'soc.religion.christian'
As you can see, the classify method takes a filename as an argument and returns a
string denoting the classification.
A Python file you can use as a template, bayesText-ClassifyTemplate.py, is available
on our website.
class BayesText:
    ...
            print('    ' + category)
            (self.prob[category],
             self.totals[category]) = self.train(trainingdir, category)
        # I am going to eliminate any word in the vocabulary ...
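One way to write the classify method asked for above, sketched under the assumption that self.prob, self.totals, self.vocabulary, and self.categories are built as described; this is my sketch, not necessarily the book's solution:

import math

def classify(self, filename):
    """return the category whose summed log probabilities are highest
    (add this method inside the BayesText class)"""
    results = {category: 0 for category in self.categories}
    with open(filename, errors='ignore') as f:
        for line in f:
            for token in line.split():
                token = token.strip('\'".,?:-').lower()
                if token in self.vocabulary:
                    for category in self.categories:
                        # add logs instead of multiplying tiny probabilities
                        results[category] += math.log(
                            self.prob[category][token])
    # pick the category with the largest (least negative) total
    return max(results, key=results.get)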
7- 23
Finally, let's have a method that classifies every document in the test directory and prints out
the percent accuracy of this method.

DONE TRAINING

Running Test ...
....................
Accuracy is  77.774827%
7- 25
code it
Can you run the classifier with a few stop word lists? Does performance improve? Which is most
accurate? (You will need to search the web to find these lists.)

stop word list        accuracy
none                  77.774827%
list 1
list 2

Here are my results:

stop word list        accuracy
none                  77.774827%
25 word list          78.757302%
174 word list         79.938927%

So in this case, it looks like having a 174 word stop word list improved performance about 2%
over having no stop word list. Does this match your results?
7- 27
Lorde is
awesome!
Okay,
I agree. Lorde IS
awesome!
One common type of sentiment analysis is to determine the polarity of a review or comment
(positive or negative) and we can use a Naïve Bayes Classifier for this task. We can try this
out by using the polarity movie review dataset first presented in Pang and Lee 2004¹. Their
dataset consists of 1,000 positive and 1,000 negative reviews. Here are some examples:

    the second serial-killer thriller of the month is just awful. oh, it starts
    deceptively okay, with a handful of intriguing characters and some solid
    location work. ...

1. Pang, Bo and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts. Proceedings of ACL.
7-28
You can download the original dataset from http://www.cs.cornell.edu/People/pabo/movie-review-data/.
I have organized the data into 10 buckets (folds) with the following directory structure:
review_polarity_buckets
    txt_sentoken
        neg
            0    (files in fold 0)
            1    (files in fold 1)
            ...
            9    (files in fold 9)
        pos
            0    (files in fold 0)
            ...
            9    (files in fold 9)
code it
Can you modify the Naive Bayes Classifier code to do 10-fold cross validation of the classifier on
this data set? The output should look something like:

            Classified as:
              neg   pos
            +-----+-----+
      neg   |   1 |   2 |
      pos   |   3 |   4 |
            +-----+-----+

12.345 percent correct
total of 2000 instances
7- 29
Obvious Disclaimer

You won't become proficient in data mining by reading this book any more than reading a
book about piano playing will make you proficient at piano playing. You need to practice!

(photo: woman practicing Brahms)
code it - my results

            Classified as:
              neg   pos
            +-----+-----+
      neg   | 845 | 155 |
      pos   | 222 | 778 |
            +-----+-----+

Yet another reminder: the code is available for download on the book's website
http://guidetodatamining.com/
def tenfold(dataPrefix, stoplist):
    results = {}
    for i in range(0, 10):
        bT = BayesText(dataPrefix, stoplist, i)
        r = bT.test(theDir, i)
        for (key, value) in r.items():
            results.setdefault(key, {})
            for (ckey, cvalue) in value.items():
                results[key].setdefault(ckey, 0)
                results[key][ckey] += cvalue
    categories = list(results.keys())
    categories.sort()
    print("\n       Classified as: ")
    header =    "        "
    subheader = "      +"
    for category in categories:
        header += "% 2s   " % category
        subheader += "-----+"
    print(header)
    print(subheader)
    total = 0.0
    correct = 0.0
    for category in categories:
        row = " %s   |" % category
        for c2 in categories:
            if c2 in results[category]:
                count = results[category][c2]
            else:
                count = 0
            row += " %3i |" % count
            total += count
            if c2 == category:
                correct += count
        print(row)
    print(subheader)
    print("\n%5.3f percent correct" % ((correct * 100) / total))
    print("total of %i instances" % total)

# change these to match your directory structure
theDir = "/Users/raz/Downloads/review_polarity_buckets/txt_sentoken/"
stoplistfile = "/Users/raz/Downloads/20news-bydate/stopwords25.txt"
tenfold(theDir, stoplistfile)
7-36
7- 37
Chapter 8: Clustering
Discovering
Groups
In previous chapters we have been developing classification systems. In these systems we
train a classifier on a set of labeled examples.
sport         height    weight
basketball    72        162
gymnastics    54        66
track         63        106
basketball    78        204

plasma glucose    diastolic BP    BMI     diabetes?
99                52              24.6
83                58              34.4
139               80              31.6
and so on. In other words, the classifier selects a label from a set of labels it acquired during
the training phase; it knows the possible labels. In this chapter we turn to systems that are
not given labeled examples; instead, they discover groupings in the data on their own.
This task is called clustering. The system divides a set of instances into clusters or groups
based on some measure of similarity. There are two main types of clustering algorithms.
k-means clustering
For one type, we tell the algorithm how many clusters to make. Please cluster these 1,000
people into 5 groups. Please classify these web pages into 15 groups. These methods go by
the name of k-means clustering algorithms and we will discuss those a bit later in the
chapter.
hierarchical clustering
For the other approach we don't specify how many clusters to make. Instead the algorithm
starts with each instance in its own cluster. At each iteration of the algorithm it combines the
two most similar clusters into one. It repeatedly does this until there is only one cluster. This
8-2
CLUSTERING
is called hierarchical clustering and its name makes sense. The running of the algorithm
results in one cluster, which consists of two sub-clusters. Each of those two sub-clusters in
turn, consist of 2 sub-sub clusters and so on.
8-3
Again, at each iteration of the algorithm we join the two closest clusters. To determine the
closest clusters we use a distance formula. But we have some choices in how we compute
the distance between two clusters, which leads to different clustering methods. Consider the
three clusters (A, B, and C) illustrated below each containing two members. Which pair of
clusters should we join? Cluster A with B, or cluster C with B?
B2
A1
B1
C1
C2
A2
Single-linkage clustering
In single-linkage clustering we define the distance between two clusters as the shortest
distance between any member of one cluster to any member of the other. With this
definition, the distance between Cluster A and Cluster B is the distance between A1 and B1,
since that is shorter than the distances between A1 and B2, A2 and B1, and A2 and B2. With
single-linkage clustering, Cluster A is closer to Cluster B than C is to B, so we would combine
A and B into a new cluster.
Complete-linkage clustering
In complete-linkage clustering we define the distance between two clusters as the greatest
distance between any member of one cluster to any member of the other. With this
definition, the distance between Cluster A and Cluster B is the distance between A2 and B2.
With complete-linkage clustering, Cluster C is closer to Cluster B than A is to B, so we would
combine B and C into a new cluster.
Average-linkage clustering
In average-linkage clustering we define the distance between two clusters as the average
distance between any member of one cluster to any member of the other. In the diagram
above, it appears that the average distance between Clusters C and B would be less than the
average between A and B and we would combine B and C into a new cluster.
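As a small sketch of those three definitions (assuming each cluster is simply a list of points and dist is our point-to-point distance function; these helpers are illustrations, not the book's code):

def single_linkage(clusterA, clusterB, dist):
    """shortest distance between any member of A and any member of B"""
    return min(dist(a, b) for a in clusterA for b in clusterB)

def complete_linkage(clusterA, clusterB, dist):
    """greatest distance between any member of A and any member of B"""
    return max(dist(a, b) for a in clusterA for b in clusterB)

def average_linkage(clusterA, clusterB, dist):
    """average distance between members of A and members of B"""
    distances = [dist(a, b) for a in clusterA for b in clusterB]
    return sum(distances) / len(distances)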
8-4
CLUSTERING
Good idea! Let's practice by clustering dog breeds based on height and weight!

breed                   height (inches)   weight (pounds)
Border Collie           20                45
Boston Terrier          16                20
Brittany Spaniel        18                35
Bullmastiff             27                120
Chihuahua               8                 8
German Shepherd         25                78
Golden Retriever        23                70
Great Dane              32                160
Portuguese Water Dog    21                50
Standard Poodle         19                65
Yorkshire Terrier       6                 7

Psst! I think we are forgetting something. Isn't there something we should do before
computing distance?
8-5
Normalization! Let's change those numbers to Modified Standard Scores.

breed                   height     weight
Border Collie           0          -0.1455
Boston Terrier          -0.7213    -0.873
Brittany Spaniel        -0.3607    -0.4365
Bullmastiff             1.2623     2.03704
Chihuahua               -2.1639    -1.2222
German Shepherd         0.9016     0.81481
Golden Retriever        0.541      0.58201
Great Dane              2.16393    3.20106
Portuguese Water Dog    0.1803     0
Standard Poodle         -0.1803    0.43651
Yorkshire Terrier       -2.525     -1.25132
8-6
CLUSTERING
                   BT      BS      Bull    Chi     GS      GR      GD      PWD     SP      YT
Border Collie      1.024   0.463   2.521   2.417   1.317   0.907   3.985   0.232   0.609   2.756
Boston Terrier             0.566   3.522   1.484   2.342   1.926   4.992   1.255   1.417   1.843
Brittany Spaniel                   2.959   1.967   1.777   1.360   4.428   0.695   0.891   2.312
Bullmastiff                                4.729   1.274   1.624   1.472   2.307   2.155   5.015
Chihuahua                                          3.681   3.251   6.188   2.644   2.586   0.362
German Shphrd                                              0.429   2.700   1.088   1.146   4.001
Golden Retriever                                                   3.081   0.685   0.736   3.572
Great Dane                                                                 3.766   3.625   6.466
Portuguese WD                                                                      0.566   2.980
Standard Poodle                                                                            2.889
Based on this chart, which two breeds do you think are the closest?

(chart: the breeds plotted by normalized height on the x-axis and normalized weight on the
y-axis)
8-7
If you said Border Collie and Portuguese Water Dog, you would be correct!
The algorithm.
Step 1.
Initially, each breed is in its own cluster. We find the two closest clusters and combine them
into one cluster. From the table on the preceding page we see that the closest clusters are the
Border Collie and the Portuguese Water Dog (distance of 0.232) so we combine them.
Border Collie
Portuguese WD
Step 2.
We find the two closest clusters and combine them into one cluster. From the table on the
preceding page we see that these are the Chihuahua and the Yorkshire Terrier (distance of
0.362) so we combine them.
Chihuahua
Yorkshire T.
Border Collie
Portuguese WD
Step 3.
We repeat the process again. This time combining the German Shepherd and the Golden
Retriever.
Chihuahua
Yorkshire T.
German Shphrd
Golden Retriever
Border Collie
Portuguese WD
8-8
CLUSTERING
Step 4.
We repeat the process yet again. From the table we see that the next closest pair is the Border
Collie and the Brittany Spaniel. The Border Collie is already in a cluster with the Portuguese
Water Dog which we created in Step 1. So in this step we are going to combine that cluster
with the Brittany Spaniel.
Chihuahua
Yorkshire T.
German Shphrd
Golden Retriever
Border Collie
Portuguese WD
Brittany Spaniel

This type of diagram is called a dendrogram. It is basically a tree diagram that represents
clusters.
And we continue:
Chihuahua
Yorkshire T.
German Shphrd
Golden Retriever
Border Collie
Portuguese WD
Brittany Spaniel
Boston Terrier
8-9
Chihuahua
Yorkshire T.
German Shphrd
Golden Retriever
Border Collie
Portuguese WD
Brittany Spaniel
Boston Terrier
8-10
CLUSTERING
Chihuahua
Yorkshire T.
German Shphrd
Golden Retriever
Border Collie
Portuguese WD
Brittany Spaniel
Boston Terrier
Standard Poodle
Bullmastiff
Great Dane
8-11
Sure!!
In a regular queue, the order in which you put the
items in the queue is the order you get the items out
of the queue...
(figure: a regular queue - the tuples (16, Suzuka), (15, Moa), and (13, Yui) are retrieved in
the same order they were put in)
CLUSTERING
In a priority queue each item put into the queue has an associated priority. The order in
which items are retrieved from the queue is based on this priority. Items with a higher
priority are retrieved before items with a lower one. In our example data, suppose the
younger a person is, the higher their priority.
We put the tuples into the queue in the same order as before.

(figure: the tuples (16, Suzuka), (15, Moa), and (13, Yui) placed into a priority queue)

The first item to be retrieved from the queue will be Yui because she is youngest and thus has
the highest priority!

(figure: (13, Yui) comes out of the priority queue 1st, (15, Moa) 2nd, and (16, Suzuka) 3rd)
8-13
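A minimal sketch of how a queue like singersQueue can be built with Python's queue.PriorityQueue; the entries here are chosen only to match the get() calls shown below, and the original example may have used more singers:

from queue import PriorityQueue

singersQueue = PriorityQueue()
singersQueue.put((16, 'Suzuka Nakamoto'))   # lower age = higher priority
singersQueue.put((17, 'Ayaka Sasaki'))
singersQueue.put((15, 'Moa Kikuchi'))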
>>> singersQueue.get()
(15, 'Moa Kikuchi')
>>> singersQueue.get()
(16, 'Suzuka Nakamoto')
>>> singersQueue.get()
(17, 'Ayaka Sasaki')
For our task of building a hierarchical clusterer, we will put the clusters in a priority queue.
The priority will be the shortest distance to a cluster's nearest neighbor. Using our dog breed
example, we will put the Border Collie in our queue recording that its nearest neighbor is the
Portuguese Water Dog at a distance of 0.232. We put similar entries into the queue for the
other breeds:
Priority Queue
  cluster: (Border Collie)          neighbor: Portuguese Water Dog   distance: 0.232
  cluster: (Chihuahua)              neighbor: Yorkshire Terrier      distance: 0.362
  cluster: (Portuguese Water Dog)   neighbor: Border Collie          distance: 0.232
  etc.
We will get the two entries with the shortest distance, making sure we have a matching pair.
In this case we get the entries for Border Collie and Portuguese Water Dog. Next, we join the
clusters into one cluster. In this case, we create a Border Collie - Portuguese Water Dog
cluster. And put that cluster on the queue:
8-14
CLUSTERING
Priority Queue
  cluster: (Border Collie, Portuguese Water Dog)   neighbor: Brittany Spaniel    distance: 0.463
  cluster: (Chihuahua)                             neighbor: Yorkshire Terrier   distance: 0.362
  etc.
And repeat until there is only one cluster on the queue. The entries we will put on the queue
need to be slightly more complex than those used in this example. So let's look at this
example in more detail.
breed,height (inches),weight (pounds)
Border Collie,20,45
Boston Terrier,16,20
Brittany Spaniel,18,35
Bullmastiff,27,120
Chihuahua,8,8
German Shepherd,25,78
Golden Retriever,23,70
Great Dane,32,160
Portuguese Water Dog,21,50
Standard Poodle,19,65
Yorkshire Terrier,6,7
The data in this file is read into a list called, not surprisingly, data. The list data saves the
information by column. Thus, data[0] is a list containing the breed names (data[0][0] is
the string Border Collie, data[0][1] is Boston Terrier and so on). data[1] is a list
8-15
containing the height values, and data[2] is the weight list. All the data except that in the
first column is converted into floats. For example, data[1][0] is the float 20.0 and
data[2][0] is the float 45. Once the data is read in, it is normalized. Throughout the
description of the algorithm I will use the term index to refer to the row number of the
instance (for example, Border Collie is index 0, Boston Terrier is index 1, and Yorkshire
Terrier is index 10).
For each breed we keep a dictionary of its distances to all the other breeds. For the Border
Collie (index 0) that dictionary is:

{1: ((0, 1), 1.024), 2: ((0, 2), 0.463), 3: ((0, 3), 2.521),
 4: ((0, 4), 2.417), 5: ((0, 5), 1.317), 6: ((0, 6), 0.907),
 7: ((0, 7), 3.985), 8: ((0, 8), 0.232), 9: ((0, 9), 0.609),
 10: ((0, 10), 2.756)}

That is, the distance between the Border Collie (index 0) and the Boston Terrier (index 1) is
1.024, the distance between the Border Collie and the Brittany Spaniel is 0.463, and so on.
We will also keep track of the Border Collie's nearest neighbor and the distance to that
nearest neighbor:

closest distance: 0.232
nearest pair: (0, 8)
The problem of identical distances, and what's with all those tuples?

You may have noticed that in the table on page 8-7, the distance between the Portuguese
Water Dog and the Standard Poodle and the distance between the Boston Terrier and the
Brittany Spaniel are the same: 0.566. If we retrieve items from the priority queue based on
distance there is a possibility that we will retrieve Standard Poodle and Boston Terrier and
join them in a cluster, which would be an error. To prevent this error we will use a tuple
containing the indices (based on the data list) of the two breeds that the distance
represents. For example, Portuguese Water Dog is entry 8 in our data and the Standard
8-16
CLUSTERING
Poodle is entry 9, so the tuple will be (8,9). This tuple is added to the nearest neighbor list.
The nearest neighbor for the poodle will be:
['Portuguese Water Dog', 0.566, (8,9)]
and the nearest neighbor for the Portuguese Water Dog will be:
['Standard Poodle', 0.566, (8,9)]
By using this tuple, when we retrieve items from the queue we can see if they are a matching
pair.
You can see that if the first items in the tuples match, Python uses the next item to break the
tie. In the case of all those 15 year olds, the entries are retrieved based on the next item, the
person's name. And since these are strings, they are ordered alphabetically. Thus the entry
for Ayaka Sasaki is retrieved before Moa Kikuchi, and Moa is retrieved before Suzuka, who
is retrieved before Yui.
8-17
In our case of hierarchical clustering, we use the distance between breeds as the primary
priority. To resolve ties we will use an index number: the first element we put on the queue
will have an index of 0, the second element an index of 1, the third, 2, and so on. The
complete entry we add to the queue will be of the form

(distance to nearest neighbor, index number, [current cluster, nearest neighbor, distance dictionary])
We initialize the priority queue by placing on the queue an entry like this for each breed. We
then get the two entries with the smallest distance (the Border Collie and the Portuguese
Water Dog) and join them into a new cluster.
Next we compute the distance of this new cluster to all the other dog breeds except those in
the new cluster. We do this by merging the distance dictionaries of the two initial clusters in
the following way. Let's call the distance dictionary of the first item we get from the queue
distanceDict1, the distance dictionary of the second item we get from the queue
distanceDict2, and the distance dictionary we are constructing for the new cluster
newDistanceDict.
8-18
CLUSTERING
The complete entry that will be placed on the queue as a result of merging the Border Collie
and the Portuguese Water Dog will be

(0.4634184292111748, 11, [('Border Collie', 'Portuguese Water Dog'),
  [2, 0.4634184292111748, (0, 2)],
  {1: ((0, 1), 1.0244831578726061), 2: ((0, 2), 0.4634184292111748),
   3: ((3, 8), 2.306550008240866), 4: ((0, 4), 2.4170099809294157),
   5: ((5, 8), 1.0882157079364436), 6: ((6, 8), 0.6846961944627522),
   7: ((7, 8), 3.7658290695451373), 9: ((8, 9), 0.5662258734585477),
   10: ((0, 10), 2.756155583828758)}])
8-19
Code It
8-20
CLUSTERING
Code It - solution
Remember:
This is only my solution and not
necessarily the best solution. You
might have come up with a better one!
"""
Example code for hierarchical clustering
"""
def getMedian(alist):
"""get median value of list alist"""
tmp = list(alist)
tmp.sort()
alen = len(tmp)
if (alen % 2) == 1:
return tmp[alen // 2]
else:
return (tmp[alen // 2] + tmp[(alen // 2) - 1]) / 2
def normalizeColumn(column):
"""Normalize column using Modified Standard Score"""
median = getMedian(column)
asd = sum([abs(x - median) for x in column]) / len(column)
result = [(x - median) / asd for x in column]
return result
class hClusterer:
""" this clusterer assumes that the first column of the data is a label
not used in the clustering. The other columns contain numeric data"""
def __init__(self, filename):
file = open(filename)
self.data = {}
self.counter = 0
self.queue = PriorityQueue()
lines = file.readlines()
8-21
file.close()
header = lines[0].split(',')
self.cols = len(header)
self.data = [[] for i in range(len(header))]
for line in lines[1:]:
cells = line.split(',')
toggle = 0
for cell in range(self.cols):
if toggle == 0:
self.data[cell].append(cells[cell])
toggle = 1
else:
self.data[cell].append(float(cells[cell]))
# now normalize number columns (that is, skip the first column)
for i in range(1, self.cols):
self.data[i] = normalizeColumn(self.data[i])
###
###
###
###
###
###
###
###
###
###
###
###
###
###
###
8-22
CLUSTERING
if dist < minDistance:
minDistance = dist
nearestNeighbor = j
nearestNum = j
# create nearest Pair
if i < nearestNeighbor:
nearestPair = (i, nearestNeighbor)
else:
nearestPair = (nearestNeighbor, i)
# put instance on priority queue
self.queue.put((minDistance, self.counter,
[[self.data[0][i]], nearestPair, neighbors]))
self.counter += 1
def cluster(self):
done = False
while not done:
topOne = self.queue.get()
nearestPair = topOne[2][1]
if not self.queue.empty():
nextOne = self.queue.get()
nearPair = nextOne[2][1]
tmp = []
##
## I have just popped two elements off the queue,
## topOne and nextOne. I need to check whether nextOne
## is topOne's nearest neighbor and vice versa.
## If not, I will pop another element off the queue
## until I find topOne's nearest neighbor. That is what
## this while loop does.
##
while nearPair != nearestPair:
tmp.append((nextOne[0], self.counter, nextOne[2]))
self.counter += 1
nextOne = self.queue.get()
nearPair = nextOne[2][1]
##
## this for loop pushes the elements I popped off in the
## above while loop.
##
8-23
minDistance = 99999
nearestPair = ()
nearestNeighbor = ''
merged = {}
nNeighbors = nextOne[2][2]
for (key, value) in topOne[2][2].items():
if key in nNeighbors:
if nNeighbors[key][1] < value[1]:
dist = nNeighbors[key]
else:
dist = value
if dist[1] < minDistance:
minDistance = dist[1]
nearestPair = dist[0]
nearestNeighbor = key
merged[key] = dist
if merged == {}:
return curCluster
else:
self.queue.put( (minDistance, self.counter,
[curCluster, nearestPair, merged]))
self.counter += 1
8-24
CLUSTERING
8-25
        traverse(T[1], h-sep, 0)
    traverse(T, maxHeight(T), -1)


filename = '//Users/raz/Dropbox/guide/pg2dm-python/ch8/dogs.csv'

hg = hClusterer(filename)
cluster = hg.cluster()
printDendrogram(cluster)
Chihuahua -------------------------------+
|--+
Yorkshire Terrier -----------------------+ |
|-Great Dane ------------------------------+ |
|--+
Bullmastiff --------------------------+ |
|--+
German Shepherd ----------------+
|
|--+ |
Golden Retriever ---------------+ | |
|--+
Standard Poodle ----------------+ |
|--+
Boston Terrier --------------+ |
|--+
Brittany Spaniel ---------+ |
|--+
Border Collie ---------+ |
|--+
Portuguese Water Dog --+
8-26
CLUSTERING
you try!

Breakfast Cereals
8-27
To run the clusterer on this dataset we only needed to change the filename from dogs.csv to
cereal.csv. Here is an abbreviated version of the results:
Mueslix Crispy Blend --------------------------------------------------------------------+
|--+
Muesli Raisins & Almonds -------------------------------------------------------------+ |
|--+
Muesli Peaches & Pecans --------------------------------------------------------------+
...
Lucky Charms ----------+
|--+
Fruity Pebbles --+
|
|--+ |
Trix ------------+ | |
|--+
Cocoa Puffs -----+ |
|--+
Count Chocula ---+
Trix is most similar to Fruity Pebbles. (I recommend you confirm this by running out right now and
buying a box of each.) Perhaps not surprisingly, Muesli Raisins & Almonds is closest to Muesli
Peaches & Pecans.
8-28
CLUSTERING
Introducing ... k-means clustering

k-means clustering is The Most Popular clustering algorithm!

K-means is cool!
8-29
8-30
CLUSTERING
Suppose we want to cluster the following eight points into k = 2 groups:

(1, 2)  (1, 4)  (2, 2)  (2, 3)  (4, 2)  (4, 4)  (5, 1)  (5, 3)

After the initial centroids are picked and each point is assigned to its closest centroid, the
clusters are:

CLUSTER 1: (1, 2)  (1, 4)  (2, 3)
CLUSTER 2: (2, 2)  (4, 2)  (4, 4)  (5, 1)  (5, 3)
8-32
CLUSTERING
Next we update each centroid to be the mean of the points in its cluster, and recompute, for
every point, the (Manhattan) distance to each centroid:

point     dist. to centroid 1    dist. to centroid 2
(1, 2)    1.33                   3.4
(1, 4)    1.33                   4.6
(2, 2)    1.67                   2.4
(2, 3)    0.67                   2.6
(4, 2)    3.67                   0.4
(4, 4)    3.67                   1.6
(5, 1)    5.67                   2.4
(5, 3)    3.67                   1.6

Each point is re-assigned to the cluster with the closer centroid:

CLUSTER 1: (1, 2)  (1, 4)  (2, 2)  (2, 3)
CLUSTER 2: (4, 2)  (4, 4)  (5, 1)  (5, 3)

We update the centroids again and recompute the distances:

point     dist. to centroid 1    dist. to centroid 2
(1, 2)    1.25                   4.0
(1, 4)    1.75                   5.0
(2, 2)    1.25                   3.0
(2, 3)    0.75                   3.0
(4, 2)    3.25                   1.0
(4, 4)    3.75                   2.0
(5, 1)    5.25                   2.0
(5, 3)    3.75                   1.0

This time no point changes clusters, so the algorithm stops with the final clusters:

CLUSTER 1: (1, 2)  (1, 4)  (2, 2)  (2, 3)
CLUSTER 2: (4, 2)  (4, 4)  (5, 1)  (5, 3)
8-35
8-36
K-means is simple!
CLUSTERING
Hill Climbing
Now let's consider using the algorithm with the graph on the following page.

Here, things don't work out as expected. If we start at A on the graph...
Thus, this simple version of the hill-climbing algorithm is not guaranteed to reach the
optimal solution.
The k-means clustering algorithm is like this. There is no guarantee
that it will find the optimal division of the data into clusters. Why?
8-38
CLUSTERING
SSE or Scatter
To determine the quality of a set of clusters we can use the sum of the squared error
(SSE). This is also called scatter. Here is how to compute it: for each point we will square
the distance from that point to its centroid, then add all those squared distances together.
More formally,

SSE = Σ (i = 1 to k)  Σ (x in cluster i)  dist(ci, x)²
Let's dissect that. In the first summation sign we are iterating over the clusters. So initially i
equals cluster 1, then i equals cluster 2, up to i equals cluster k. The next summation sign
iterates over the points in that cluster; something like, for each point x in cluster i. Dist is
whatever distance formula we are using (for example, Manhattan, or Euclidean). So we
compute the distance between that point, x, and the centroid for the cluster ci, square that
distance and add it to our total.

Let's say we run our k-means algorithm twice on the same data and for each run we pick a
different set of random initial centroids. Is the set of clusters that were computed during the
first run worse or better than the set computed during the second run? To answer that
question we compute the SSE for both sets of clusters. The set with the smaller SSE is the
better of the two.
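As a small sketch of that computation (assuming clusters is a list of lists of points, centroids is the matching list of centroid points, and dist is whatever distance function we are using; none of these names come from the book's code):

def sse(clusters, centroids, dist):
    """sum of the squared distance from each point to its cluster's centroid"""
    total = 0
    for i, cluster in enumerate(clusters):
        for x in cluster:
            total += dist(centroids[i], x) ** 2
    return total

# for example, with Euclidean distance in 2D:
# dist = lambda c, x: ((c[0]-x[0])**2 + (c[1]-x[1])**2) ** 0.5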
8-39
Time to get coding! Here is my Python code for a basic k-means clusterer.
import math
import random


def getMedian(alist):
    """get median of list"""
    tmp = list(alist)
    tmp.sort()
    alen = len(tmp)
    if (alen % 2) == 1:
        return tmp[alen // 2]
    else:
        return (tmp[alen // 2] + tmp[(alen // 2) - 1]) / 2


def normalizeColumn(column):
    """normalize the values of a column using Modified Standard Score
    that is (each value - median) / (absolute standard deviation)"""
    median = getMedian(column)
    asd = sum([abs(x - median) for x in column]) / len(column)
    result = [(x - median) / asd for x in column]
    return result
class kClusterer:
    """ Implementation of kMeans Clustering
    This clusterer assumes that the first column of the data is a label
    not used in the clustering. The other columns contain numeric data
    """

    def __init__(self, filename, k):
        """ k is the number of clusters to make
        This init method:
           1. reads the data from the file named filename
           2. stores that data by column in self.data
           3. normalizes the data using Modified Standard Score
           4. randomly selects the initial centroids
           5. assigns points to clusters associated with those centroids
        """
        file = open(filename)
        self.data = {}
        self.k = k
        self.counter = 0
        self.iterationNumber = 0
        # used to keep track of % of points that change cluster membership
        # in an iteration
        self.pointsChanged = 0
        # Sum of Squared Error
        self.sse = 0
        #
        # read data from file
        #
        lines = file.readlines()
        file.close()
        header = lines[0].split(',')
        self.cols = len(header)
        self.data = [[] for i in range(len(header))]
        # we are storing the data by column.
        # For example, self.data[0] is the data from column 0.
        # self.data[0][10] is the column 0 value of item 10.
        for line in lines[1:]:
            cells = line.split(',')
            toggle = 0
            for cell in range(self.cols):
                if toggle == 0:
                    self.data[cell].append(cells[cell])
                    toggle = 1
                else:
                    self.data[cell].append(float(cells[cell]))

        self.datasize = len(self.data[1])
        self.memberOf = [-1 for x in range(len(self.data[1]))]
        #
        # now normalize number columns
        #
        for i in range(1, self.cols):
            self.data[i] = normalizeColumn(self.data[i])

        # select random centroids from existing points
        random.seed()
        self.centroids = [[self.data[i][r] for i in range(1, len(self.data))]
                          for r in random.sample(range(len(self.data[0])),
                                                 self.k)]
        self.assignPointsToCluster()
8-41
    def updateCentroids(self):
        """Using the points in the clusters, determine the centroid
        (mean point) of each cluster"""
        members = [self.memberOf.count(i)
                   for i in range(len(self.centroids))]
        self.centroids = [[sum([self.data[k][i]
                                for i in range(len(self.data[0]))
                                if self.memberOf[i] == centroid]) / members[centroid]
                           for k in range(1, len(self.data))]
                          for centroid in range(len(self.centroids))]

    def assignPointsToCluster(self):
        """ assign each data point to a cluster"""
        self.pointsChanged = 0
        self.sse = 0
        self.memberOf = [self.assignPointToCluster(i)
                         for i in range(len(self.data[1]))]
8-42
CLUSTERING
    def kCluster(self):
        """the method that performs the clustering; it iterates
        until the number of points that change cluster membership
        is less than 1%.
        """
        done = False
        while not done:
            self.iterationNumber += 1
            self.updateCentroids()
            self.assignPointsToCluster()
            #
            # we are done if fewer than 1% of the points change clusters
            #
            if float(self.pointsChanged) / len(self.memberOf) < 0.01:
                done = True
        print("Final SSE: %f" % self.sse)

    def showMembers(self):
        """Display the results"""
        for centroid in range(len(self.centroids)):
            print ("\n\nClass %i\n========" % centroid)
            for name in [self.data[0][i] for i in range(len(self.data[0]))
                         if self.memberOf[i] == centroid]:
                print (name)


##
## RUN THE K-MEANS CLUSTERER ON THE DOG DATA USING K = 3
##
km = kClusterer('dogs2.csv', 3)
km.kCluster()
km.showMembers()
Let's dissect that code a bit!
8-43
As with our code for the hierarchical clusterer, we are storing the data by column. Consider
our dog breed data. If we represent the data in spreadsheet form, it would likely look like this
(the height and weight are normalized):
breed               height     weight
Border Collie       0          -0.1455
Boston Terrier      -0.7213    -0.873
Brittany Spaniel    -0.3607    -0.4365
Bullmastiff         1.2623     2.03704
German Shepherd     0.9016     0.81481
...                 ...        ...
And if we were to transfer this data to Python we would likely make a list that looks like the
following:
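A sketch of that row-first version (the numbers are the normalized values from the table above):

data = [['Border Collie',     0,       -0.1455],
        ['Boston Terrier',   -0.7213,  -0.873],
        ['Brittany Spaniel', -0.3607,  -0.4365],
        ['Bullmastiff',       1.2623,   2.03704],
        ['German Shepherd',   0.9016,   0.81481],
        ...]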
So we are storing the data by row. This seems like the common sense approach and the one
we have been using throughout the book. Alternatively, we can store the data column first:
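And a sketch of the column-first alternative:

data = [['Border Collie', 'Boston Terrier', 'Brittany Spaniel',
         'Bullmastiff', 'German Shepherd', ...],           # column 0: breed
        [0, -0.7213, -0.3607, 1.2623, 0.9016, ...],        # column 1: height
        [-0.1455, -0.873, -0.4365, 2.03704, 0.81481, ...]] # column 2: weight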
8-44
CLUSTERING
This is what we did for the hierarchical clusterer and what we are doing here for k-means.
The benefit of this approach is that it makes implementing many of the math functions
easier. We can see this in the first two procedures in the code above, getMedian and
normalizeColumn. Because we stored the data by column, these procedures take simple
lists as arguments.
>>> normalizeColumn([8, 6, 4, 2])
[1.5, 0.5, -0.5, -1.5]
The constructor method __init__ takes as arguments the filename of the data file and k,
the number of clusters to construct. It reads the data from the file and stores it by
column. It normalizes the data using the normalizeColumn procedure, which implements
the Modified Standard Score method. Finally, it selects k elements from this data as the
initial centroids and assigns each point to a cluster depending on that point's distance to the
initial centroids. It does this assignment using the method assignPointsToCluster.

The method kCluster actually performs the clustering by repeatedly calling updateCentroids
(which computes the mean of each cluster) and assignPointsToCluster, until fewer than 1%
of the points change clusters. The method showMembers simply displays the results.

Running the code on the dog breed data yields the following results:
Class 0
========
Bullmastiff
Great Dane
8-45
Class 1
========
Boston Terrier
Chihuahua
Yorkshire Terrier
Class 2
========
Border Collie
Brittany Spaniel
German Shepherd
Golden Retriever
Portuguese Water Dog
Standard Poodle
Wow! For this small dataset the clusterer does extremely well.
You try
How well does the k-means clusterer work with the cereal dataset with k = 4?
Do the sweet cereals cluster together (Cap'n Crunch, Cocoa Puffs, Froot Loops, Lucky Charms)?
Do the bran cereals cluster together (100% Bran, All-Bran, All-Bran with Extra Fiber, Bran Chex)?
What does Cheerios cluster with?
Try the clusterer with the auto mpg dataset with different values for k (for example, k = 8).
Does this follow your expectations of how these cars should be grouped?
8-46
CLUSTERING
How well does the k-means clusterer work with the cereal dataset with k = 4?
Your results may vary from mine but here is what I found out.

Do the sweet cereals cluster together (Cap'n Crunch, Cocoa Puffs, Froot Loops, Lucky Charms)?
Yes, all these sweet cereals (plus Count Chocula, Fruity Pebbles, and others) are in the same sweet
cluster.

Do the bran cereals cluster together (100% Bran, All-Bran, All-Bran with Extra Fiber, Bran Chex)?
Again, yes! Included in this cluster are also Raisin Bran and Fruitful Bran.

What does Cheerios cluster with?
Cheerios always seems to be in the same cluster as Special K.

Try the clusterer with the auto mpg dataset with different values for k (for example, k = 8).
Does this follow your expectations of how these cars should be grouped?
The clusterer seems to do a reasonable job on this dataset, but on rare occasions you will notice one or
more of the clusters are empty.
8-47
Consider clustering these points with k = 3. We randomly pick points 1, 7 and 8 as the initial
centroids.¹

(figure: eight numbered points, with points 1, 7, and 8 marked as the initial centroids)

Next we update the centroids (shown by the +).

(figure: the updated centroids and the resulting cluster assignments)

1. For those of you who are not looking at this in color, the pink cluster now contains
points 6 and 7.
In sum, just because we specify how many groups to make does not mean that the k-means
clusterer will produce that many non-empty groups. This may be a good thing. Just looking
at the data above, it appears to be naturally clustered into two groups and our attempt to
cluster the data into three failed. Suppose we have 1,000 instances we would like to cluster
into 10 groups and when we run the clusterer two of the groups are empty. This result may
indicate something about the underlying structure of the data. Perhaps the data does not
naturally divide into ten groups and we can explore other groupings (trying to cluster into
eight groups, for example).
On the other hand, sometimes when we specify 10 clusters we actually want 10 non-empty
clusters. If that is the case, we need to alter the algorithm so it detects an empty cluster. Once
one is detected, the algorithm changes that cluster's centroid to a different point. One
possibility is to change it to the instance that is furthest from its corresponding centroid. (In
the example above, once we detect the pink cluster is empty, we re-assign the pink centroid
to point 1, since point 1 is the point furthest from its corresponding centroid. That is, I
compute the distances from each of the points 1 through 8 to its corresponding centroid
and pick the point that is furthest from its centroid as the new centroid of the empty cluster.)
k-means++

Even the name makes it sound newer, better, faster, and more accurate: a turbocharged k-means!
In the previous section we examined the k-means algorithm in its original form as it was
developed in the late 50s. As we have seen, it is easy to implement and performs well. It is
still the most widely used clustering algorithm on the planet. But it is not without its flaws. A
major weakness in k-means is in the first step where it randomly picks k of the datapoints
to be the initial centroids. As you can probably tell by my bolding and embiggening the word
random, it is the random part that is the problem. Because it is random, sometimes the
initial centroids are a great pick and lead to near optimal clustering. Other times the initial
centroids are a reasonable pick and lead to good clustering. But sometimes (again, because
we pick randomly) the initial centroids are poor, leading to non-optimal
clustering. The k-means++ algorithm fixes this defect by changing the way we pick the initial
centroids. Everything else about k-means remains the same.
Here, roughly, is how k-means++ picks the initial centroids:

1. Select the first centroid at random from the datapoints (just as regular k-means does).
2. For each datapoint dp, compute D(dp), the distance between dp and the nearest centroid
   selected so far.
3. a. In a probability proportional to D(dp) select one datapoint to be a new centroid.
   b. If we do not yet have k centroids, go back to step 2.

Once the k initial centroids are chosen, everything proceeds exactly as in regular k-means.
Let's dissect the meaning of "In a probability proportional to D(dp) select one datapoint to be
a new centroid." To do this, I will present a simple example. Suppose we are in the middle of
this process. We have already selected two initial centroids and are in the process of selecting
another one. So we are on step 3a of the above algorithm. Let's say we have 5 remaining
datapoints and their distances to the 2 centroids (c1 and c2) are as follows:
[Table: for each of the five datapoints dp1 through dp5, the distances Dc1 and Dc2. Dc1 means
distance to centroid 1, Dc2 means distance to centroid 2, and dp1 represents datapoint 1.]

These distances give each datapoint a weight, its share of the probability of being selected as
the next centroid. The weights sum to 1:

         weight
dp1       0.25
dp2       0.40
dp3       0.10
dp4       0.15
dp5       0.10
sum       1.00
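Where do those weights come from? Each datapoint's weight is D(dp), its distance to the nearest centroid chosen so far, divided by the sum of all the D(dp) values. The distance numbers below are invented for illustration (I picked them so the weights come out to match the table above):

# hypothetical (Dc1, Dc2) distances for each datapoint
distances = {"dp1": (5.0, 7.0), "dp2": (9.0, 8.0), "dp3": (2.0, 6.0),
             "dp4": (3.0, 4.0), "dp5": (6.0, 2.0)}

# D(dp) is the distance to the *nearest* centroid
D = {dp: min(dc1, dc2) for dp, (dc1, dc2) in distances.items()}
total = sum(D.values())
weights = {dp: d / total for dp, d in D.items()}
print(weights)
# {'dp1': 0.25, 'dp2': 0.4, 'dp3': 0.1, 'dp4': 0.15, 'dp5': 0.1}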
Let us rough out this idea in Python. Say we have a list of tuples, each containing a datapoint
and its weight:

data = [("dp1", 0.25), ("dp2", 0.4), ("dp3", 0.1),
        ("dp4", 0.15), ("dp5", 0.1)]
The function roulette will now select a datapoint in a probability proportional to its weight:
import random
random.seed()

def roulette(datalist):
    i = 0
    soFar = datalist[0][1]
    ball = random.random()
    while soFar < ball:
        i += 1
        soFar += datalist[i][1]
    return datalist[i][0]
If the function picks with these proportions, we would predict that if we picked 100 times, roughly
25 of them would be dp1; 40 of them would be dp2; 10 of them dp3; 15 dp4; and 10, dp5. Let's
see if that is true:
import collections

results = collections.defaultdict(int)
for i in range(100):
    results[roulette(data)] += 1
print results

{'dp5': 11, 'dp4': 15, 'dp3': 10, 'dp2': 38, 'dp1': 26}
Great! Our function does return datapoints in roughly the correct proportion.
The idea in k-means++ clustering is that, while we still pick the initial centroids randomly,
we prefer centroids that are far away from one another.
Time to do some coding!
Code It
Code It - solution
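Here is one possible way to code the k-means++ centroid selection, reusing the roulette function from the previous page. It is only a sketch built on my own assumptions (standalone functions and plain lists of points rather than methods of the book's clusterer class), so it will not match the downloadable solution line for line.

import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def selectInitialCentroids(points, k):
    # first centroid: picked uniformly at random, as in plain k-means
    centroids = [random.choice(points)]
    while len(centroids) < k:
        # D(dp): distance from each point to its nearest chosen centroid
        dists = [min(euclidean(p, c) for c in centroids) for p in points]
        total = sum(dists)
        # weight each point by D(dp); the original k-means++ paper weights by
        # D(dp) squared, but here we follow the "proportional to D(dp)" rule
        weighted = [(p, d / total) for p, d in zip(points, dists)]
        centroids.append(roulette(weighted))    # roulette defined earlier
    return centroids

Once the initial centroids are chosen this way, the assign-and-update loop of k-means runs unchanged.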
Summary
Clustering is all about discovery. However, the simple examples we have been using in this
chapter may obscure this fundamental idea. After all, we know how to cluster breakfast
cereals without a computer's help: sugary cereals, healthy cereals. And we know how to
cluster car models: a Ford F150 goes in the truck category, a Mazda Miata in the sports car
category, and a Honda Civic in the fuel-efficient category. But consider a task where discovery
IS important.
Brilliant! Maybe I should practice by trying it out on some new data.

Good question! The benefit of k-means is that it is simple and fast, and it is a great choice in
general. It is also a good choice for your first steps in exploring your data, even if you
eventually move to another clustering technique. However, it does not handle outliers well,
although we can remedy this by identifying and removing the outliers.
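If outliers are a concern, one simple approach (a sketch under my own assumptions about how the data is stored, not the book's code) is to cluster once, flag the points that sit unusually far from their own centroid, drop them, and cluster again:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def findOutliers(points, centroids, assignments, threshold=2.0):
    # Flag points whose distance to their own centroid is more than
    # `threshold` standard deviations above the average such distance.
    # This is just one heuristic definition of an outlier.
    dists = [euclidean(p, centroids[c]) for p, c in zip(points, assignments)]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [i for i, d in enumerate(dists) if d > mean + threshold * std]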
Enron
Now you might be thinking "Hmm. This Enron stuff is sort of interesting but what does it have
to do with data mining?"
Well, as part of the investigation, the U.S. Federal Energy Regulatory Commission acquired
600,000 emails from Enron employees. This database is now available to researchers. It may be
the largest email database in the world!
We are going to try to cluster a small part of the Enron corpus. For our simple test corpus, I
have extracted information about who sent email to whom and represented it in table form as
shown here:
           Kay   Chris   Sara   Tana   Steven   Mark
Kay               53      37                     12
Chris       53
Sara        37                  1144             962
Tana                     1144                   1201
Steven
Mark        12            962   1201
In the dataset provided on our website, I've extracted this information for 90 individuals.
Suppose I am interested in clustering people to discover the relationships among these
individuals.
Link analysis: there is an entire subfield of data mining called link analysis devoted to this
type of problem (evaluating relationships among entities) and there are specialized algorithms
devoted to this task.
You try
In the dataset provided on our website, I've extracted this information for 90 individuals.
We are clustering the people based on similarity of email correspondence. If most of my
email correspondence is with Ann, Ben and Clara, and most of yours is with these people as
well, that provides evidence that we are in the same group. The idea is something like this:
between ->     Ann    Ben    Clara    Dongmei    Emily    Frank
my emails      127     25      119
your emails    172     35      123
Because our rows are similar, we cluster together. A problem arises when we add in our
columns:
between ->      me    you    Ann    Ben    Clara    Dongmei    Emily    Frank
my emails        2    190    127     25      119
your emails    190      2    172     35      123
In looking at the "me" column, you corresponded with me 190 times but I only sent myself
email twice. The "you" column is similar. Now when we compare our rows they don't look so
similar. Before I included the "me" and "you" columns the Euclidean distance was 46, and after
I included them it was 269! To avoid this problem, when I compute the Euclidean distance
between two people I eliminate the columns for those two people. This required a slight
change to the distance formula:
def distance(self, i, j):
    # Enron-specific distance formula: skip the columns for the two people
    # being compared (column k == i or k == j)
    sumSquares = 0
    for k in range(1, self.cols):
        if (k != i) and (k != j):
            sumSquares += (self.data[k][i] - self.data[k][j])**2
    return math.sqrt(sumSquares)
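As a quick check on those two numbers (46 and 269), here is the arithmetic on the rows above. The 2s are the number of emails each of us sent to ourselves (I said I only sent myself email twice, and I am assuming the "you" column mirrors that):

import math

# Ann, Ben, Clara counts only
mine  = [127, 25, 119]
yours = [172, 35, 123]
print(math.sqrt(sum((a - b) ** 2 for a, b in zip(mine, yours))))   # about 46.3

# now include the "me" and "you" columns
mine  = [2, 190, 127, 25, 119]
yours = [190, 2, 172, 35, 123]
print(math.sqrt(sum((a - b) ** 2 for a, b in zip(mine, yours))))   # about 269.9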
Here is a portion of the dendrogram produced by running a hierarchical clusterer on this data:
cara.semperger@enron.com ------------------+
|--+
michelle.cash@enron.com ----------------+ |
|--+
patrice.mims@enron.com -----------+
|
|--+ |
soblander@carrfut.com ---------+ | | |
|--+ | |
pete.davis@enron.com -------+ |
| |
|--+
| |
judy.hernandez@enron.com ---+
| |
|--+
mike.carson@enron.com ------------+ |
|--+
chris.dorland@enron.com -------+ |
|--+
benjamin.rogers@enron.com --+ |
|--+
larry.campbell@enron.com ---+
I also performed k-means++ on the data, with k = 8. Here are some of the groups it
discovered:
Class 5
========
chris.germany@enron.com
scott.neal@enron.com
marie.heard@enron.com
leslie.hansen@enron.com
mike.carson@enron.com
Class 6
========
sara.shackleton@enron.com
mark.taylor@enron.com
susan.scott@enron.com
Class 7
========
tana.jones@enron.com
louise.kitchen@enron.com
mike.grigsby@enron.com
david.forster@enron.com
m.presto@enron.com
These results are interesting. Class 5 contains a number of traders. Chris Germany and Leslie
Hansen are traders. Scott Neal is a vice president of trading. Marie Heard is a lawyer. Mike
Carson is a manager of South East trading. The members of Class 7 are also interesting. All I
know about Tana Jones is that she is an executive. Louise Kitchen is President of online
trading. Mike Grigsby was Vice President of Natural Gas. David Forster was a Vice President
of trading. Kevin Presto (m.presto) was also a Vice President and a senior trader.
There are many amazing hidden patterns in this
Enron data. Can you find some? Download the complete
data set and give it a try!
(let me know what you find out)