All Into One ML
4
Learning outcomes
On completion of the course students will be expected to:
Have a good understanding of the fundamental issues
and challenges of machine learning: data collection and
pre-processing, model selection, model complexity, etc.
Have an understanding of the strengths and weaknesses
of many popular machine learning approaches.
Appreciate the underlying mathematical relationships
within and across machine learning algorithms and the
paradigms of supervised and unsupervised learning.
Be able to design and implement various machine
learning algorithms for a range of real-world applications.
5
Resources
Lecture slides
Textbooks:
Machine Learning by Tom M. Mitchell
Introduction to Machine Learning by Ethem Alpaydin, 2nd Edition
Reference book:
An Introduction to Machine Learning by Miroslav Kubat, 2nd Edition
6
Administrative details
Assessment:
Final numeric score will be a weighted combination of the course components (weighting table omitted).
8
Collaboration on Assignments, etc.
Preparing assignments/homework independently is a key
ingredient for understanding the material (and,
consequently, passing the exam :-). So it is highly
recommended that you make a serious effort to solve the
problems on your own.
You may discuss, consult, and collaborate with other students
on the problem sets, but your solutions must be written up
independently, by you.
You are welcome to consult online and offline sources for
your solutions, but you are expected to (a) cite your
references clearly, and (b) produce a write-up of your own.
Recall that Google (or any alternative search engine) is a
double-edged sword.
9
Collaboration on Assignments, etc.
Detected cases of plagiarism will be dealt with severely,
for example by reducing your grade for the whole course
(not just the relevant assignment), canceling the
assignment, and/or reporting the incident to the
appropriate university authority.
If X is suspected of having copied from Y, both will be
regarded as cheaters.
10
Course Structure
We will start from the basic concepts of machine
learning and move towards advanced topics, using
multiple examples from each topic.
This should expose you to some of the
beautiful topics and ideas in the field.
Naturally, we will get into some of the topics in
great depth, together with their corresponding applications.
Each topic will be accompanied by some real-world
examples (given by me) and tasks (to be done by you).
11
Course overview
What is this course, Introduction to Machine Learning, about?
It covers basic concepts of ML, focusing in particular on the core
concepts of supervised and unsupervised learning.
In supervised learning we will discuss algorithms which are
trained on input data labelled with a desired output.
Unsupervised learning aims to discover latent structure in an
input signal where no output labels are available.
Students will learn the algorithms which underpin many popular
machine learning techniques, as well as develop an
understanding of the theoretical relationships between these
algorithms.
12
How to succeed?
Put the effort in
Make good use of the available resources
Attend lectures
Read books and online materials
Lots of practice
ASK!!!
Aim to fully understand new concepts
13
HOW TO FAIL?
Relax too much
Don't assimilate the material given in lectures
Skip lectures
Don't seek help
"Borrow" from "others": they can't help you in the learning, the exam, or the class tests!!!!
14
Why “Learn” ?
Machine learning is programming computers to optimize
a performance criterion using example data or past
experience.
There is no need to “learn” to calculate payroll
Learning is used when:
Human expertise does not exist (navigating on Mars),
Humans are unable to explain their expertise (speech
recognition)
Solution changes in time (routing on a computer network)
Solution needs to be adapted to particular cases (user
biometrics)
15
What We Talk About When We
Talk About“Learning”
Learning general models from data of particular
examples
Data is cheap and abundant (data warehouses, data
marts); knowledge is expensive and scarce.
Example in retail: Customer transactions to consumer
behavior:
People who bought “Blink” also bought “Outliers”
(www.amazon.com)
Build a model that is a good and useful approximation to
the data.
16
A Few Quotes
“A breakthrough in machine learning would be worth
ten Microsofts” (Bill Gates, Chairman, Microsoft)
“Machine learning is the next Internet”
(Tony Tether, Director, DARPA)
“Machine learning is the hot new thing”
(John Hennessy, President, Stanford University)
“Web rankings today are mostly a matter of machine
learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
“Machine learning is going to result in a real revolution”
(Greg Papadopoulos, CTO, Sun)
“Machine learning is today’s discontinuity”
(Jerry Yang, CEO, Yahoo)
17
So What Is Machine Learning?
Automating automation
Getting computers to program themselves
Writing software is the bottleneck
Let the data do the work instead!
18
Traditional Programming:
Data + Program → Computer → Output
Machine Learning:
Data + Output → Computer → Program
19
Magic?
No, more like gardening
Seeds = Algorithms
Nutrients = Data
Gardener = You
Plants = Programs
20
Sample Applications
Web search
Computational biology
Finance
E-commerce
Space exploration
Robotics
Information extraction
Social networks
Debugging
[Your favorite area]
21
ML in a Nutshell
Tens of thousands of machine learning
algorithms exist
Hundreds of new ones every year
Every machine learning algorithm has three
components:
Representation
Evaluation
Optimization
22
Representation
Decision trees
Sets of rules / Logic programs
Instances
Graphical models (Bayes/Markov nets)
Neural networks
Support vector machines
Model ensembles
Etc.
23
Evaluation
Accuracy
Precision and recall
F-measure
Squared error
Likelihood
Posterior probability
Cost / Utility
Margin
Entropy
K-L divergence (Kullback–Leibler divergence)
Etc.
24
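To make a few of these measures concrete, here is a minimal Python sketch (not from the slides; the label vectors are invented) that computes accuracy, precision, recall, and F-measure for a binary classifier:

```python
# Toy evaluation of a binary classifier: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # ground-truth labels (invented)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier's predictions (invented)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy  = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_measure)   # 0.75 0.75 0.75 0.75
```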
Optimization
Combinatorial optimization
E.g.: Greedy search
Convex optimization
E.g.: Gradient descent
Constrained optimization
E.g.: Linear programming
25
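As a minimal illustration of the convex case, the following sketch runs gradient descent on the toy one-dimensional function f(w) = (w - 3)²; the starting point and learning rate are arbitrary choices:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)        # derivative of (w - 3)^2

w, eta = 0.0, 0.1             # initial guess and learning rate (assumed values)
for _ in range(100):
    w -= eta * grad(w)        # step against the gradient
print(w)                      # approaches the minimizer w = 3
```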
Types of Learning
Supervised (inductive) learning
Training data includes desired outputs
goal to make predictions based on examples e.g. email
classification with already labeled emails.
Unsupervised learning
Training data does not include desired outputs
goal to reveal structure in the observed data e.g. cluster similar
documents based on text
(Figure: typical learning tasks include classification and regression (supervised),
clustering and anomaly detection (unsupervised), sequence labeling, …)
26
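To contrast the two paradigms concretely, here is a small sketch using scikit-learn (an assumed library choice; the toy data is invented). The classifier is given labels; the clustering algorithm must discover the groups on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # supervised
from sklearn.cluster import KMeans                   # unsupervised

# Toy data: two well-separated groups in a 2-D feature space.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])                 # desired outputs are available

clf = LogisticRegression().fit(X, y)             # classification: learns from (input, label) pairs
print(clf.predict([[0.1, 0.1], [5.0, 5.0]]))     # -> [0 1]

km = KMeans(n_clusters=2, n_init=10).fit(X)      # clustering: the labels y are never used
print(km.labels_)                                # groups recovered from structure alone
```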
Types of Learning
Semi-supervised learning (SSL)
Training data includes only a few desired outputs
Tries to improve predictions based on examples by
also making use of the additional "unlabeled" examples
e.g. speech analysis, a classic example of the value of
semi-supervised learning models
Reinforcement learning (RL)
Delayed and partial feedback, no explicit
guidance
Rewards come from a sequence of actions
e.g. using RL, AlphaGo Zero was able to learn the
game of Go from scratch by playing against itself.
After 40 days of self-training, AlphaGo Zero was
able to outperform the version of AlphaGo known
as Master, which had defeated world number one Ke Jie.
27
Inductive Learning
Given examples of a function (X, F(X))
Predict function F(X) for new examples X
Discrete F(X): Classification
Continuous F(X): Regression
F(X) = Probability(X): Probability estimation
28
What is Machine Learning?
Optimize a performance criterion using example
data or past experience.
Role of Statistics: Inference from a sample
Role of Computer Science: Efficient algorithms to
solve the optimization problem
represent and evaluate the model for inference
29
ML in Practice
Understanding domain, prior knowledge, and
goals
Data integration, selection, cleaning, pre-
processing, etc.
Learning models
Interpreting results
Consolidating and deploying discovered
knowledge
Loop: the process iterates over these stages.
30
Applications
Association
Supervised Learning
Classification
Regression
Decision tree induction
Rule induction
Instance-based learning
Bayesian learning
Neural networks
Support vector machines
Model ensembles
Learning theory
Unsupervised Learning
Clustering
Dimensionality reduction
Reinforcement Learning
31
Learning Associations
Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y,
where X and Y are products/services.
32
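A minimal sketch of estimating P(Y | X) from transaction data; the baskets and item names below are invented for illustration:

```python
# Each basket is the set of products bought in one transaction (invented data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def p_y_given_x(baskets, x, y):
    """Fraction of baskets containing x that also contain y."""
    with_x = [b for b in baskets if x in b]
    return sum(1 for b in with_x if y in b) / len(with_x)

print(p_y_given_x(baskets, "bread", "butter"))   # 2 of 3 bread baskets -> 0.667
```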
Classification
Example: Credit
scoring
Differentiating
between low-risk
and high-risk
customers from their
income and savings
33
Classification: Applications
Aka Pattern recognition
Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style
Character recognition: Different handwriting styles.
Speech recognition: Temporal dependency.
Medical diagnosis: From symptoms to illnesses
Biometrics: Recognition/authentication using physical and/or behavioral
characteristics: Face, iris, signature, etc
...
34
Face Recognition
Training examples of a person
Test images
ORL dataset,
AT&T Laboratories, Cambridge UK
35
Regression
Example: Price of a used car
x: car attributes
y: price
y = wx + w0
More generally, y = g(x | θ),
where g(·) is the model and θ its parameters.
36
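A minimal numpy sketch of fitting the line y = wx + w0 by least squares; the car data below is invented:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # car attribute, e.g. age in years (invented)
y = np.array([9.0, 7.9, 7.1, 6.2, 4.8])   # observed prices (invented)

w, w0 = np.polyfit(x, y, deg=1)            # least-squares fit of a degree-1 polynomial
print(w, w0)                               # learned parameters of g(x | theta)
print(w * 2.5 + w0)                        # predicted price for a new car with x = 2.5
```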
Supervised Learning: Uses
Prediction of future cases: Use the rule to predict the output for future inputs
Knowledge extraction: The rule is easy to understand
Compression: The rule is simpler than the data it explains
Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
37
Unsupervised Learning
Learning “what normally happens”
No output
Clustering: Grouping similar instances
Example applications
Customer segmentation in CRM software
Image compression: Color quantization
Bioinformatics: Learning motifs
38
Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability, ...
39
Resources: Datasets
UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
Statlib: http://lib.stat.cmu.edu/
Delve: http://www.cs.utoronto.ca/~delve/
40
Resources: Journals
Journal of Machine Learning Research www.jmlr.org
Machine Learning
Neural Computation
Neural Networks
IEEE Transactions on Neural Networks
IEEE Transactions on Pattern Analysis and Machine Intelligence
Annals of Statistics
Journal of the American Statistical Association
...
41
Resources: Conferences
International Conference on Machine Learning (ICML)
European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Uncertainty in Artificial Intelligence (UAI)
Computational Learning Theory (COLT)
International Conference on Artificial Neural Networks (ICANN)
International Conference on AI & Statistics (AISTATS)
International Conference on Pattern Recognition (ICPR)
...
42
CS-13410
Introduction to Machine Learning
by
Mudasser Naseer
Machine Learning
Machine learning is a field of computer science that uses
statistical techniques to give computer systems the ability to
"learn" (e.g., progressively improve performance on a
specific task) with data, without being explicitly
programmed. (Wikipedia)
Previously we defined machine learning as "optimizing a
performance criterion using example data or past
experience".
Now we will define the learning problem in terms of a
computer program.
44
Well-Posed Learning Problems
“A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with
experience E”.
To have a well-defined learning problem, we must identify
these three features: the class of tasks T, the measure of
performance to be improved P, and the source of experience E.
So we can define learning as:
Learning = improving with experience at some task
Improve over task T,
with respect to performance measure P,
based on experience E
45
Learning Problems-Examples
For example, a computer program that learns to play
checkers might improve its performance as measured by
its ability to win at the class of tasks involving playing
checkers games, through experience obtained by playing
games against itself.
T : Play checkers
P : % of games won against opponents
E : playing practice games against itself
46
More Learning Problems
A handwriting recognition learning problem:
Task T:
recognizing and classifying handwritten letters/words
within images
Performance measure P:
percent of letters/words correctly classified
Training experience E:
a database of handwritten letters/words with given
classifications
47
A robot driving learning problem:
Task T:
driving on public four-lane highways using vision sensors
Performance measure P:
average distance traveled before an error (as judged by
human overseer)
Training experience E:
a sequence of images and steering commands recorded
while observing a human driver
48
Designing a Learning System
To illustrate some of the basic design issues and approaches to
machine learning, let us consider designing a program to learn
to play checkers, with the goal of entering it in the world
checkers tournament.
50
Type of Training Experience
Direct or indirect feedback regarding the choices made by the system
Direct: training examples consists of individual checkers board states and
the correct move for each.
Indirect: information consisting of the move sequences and final
outcomes of various games played. Here the learner faces an additional
problem of credit assignment, (determining the degree to which each
move in the sequence deserves credit or blame for the final outcome).
Teacher or not
Teacher: The learner rely on the teacher to select informative board
states and to provide the correct move for each.
No teacher or assisted by teacher: the learner might itself propose some
confusing board states and ask the teacher for the correct move. Or the
learner may have complete control over both the board states and
(indirect) training classifications i.e. playing against itself with no teacher.
51
Type of Training Experience
A problem: is the training experience representative of the performance goal?
In general, learning is most reliable when the training examples follow a
distribution similar to that of future test examples.
Note: most current theories of machine learning rest on the crucial assumption
that the distribution of training examples is identical to the distribution of test
examples. However, this assumption is often violated in practice.
52
A checkers learning problem:
To proceed with our design, let us decide that our system
will train by playing games against itself.
Advantage: No external trainer needed and the system is
allowed to generate as much training data as time permits.
We now have a fully specified learning task.
Task T: playing checkers
Performance measure P: percent of games won in the world
tournament
Training experience E: games played against itself
53
A checkers learning problem:
In order to complete the design we must now choose
1. What exact type of knowledge should be learned?
2. How shall this knowledge be represented?
3. What specific algorithm should be used to learn it (the
learning mechanism)?
54
1. What exact type of knowledge should be learned?
(Choosing the Target Function)
Problem: what type of knowledge needs to be learned, and how
will it be used by the performance program?
Assume our checkers-playing program can generate all the legal
moves from any board state.
The program needs only to learn how to choose the best move from
among these legal moves.
This learning task is representative of a large class of tasks for which
the legal moves that define some large search space are known a priori,
but the best search strategy is not known.
Many optimization problems fall into this class, such as the problems of
scheduling and controlling manufacturing processes, where the
available manufacturing steps are well understood but the best
strategy for sequencing them is not.
55
Choosing the Target Function
Now the problem is transformed into learning a program or
function that chooses the best move for any given board
state. Say
ChooseMove : B → M
where B is the set of legal board states and
M is the set of legal moves.
Thus we have reduced the problem of improving
performance P at task T to the problem of learning some
particular target function such as ChooseMove, which
accepts as input any board from the set of legal
board states B and produces as output some move from
the set of legal moves M.
56
Choosing the Target Function
Given the kind of indirect training experience available here, this
function will turn out to be very difficult to learn.
An alternative, easier target function would be an evaluation
function that assigns a numerical score to any given board state.
Let us call this target function V, such that
V : B → ℜ, where ℜ is the set of real numbers.
This target function is intended to assign higher scores to better board
states. After learning this target function, the system can easily use it
to select the best move from any current board position. This is
accomplished by generating the successor board state produced
by every legal move, then using V to choose the best successor
state and therefore the best legal move.
57
Choosing the Target Function
What exactly should be the value of the target function V for any
given board state?
Any evaluation function that assigns higher scores to better board
states will do the trick. One such target function among many
defines the target value V(b) for an arbitrary board state b in B as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b′), where b′ is
the best final board state that can be achieved starting from b and
playing optimally until the end of the game (assuming the opponent
plays optimally as well).
58
Choosing the Target Function
While this recursive definition specifies a value V(b) for every
board state b, the definition is nonoperational, except in the
trivial cases (cases 1-3) in which the game has already ended.
Determining the value of V(b) for a particular non-final board
state (case 4) requires searching ahead for the optimal line of play,
all the way to the end of the game.
The above definition is therefore not efficiently computable;
we say it is a nonoperational definition.
The goal of learning in this case is to discover an
operational description of V; that is, a description that can be
used by the checkers-playing program to evaluate states and
select moves within realistic time bounds.
59
A checkers learning problem:
In order to complete the design we must now choose
1. What exact type of knowledge should be learned?
2. How shall this knowledge be represented?
3. What specific algorithm should be used to learn it (the
learning mechanism)?
60
Choosing the Target Function
Thus, we have reduced the learning task in this case to
the problem of discovering an operational description of
the ideal target function V.
It is very difficult in general to learn such an operational
form of V perfectly.
We often expect learning algorithms to acquire only some
approximation to the target function (hence the term
function approximation); we call this approximation V̂.
61
2. How shall this knowledge be represented?
(Choosing a Representation for the Target Function)
To represent V̂ we have many choices, such as a table of features, a collection
of rules against features of the board state, a quadratic polynomial, etc.
Let us choose a simple representation: a linear combination of the following board
features:
x1: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be captured on
red's next turn)
x6: the number of red pieces threatened by black
Thus, our learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where w0 through w6 are numerical coefficients, or weights, to be chosen by the
learning algorithm. These weights determine the relative importance of the various
board features in determining the value of the board.
62
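As a concrete rendering of this representation, the following sketch evaluates V̂(b) for a given feature vector. The weight values are placeholders, since choosing them is exactly the learning problem:

```python
def v_hat(board_features, weights):
    """Linear evaluation: weights = (w0, w1, ..., w6), board_features = (x1, ..., x6)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], board_features))

features = (3, 0, 1, 0, 0, 0)                        # x1..x6 for some board state
weights  = (0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5)    # arbitrary illustrative values
print(v_hat(features, weights))                      # the board's estimated value
```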
Partial design of a checkers learning program
Task T: playing checkers
Performance measure P: percent of games won in the world tournament
Training experience E: games played against itself
Target function: V : B → ℜ
Target function representation:
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
The first three items correspond to the specification of the learning task, whereas the
final two items constitute design choices for the implementation of the learning program.
Now the problem of learning a checkers strategy is reduced to the problem of learning
values for the coefficients w0 through w6.
63
CS-13410
Introduction to Machine Learning
by
Mudasser Naseer
A checkers learning problem:
In order to complete the design we must now choose
1. What exact type of knowledge should be learned?
2. How shall this knowledge be represented?
3. What specific algorithm should be used to learn it (the
learning mechanism)?
68
3. What specific algorithm should be used to learn it?
(Choosing a Function Approximation Algorithm)
To learn the target function V̂ we require a set of training
examples, each describing a specific board state b and the
training value Vtrain(b) for b. Thus, each training example is
an ordered pair of the form ⟨b, Vtrain(b)⟩.
The following training example describes a board state b in
which black has won the game (note that x2 = 0 indicates that red
has no remaining pieces), and the target function value is
therefore +100:
⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩
On the next slides we describe a procedure that first derives such
training examples from the indirect training experience, then
adjusts the weights wi to best fit these training examples.
69
3.1 Estimating Training Values
According to our formulation, the only training information
available to our learner is whether the game was
eventually won or lost, whereas we require training
examples that assign specific scores to specific board
states.
It is easy to assign a value to board states that correspond
to the end of the game.
Assigning training values to the more numerous
intermediate board states is a more difficult task: not all
intermediate states of a lost game are bad, and not all
intermediate states of a won game are good.
70
A simple and surprisingly successful approach is to assign
V̂(Successor(b)) to Vtrain(b), i.e., the rule for estimating
training values is
Vtrain(b) ← V̂(Successor(b))
where V̂ is the learner's current approximation to V and
Successor(b) is the board state following the program's
move and the opponent's response.
Notice that we are using the estimated value of
Successor(b) to estimate the value of board state b.
71
3.2 Adjusting the weights
All that remains is to choose the weights wi that best fit
the set of training examples ⟨b, Vtrain(b)⟩.
One common approach is to define the best hypothesis, or
set of weights, as the one that minimizes the squared error E
between the training values and the values predicted by the
hypothesis V̂, i.e., minimize
E ≡ Σ⟨b, Vtrain(b)⟩ ∈ training examples (Vtrain(b) − V̂(b))²
We also need an algorithm that incrementally refines the
weights as new training examples become available.
Several such algorithms exist; Least Mean Squares (LMS) is
one of them.
72
LMS weight update rule
For each observed training example, the LMS rule adjusts the
weights a small amount in the direction that reduces the error on
that training example.
It performs a stochastic gradient-descent search through
the space of possible hypotheses (weight values) to
minimize the squared error E.
For each training example ⟨b, Vtrain(b)⟩:
Use the current weights to calculate V̂(b)
For each weight wi, update it as
wi ← wi + η (Vtrain(b) − V̂(b)) xi
Here η is a small constant (e.g., 0.1) that moderates the
size of the weight update.
73
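A minimal Python sketch of the LMS training loop above; the training pairs are invented, and the bias weight w0 is updated using an implicit constant feature x0 = 1:

```python
def v_hat(x, w):                          # x = (x1..x6), w = (w0..w6)
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def lms_update(w, x, v_train, eta=0.1):
    """One LMS step: wi <- wi + eta * (Vtrain(b) - V̂(b)) * xi."""
    error = v_train - v_hat(x, w)
    w[0] += eta * error                   # bias weight: constant feature x0 = 1
    for i, xi in enumerate(x):
        w[i + 1] += eta * error * xi

w = [0.0] * 7                             # start with all weights at zero
examples = [((3, 0, 1, 0, 0, 0), 100.0),  # a board where black has won  -> +100
            ((0, 3, 0, 1, 0, 0), -100.0)] # a board where black has lost -> -100
for _ in range(50):                       # repeatedly sweep over the examples
    for x, v_train in examples:
        lms_update(w, x, v_train)
print([round(wi, 2) for wi in w])         # weights now fit both training values
```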
The Final Design
The final design of our checkers learning system is
described by the following four modules, which represent
the central components of many learning systems:
The Performance System
The Critic
The Generalizer
The Experiment Generator
74
The Performance System
The module that must solve the given performance task,
in this case playing checkers, by using the learned target
function(s).
It takes an instance of a new problem (a new game) as input
and produces a trace of its solution (the game history) as
output.
We expect its performance to improve as the evaluation
function V̂ becomes increasingly accurate.
75
The Critic
It takes as input the history or trace of the game and
produces as output a set of training examples of the
target function.
In our example, the Critic corresponds to the training rule
Vtrain(b) ← V̂(Successor(b))
76
Generalizer
It takes as input the training examples and produces an
output hypothesis that is its estimate of the target
function.
It generalizes from the specific training examples,
hypothesizing a general function that covers these
examples and cases beyond them.
In our example, the LMS algorithm is the Generalizer, and
the output hypothesis is the function V̂ described by the
learned weights w0, …, w6.
77
The Experiment Generator
It takes as input the current hypothesis (the currently
learned function) and outputs a new problem (i.e., an
initial board state).
Its main role is to pick new practice problems that will
maximize the learning rate of the overall system.
It may follow a very simple strategy, such as always
proposing the same initial game board to begin a new game;
more sophisticated strategies may be used to explore
particular regions of the state space.
78
Final design of the checkers learning program
80
Some concluding discussion
In the design choices above we have constrained the
learning task in a number of ways:
The type of knowledge is restricted to a single linear
evaluation function.
The evaluation function depends only on the six specific
board features provided.
If we improve the representation, our program has a better
chance of learning a good approximation to the true V
function.
81
Some real examples
IBM 701: the first major commercial computer
In 1949, Arthur Samuel (1901-1990) joined IBM's Poughkeepsie lab,
and in 1952 he programmed IBM's first major computer, the IBM
701, to play checkers. It was the first game-playing program to run
on a computer, and the first checkers program.
By 1955, Samuel had done something groundbreaking: he had
created a program that could learn, something that no one had
done before, and this was demonstrated by playing checkers on
television on February 24, 1956.
He published a seminal paper in 1959, titled "Some Studies in
Machine Learning Using the Game of Checkers", where he described
how a machine could look ahead "by evaluating the resulting
board positions much as a human player might do". The computer
started out losing to Samuel and eventually beat him.
Samuel's program was based on Shannon's minimax strategy for
finding the best move from a given current position.
82
Solving Checkers (2007)
After Samuel's work on checkers, there was a false
impression that checkers was a "solved" game. As a result,
researchers moved on to chess and mostly ignored
checkers until Jonathan Schaeffer began working on
Chinook in 1989. Schaeffer's goal was to develop a program
capable of defeating the best checkers player.
In 2007 Schaeffer and his team published a paper in the
journal Science titled "Checkers Is Solved": the program
could no longer be defeated by anyone, human or
otherwise.
83
Solving Checkers (2007)
Marion Franklin Tinsley (February 3, 1927 - April 3, 1995) was an
American mathematician and checkers player, considered the
greatest checkers player who ever lived. Tinsley was world
champion 1955-1958 and 1975-1991, never lost a world
championship match, and lost only seven games (two of them to
the Chinook computer program) in his 45-year career.
Checkers is extremely complex: it has roughly 500 billion billion
(5 x 10²⁰) possible positions, compared to roughly 10⁴⁷ for chess
and 10²⁵⁰ for Go. Even though checkers is simpler than those
games, it is still complex enough that a brute-force-only approach
is impractical.
Checkers is the largest game that has been solved to date, with a
search space of 5 x 10²⁰. "The number of calculations involved
was 10¹⁴, which were done over a period of 18 years. The process
involved about 200 desktop computers at its peak, down to around 50."
84
Backgammon
Tesauro (1992, 1995) reports a design, similar to our final design of
the checkers learning program, for a program that learns to play the
game of Backgammon by learning a very similar evaluation
function over states of the game.
The program represents the learned evaluation function using an
artificial neural network that considers the complete description
of the board state rather than a subset of board features.
After training on over one million self-generated training games,
his program was able to play very competitively against
top-ranked human Backgammon players.
85
A classic example of computer brilliance is Google's "AlphaGo", which can
beat professional players at the Chinese game Go. In recent years AlphaGo
went on a spree, defeating Go masters anonymously. In October 2015, AlphaGo
became the first computer Go program to beat a human professional Go
player without handicaps on a full-sized 19×19 board.
At the beginning of 2017, it was revealed to the public, as well as to the
masters it had defeated, that this mysterious Go player was an AI all along.
Originally, the problem with solving Go was the sheer number of possible
moves (approximately 10¹⁷⁰, which is more than the total number of atoms
in the universe).
A computer cannot brute-force a winning strategy, as a human will have
the upper hand thanks to intuition.
86
Therefore, the Google team took a different approach and developed a system
that could almost be considered a form of synthetic intuition.
The system runs different tasks, such as identifying important moves, and
strikes a balance between tree search, a policy network, and a value network.
This balancing is similar to how the human brain makes decisions, where
different aspects of a situation are considered. For example:
A person playing chess does not always think of every possible move and plan
according to immediate consequences.
The player may make moves that seem bizarre to a computer, such as a self-sacrifice
which in the short term appears to make the player lose but in the long term could be
vital for a victory.
Other techniques include confusion, whereby a losing player may make "crazy"
moves that confuse other players and keep their true strategy hidden.
87
AlphaGo uses a Monte Carlo tree search algorithm to find
its moves, guided by knowledge previously "learned"
by machine learning, specifically by an artificial neural
network (a deep learning method) trained extensively
on both human and computer play.
88
AlphaZero
AlphaZero is a computer program developed by the artificial
intelligence research company DeepMind to master the games
of chess, shogi, and Go. The algorithm uses an approach similar
to AlphaGo Zero.
On December 5, 2017, the DeepMind team released
a preprint introducing AlphaZero, which within 24 hours of
training achieved a superhuman level of play in these three
games by defeating world-champion programs Stockfish, elmo,
and the 3-day version of AlphaGo Zero.
89
AlphaZero
AlphaZero was trained solely via "self-play" using 5,000 first-
generation Tensor Processing Units (TPUs) to generate the
games and 64 second-generation TPUs to train the neural
networks, all in parallel, with no access to opening
books or endgame tables.
After four hours of training, DeepMind estimated AlphaZero
was playing at a higher Elo rating than Stockfish 8; after 9
hours of training, the algorithm defeated Stockfish 8 in a time-
controlled 100-game tournament (28 wins, 0 losses, and 72
draws).
The trained algorithm played on a single machine with four
TPUs.
91
CS-13410
Introduction to Machine Learning
(Concept Learning Ch # 2 by Tom Mitchell)
by
Mudasser Naseer
Concept Learning
• Inducing general functions from specific training examples is a
main issue of machine learning.
• Concept Learning: Acquiring the definition of a general category
from given sample positive and negative training examples of the
category.
• Concept Learning can be seen as a problem of searching
through a predefined space of potential hypotheses for the
hypothesis that best fits the training examples.
• The hypothesis space has a general-to-specific ordering of
hypotheses, and the search can be efficiently organized by taking
advantage of a naturally occurring structure over the hypothesis
space.
93
Concept Learning
• A Formal Definition for Concept Learning:
Inferring a boolean-valued function from training examples of
its input and output.
• An example of concept learning is learning the concept "bird" from
given examples of birds (positive examples) and non-birds (negative
examples).
• We are trying to learn the definition of a concept from given examples.
• Here we have a domain, e.g. animals,
• and the concept is a subset of the domain, e.g. birds ⊆ animals.
• Our task is to acquire an intensional concept description from training
examples.
• Generally we can't look at all objects in the domain.
94
A Concept Learning Task – Enjoy Sport Training Examples
• A set of example days, each described by six attributes.
• The task is to learn to predict the value of EnjoySport for an
arbitrary day, based on the values of its attributes.

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes
         (ATTRIBUTES)                                       (CONCEPT)

• Examples: "days on which my friend Aldo enjoys his favorite water sport"
• Result: a classifier for days = a description of Aldo's behavior
95
EnjoySport – Hypothesis Representation
Many possible representations exists, one is given below
• Each hypothesis h consists of a conjunction of
constraints on the instance attributes.
• Here h will be a vector of six constraints, specifying the
values of the six attributes
(Sky, AirTemp, Humidity, Wind, Water, and Forecast).
96
Hypothesis Representation
• For example, a hypothesis:
Sky AirTemp Humidity Wind Water Forecast
< Sunny, ?, ?, Strong, ?, Same >
• The most general hypothesis, that every day is a positive example:
<?, ?, ?, ?, ?, ?>
• The most specific hypothesis, that no day is a positive example:
< ∅, ∅, ∅, ∅, ∅, ∅ >
• The EnjoySport concept learning task requires learning the set of days for
which EnjoySport = Yes, describing this set by a conjunction of
constraints over the instance attributes.
• We write h(x) = 1 for a day x if x satisfies the description.
• Note that much more expressive languages exist.
97
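One possible Python encoding of this hypothesis representation (an illustrative sketch; the string "0" stands in for the empty constraint ∅):

```python
def satisfies(h, x):
    """h(x) = 1 iff every constraint in h is '?' or matches x; ∅ rejects everything."""
    return all(c == "?" or c == v for c, v in zip(h, x)) and "0" not in h

h = ("Sunny", "?", "?", "Strong", "?", "Same")
x = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
print(satisfies(h, x))   # True: day x is classified as a positive example
```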
Prototypical Concept Learning Task
Given
– Instances X : set of all possible days, each described by the attributes
• Sky – (values: Sunny, Cloudy, Rainy)
• AirTemp – (values: Warm, Cold)
• Humidity – (values: Normal, High)
• Wind – (values: Strong, Weak)
• Water – (values: Warm, Cold)
• Forecast – (values: Same, Change)
– Target Concept (Function) c : EnjoySport : X → {0, 1}
– Hypotheses H : Each hypothesis is described by a conjunction of constraints on
the attributes, e.g. (?, Cold, High, ?, ?, ?).
– Training Examples D : positive and negative examples of the target function:
⟨x1, c(x1)⟩, …, ⟨xm, c(xm)⟩
Determine
– A hypothesis h in H such that h(x) = c(x) for all x in D.
98
The Inductive Learning Hypothesis
• Although the learning task is to determine a hypothesis h
identical to the target concept c over the entire set of instances X,
the only information available about c is its value over the
training examples.
– Inductive learning algorithms can at best guarantee that the output hypothesis fits the
target concept over the training data.
– Lacking any further information, our assumption is that the best hypothesis
regarding unseen instances is the hypothesis that best fits the observed training
data. This is the fundamental assumption of inductive learning.
100
Enjoy Sport - Hypothesis Space
• Assume that Sky has 3 possible values, and the other 5 attributes have 2
possible values each.
• There are 96 (= 3·2·2·2·2·2) distinct instances in X (the instance space).
• There are 5120 (= 5·4·4·4·4·4) syntactically distinct hypotheses in H.
– Two more values per attribute: ? and ∅
• Every hypothesis containing one or more ∅ symbols represents the
empty set of instances; that is, it classifies every instance as negative.
• There are 973 (= 1 + 4·3·3·3·3·3) semantically distinct hypotheses in H.
– Only one more value per attribute (?), plus one hypothesis representing the
empty set of instances.
• Although EnjoySport has a small, finite hypothesis space, most learning
tasks have much larger (even infinite) hypothesis spaces.
– We need efficient search algorithms over the hypothesis space.
101
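The counts above can be checked with a few lines of Python:

```python
values = [3, 2, 2, 2, 2, 2]        # number of possible values per attribute

instances = syntactic = semantic = 1
for v in values:
    instances *= v                 # 3*2*2*2*2*2          = 96 distinct instances
    syntactic *= v + 2             # add "?" and the empty symbol: 5*4^5 = 5120
    semantic  *= v + 1             # add only "?": 4*3^5  = 972 ...
semantic += 1                      # ... plus one all-empty hypothesis = 973

print(instances, syntactic, semantic)   # 96 5120 973
```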
General-to-Specific Ordering of Hypotheses
• Many algorithms for concept learning organize the search through
the hypothesis space by relying on a general-to-specific ordering of
hypotheses.
• By taking advantage of this naturally occurring structure over the
hypothesis space, we can design learning algorithms that exhaustively
search even infinite hypothesis spaces without explicitly enumerating
every hypothesis.
• Consider two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
• Now consider the sets of instances that are classified positive by h1 and
by h2.
– Because h2 imposes fewer constraints on the instance, it classifies more
instances as positive.
– In fact, any instance classified positive by h1 will also be classified positive by h2.
– Therefore, we say that h2 is more general than h1.
102
More-General-Than Relation
• For any instance x in X and hypothesis h in H, we say that x satisfies h
if and only if h(x) = 1.
• More-General-Than-Or-Equal Relation:
Let h1 and h2 be two boolean-valued functions defined over X.
Then h1 is more-general-than-or-equal-to h2 (written h1 ≥ h2) if
and only if any instance that satisfies h2 also satisfies h1.
103
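For this conjunctive representation the relation can be tested attribute by attribute, as in the following sketch (it ignores the vacuous case of hypotheses containing ∅, which cover no instances):

```python
def more_general_or_equal(h1, h2):
    """h1 >= h2: every instance satisfying h2 also satisfies h1."""
    def covers(c1, c2):            # constraint c1 at least as general as constraint c2
        return c1 == "?" or (c1 == c2 and c2 != "?")
    return all(covers(c1, c2) for c1, c2 in zip(h1, h2))

h1 = ("Sunny", "?", "?", "?", "?", "?")
h2 = ("Sunny", "?", "?", "Strong", "?", "?")
print(more_general_or_equal(h1, h2))   # True:  h1 imposes fewer constraints
print(more_general_or_equal(h2, h1))   # False: h2 is strictly more specific
```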
More-General-Relation
105
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
– For each attribute constraint ai in h
• If the constraint ai is satisfied by x, then do nothing
• Else replace ai in h by the next more general constraint that is
satisfied by x
3. Output hypothesis h
106
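A direct Python transcription of FIND-S, run on the EnjoySport training set from the slides (the string "0" again stands for the empty constraint ∅):

```python
def find_s(examples):
    h = ["0"] * 6                        # most specific hypothesis in H
    for x, label in examples:
        if label != "Yes":               # negative examples are simply ignored
            continue
        for i, (c, v) in enumerate(zip(h, x)):
            if c == "0":
                h[i] = v                 # first positive example: copy its values
            elif c != v:
                h[i] = "?"               # generalize any mismatching constraint
    return h

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(examples))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```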
FIND-S Algorithm
Initialize h to the most specific hypothesis in H:
h ← (∅, ∅, ∅, ∅, ∅, ∅)
Upon observing the first training example, it becomes clear that our
hypothesis is too specific. None of the "∅" constraints is satisfied by
this example, so each is replaced by the next more general
constraint that fits the example, i.e.
h ← (Sunny, Warm, Normal, Strong, Warm, Same)
This h is still very specific. The second training example (also positive)
forces the algorithm to further generalize h, replacing with "?" any
attribute value in h that is not satisfied by the new example. The
refined hypothesis is
h ← (Sunny, Warm, ?, Strong, Warm, Same)
After the fourth (positive) example our hypothesis becomes
h ← (Sunny, Warm, ?, Strong, ?, ?)
107
FIND-S Algorithm - Example
(Figure: instances X and hypotheses H, ordered from specific to general.)
x1 = <Sunny Warm Normal Strong Warm Same>, +    h0 = <∅ ∅ ∅ ∅ ∅ ∅>
x2 = <Sunny Warm High Strong Warm Same>, +      h1 = <Sunny Warm Normal Strong Warm Same>
x3 = <Rainy Cold High Strong Warm Change>, -    h2 = h3 = <Sunny Warm ? Strong Warm Same>
x4 = <Sunny Warm High Strong Cool Change>, +    h4 = <Sunny Warm ? Strong ? ?>
Now classify the following examples as Y or N:
A. (Sunny Warm Normal Strong Cool Change) ?
B. (Rainy Cool Normal Light Warm Same) ?
C. (Sunny Warm Normal Light Warm Same) ?
D. (Sunny Cold Normal Strong Warm Same) ?
108
The Role of Negative Examples
Basically, the negative examples are simply ignored!
But consider a negative example like
(Sunny, Warm, High, Strong, Freezing, Change)
This example is incorrectly classified as positive by the
learned hypothesis h = <Sunny, Warm, ?, Strong, ?, ?>.
110
Unanswered Questions by FIND-S Algorithm
• Are the training examples consistent?
– In most practical learning problems there is some chance that the training
examples will contain at least some errors or noise.
– Such inconsistent sets of training examples can severely mislead FIND-S, given
the fact that it ignores negative examples.
– We would prefer an algorithm that could at least detect when the training
data is inconsistent and, preferably, accommodate such errors.
Consistent(h, D) ≡ (∀⟨x, c(x)⟩ ∈ D) h(x) = c(x)
114
Version Spaces
• The Candidate-Elimination algorithm represents the set of
all hypotheses consistent with the observed training examples.
• This subset of all hypotheses is called the version space with
respect to the hypothesis space H and the training examples D,
because it contains all plausible versions of the target concept.
115
List-Then-Eliminate Algorithm
• List-Then-Eliminate algorithm initializes the version space to contain
all hypotheses in H, then eliminates any hypothesis found inconsistent
with any training example.
• The version space of candidate hypotheses thus shrinks as more
examples are observed, until ideally just one hypothesis remains that
is consistent with all the observed examples.
– Presumably, this is the desired target concept.
– If insufficient data is available to narrow the version space to a single hypothesis,
then the algorithm can output the entire set of hypotheses consistent with the
observed data.
• List-Then-Eliminate algorithm can be applied whenever the
hypothesis space H is finite.
– It has many advantages, including the fact that it is guaranteed to output all
hypotheses consistent with the training data.
– Unfortunately, it requires exhaustively enumerating all hypotheses in H - an
unrealistic requirement for all but the most trivial hypothesis spaces.
116
List-Then-Eliminate Algorithm
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example ⟨x, c(x)⟩: remove from VersionSpace
any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
117
Compact Representation of Version Spaces
• A version space can be represented with its general and specific
boundary sets.
• The Candidate-Elimination algorithm represents the version space
by storing only its most general members G and its most specific
members S.
• Given only these two sets S and G, it is possible to enumerate all
members of a version space by generating hypotheses that lie between
these two sets in general-to-specific partial ordering over hypotheses.
• Every member of the version space lies between these boundaries
VS_{H,D} = { h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥ h ≥ s) }
118
Example Version Space
S: { <Sunny, Warm, ?, Strong, ?, ?> }
<Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: { <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?> }
119
Representing Version Spaces
S: { <Sunny, Warm, ?, Strong, ?, ?> }
<Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: { <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?> }
1. The General boundary, G, of version space VS_{H,D} is the set of its maximally
general members that are consistent with the given training set.
2. The Specific boundary, S, of version space VS_{H,D} is the set of its maximally
specific members that are consistent with the given training set.
3. Every member of the version space lies between these boundaries:
VS_{H,D} = { h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥ h ≥ s) }
where x ≥ y means x is more general than or equal to y.
120
Candidate-Elimination Algorithm
• The Candidate-Elimination algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training
examples.
• It begins by initializing the version space to the set of all hypotheses in H; that
is, by initializing the G boundary set to contain the most general hypothesis in
H
G0 ← { <?, ?, ?, ?, ?, ?> }
and initializing the S boundary set to contain the most specific
hypothesis S0 ← { <∅, ∅, ∅, ∅, ∅, ∅> }
• These two boundary sets delimit the entire hypothesis space, because every
other hypothesis in H is both more general than S0 and more specific than
G0.
• As each training example is considered, the S and G boundary sets are
generalized and specialized, respectively, to eliminate from the version space any
hypotheses found inconsistent with the new training example.
• After all the examples have been processed, the computed version space
contains all the hypotheses consistent with these examples, and only these
hypotheses.
121
Candidate-Elimination Algorithm
Input: training set
Output:
• G = maximally general hypotheses in H
• S = maximally specific hypotheses in H
• Algorithm:
• Initialize G to the set of maximally general hypotheses in H
• Initialize S to the set of maximally specific hypotheses in H
• For each training example d, do
– If d is a positive example (we tend to generalize specific hypothesis)
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
– Remove s from S
– Add to S all minimal generalizations h of s such that
» h is consistent with d, and some member of G is more general than h
– Remove from S any hypothesis that is more general than another hypothesis in S
122
Candidate-Elimination Algorithm
– If d is a negative example (we tend to make general hypothesis more
specific)
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
– Remove g from G
– Add to G all minimal specializations h of g such that
» h is consistent with d, and some member of S is more specific than h
– Remove from G any hypothesis that is less general than another hypothesis in G
123
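The following compact Python sketch implements the two cases above for conjunctive hypotheses over finite attribute domains. It is illustrative rather than definitive (a full implementation would handle a few more pruning edge cases), but it reproduces the EnjoySport trace shown on the next slides:

```python
def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general(h1, h2):                 # h1 >= h2, checked attribute by attribute
    return all(a == "?" or b == "0" or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):                 # minimal generalization of s that covers x
    return tuple(v if c == "0" else (c if c == v else "?") for c, v in zip(s, x))

def min_specializations(g, domains, x):   # minimal specializations of g that exclude x
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == "?"
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {("0",) * n}, {("?",) * n}     # "0" plays the role of the empty constraint
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}
            S = {min_generalize(s, x) for s in S}
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not covers(s, x)}
            G = {h for g in G
                 for h in (min_specializations(g, domains, x) if covers(g, x) else [g])
                 if any(more_general(h, s) for s in S)}
            G = {g for g in G if not any(h != g and more_general(h, g) for h in G)}
    return S, G

domains = (("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"))
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(examples, domains)
print(S)   # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print(G)   # {('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')}
```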
Example Trace
S0: {<∅, ∅, ∅, ∅, ∅, ∅>}
S1: {<Sunny, Warm, Normal, Strong, Warm, Same>}
S2: {<Sunny, Warm, ?, Strong, Warm, Same>}
G0 = G1 = G2: {<?, ?, ?, ?, ?, ?>}
Training examples:
1. <Sunny, Warm, Normal, Strong, Warm, Same>, EnjoySport = Yes
2. <Sunny, Warm, High, Strong, Warm, Same>, EnjoySport = Yes
125
Example Trace
S2 = S3: {<Sunny, Warm, ?, Strong, Warm, Same>}
G2: {<?, ?, ?, ?, ?, ?>}
G3: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
Training example:
3. <Rainy, Cold, High, Strong, Warm, Change>, EnjoySport = No
126
Candidate-Elimination Algorithm - Example
• Given that there are six attributes that could be specified to specialize
G2, why are there only three new hypotheses in G3?
• For example, the hypothesis h = <?, ?, Normal, ?, ?, ?> is a minimal
specialization of G2 that correctly labels the new example as a negative
example, but it is not included in G3.
– The reason this hypothesis is excluded is that it is inconsistent with S2.
– The algorithm determines this simply by noting that h is not more general than the
current specific boundary, S2.
• In fact, the S boundary of the version space forms a summary of the
previously encountered positive examples that can be used to determine
whether any given hypothesis is consistent with these examples.
• The G boundary summarizes the information from previously
encountered negative examples. Any hypothesis more specific than G is
assured to be consistent with past negative examples
127
Example Trace
S3: {<Sunny, Warm, ?, Strong, Warm, Same>}
S4: {<Sunny, Warm, ?, Strong, ?, ?>}
G3: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
G4: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
Training example:
4. <Sunny, Warm, High, Strong, Cool, Change>, EnjoySport = Yes
128
Candidate-Elimination Algorithm - Example
• The fourth training example further generalizes the S boundary
of the version space.
• It also results in removing one member of the G boundary,
because this member fails to cover the new positive example.
– To understand the rationale for this step, it is useful to consider why
the offending hypothesis must be removed from G.
– Notice it cannot be specialized, because specializing it would not make it
cover the new example.
– It also cannot be generalized, because by the definition of G, any more
general hypothesis will cover at least one negative training example.
– Therefore, the hypothesis must be dropped from the G boundary, thereby
removing an entire branch of the partial ordering from the version space of
hypotheses remaining under consideration
129
Candidate-Elimination Algorithm – Example
Final Version Space
• After processing these four examples, the boundary sets S4 and
G4 delimit the version space of all hypotheses consistent with
the set of incrementally observed training examples.
• This learned version space is independent of the sequence in
which the training examples are presented (because in the end it
contains all hypotheses consistent with the set of examples).
• As further training data is encountered, the S and G boundaries
will move monotonically closer to each other, delimiting a
smaller and smaller version space of candidate hypotheses.
130
Properties of the two Sets
S can be seen as the summary of the positive examples
Any hypothesis more general than S covers all positive
examples
Other hypotheses fail to cover at least one positive example
G can be seen as the summary of the negative examples
Any hypothesis more specific than G covers no previous
negative example
Other hypotheses cover at least one negative example
131
Resulting Version Space
S: {<Sunny, Warm, ?, Strong, ?, ?>}
<Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
132
Properties
• If there is a consistent hypothesis then the algorithm will
converge to S = G = {h} when enough examples are provided
• False examples may cause the removal of the correct h
• If the examples are inconsistent, S and G become empty
• This can also happen, when the concept to be learned is not in H
133
What Next Training Example?
S: {<Sunny, Warm, ?, Strong, ?, ?>}
<Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
How should a new instance be classified while the version space
still contains several hypotheses? Options:
– Rejection, or
– Majority vote (used in the following)
Instance A is classified as a positive instance by every hypothesis in the
current version space.
Because the hypotheses in the version space unanimously agree that this is
a positive instance, the learner can classify instance A as positive with the
same confidence it would have if it had already converged to the single,
correct target concept.
Regardless of which hypothesis in the version space is eventually found to
be the correct target concept, it is already clear that it will classify instance
A as a positive example.
136
How Can Partially Learned Concepts Be Used?
Notice furthermore that we need not enumerate every hypothesis in the version
space in order to test whether each classifies the instance as positive.
– This condition will be met if and only if the instance satisfies every member of S.
– The reason is that every other hypothesis in the version space is at least as general as
some member of S.
– By our definition of more-general-than, if the new instance satisfies all members
of S it must also satisfy each of these more general hypotheses.
Instance B is classified as a negative instance by every hypothesis in the
version space.
– This instance can therefore be safely classified as negative, given the partially
learned concept.
– An efficient test for this condition is that the instance satisfies none of the members of
G.
137
How Can Partially Learned Concepts Be Used?
Instance C: Half of the version space hypotheses classify instance C as
positive and half classify it as negative.
– Thus, the learner cannot classify this example with confidence until further
training examples are available.
Instance D is classified as positive by two of the version space hypotheses
and negative by the other four hypotheses.
– In this case we have less confidence in the classification than in the
unambiguous cases of instances A and B.
– Still, the vote is in favor of a negative classification, and one approach we
could take would be to output the majority vote, perhaps with a confidence
rating indicating how close the vote was.
138
Example
Use candidate-elimination algorithm to build Specific and
General boundaries for the following data set.
139
Example-cont…
S0: { <∅, ∅, ∅, ∅, ∅> }
S1: { <Japan, Honda, Blue, 1980, Economy> }
S2: { <Japan, Honda, Blue, 1980, Economy> }
S3: { <Japan, ?, Blue, ?, Economy> }
S4: { <Japan, ?, Blue, ?, Economy> }
S5: { <Japan, ?, ?, ?, Economy> }
S6: { <Japan, ?, ?, ?, Economy> }
G6: { <Japan, ?, ?, ?, Economy> }
G5: { <Japan, ?, ?, ?, Economy> }
G4: { <?, ?, Blue, ?, ?>, <Japan, ?, ?, ?, Economy>, <?, ?, Blue, ?, Economy> } (remove
from G any hypothesis that is less general than another hypothesis in G)
G3: { <?, ?, Blue, ?, ?>, <?, ?, ?, ?, Economy> }
G2: { <?, Honda, ?, ?, ?>, <?, ?, Blue, ?, ?>, <?, ?, ?, 1980, ?>, <?, ?, ?, ?, Economy> }
G1: { <?, ?, ?, ?, ?> }
G0: { <?, ?, ?, ?, ?> }
140
Example-Cont…
What will happen if we have another example in our data?
141
Inductive Bias - Fundamental Questions
for Inductive Inference
The Candidate-Elimination Algorithm will converge toward the true target
concept provided it is given accurate training examples and provided its
initial hypothesis space contains the target concept.
142
Inductive Bias - A Biased Hypothesis Space
In the EnjoySport example, we restricted the hypothesis space to include
only conjunctions of attribute values.
– Because of this restriction, the hypothesis space is unable to represent even simple
disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy."
• From the first two examples, S2 : <?, Warm, Normal, Strong, Cool, Change>
• This is inconsistent with the third example, and there are no hypotheses
consistent with all three examples.
PROBLEM: We have biased the learner to consider only conjunctive hypotheses.
We require a more expressive hypothesis space.
143
Inductive Bias - An Unbiased Learner
The obvious solution to the problem of assuring that the target concept is in the
hypothesis space H is to provide a hypothesis space capable of representing
every teachable concept.
– That is, every possible subset of the instances X: the power set of X.
144
Inductive Bias - An Unbiased Learner :
Problem
• Let the hypothesis space H be the power set of X.
– A hypothesis can then be represented with disjunctions, conjunctions, and negations of our
earlier hypotheses.
– The target concept "Sky = Sunny or Sky = Cloudy" could then be described as
<Sunny, ?, ?, ?, ?, ?> ∨ <Cloudy, ?, ?, ?, ?, ?>
145
Inductive Bias –
Fundamental Property of Inductive Inference
• A learner that makes no a priori assumptions regarding the identity
of the target concept has no rational basis for classifying any unseen
instances.
146
Inductive Bias – Three Learning Algorithms
ROTE-LEARNER: Learning corresponds simply to storing each observed training
example in memory. Subsequent instances are classified by looking them up in
memory. If the instance is found in memory, the stored classification is returned.
Otherwise, the system refuses to classify the new instance.
Inductive Bias: No inductive bias
CANDIDATE-ELIMINATION: New instances are classified only in the case where all
members of the current version space agree on the classification. Otherwise, the
system refuses to classify the new instance.
Inductive Bias: the target concept can be represented in its hypothesis space.
FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent
with the training examples. It then uses this hypothesis to classify all subsequent
instances.
Inductive Bias: the target concept can be represented in its hypothesis space, and all
instances are negative instances unless the opposite is entailed by its other knowledge.
147
Concept Learning - Summary
Concept learning can be seen as a problem of searching through a
large predefined space of potential hypotheses H.
The general-to-specific partial ordering of hypotheses provides a
useful structure for organizing the search through the hypothesis
space H.
The FIND-S algorithm utilizes this general-to-specific ordering,
performing a specific-to-general search through the hypothesis space
along one branch of the partial ordering, to find the most specific
hypothesis consistent with the training examples.
The CANDIDATE-ELIMINATION algorithm utilizes this general-
to-specific ordering to compute the version space (the set of all
hypotheses consistent with the training data) by incrementally
computing the sets of maximally specific (S) and maximally general
(G) hypotheses.
148
Concept Learning - Summary
Because the S and G sets delimit the entire set of hypotheses
consistent with the data, they provide the learner with a description of
its uncertainty regarding the exact identity of the target concept.
The CANDIDATE-ELIMINATION algorithm is not robust to noisy
data or to situations in which the unknown target concept is not
expressible in the provided hypothesis space.
Inductive learning algorithms are able to classify unseen examples
only because of their implicit inductive bias for selecting one
consistent hypothesis over another.
If the hypothesis space is enriched to the point where there is a
hypothesis corresponding to every possible subset of instances (the
power set of the instances), this will remove any inductive bias from
the CANDIDATE-ELIMINATION algorithm. Unfortunately, this also
removes the ability to classify any instance beyond the observed
training examples. An unbiased learner cannot make inductive leaps to
classify unseen examples. 149
CS-13410
Introduction to Machine Learning
(Supervised Learning – Decision Trees)
by
Mudasser Naseer
Supervised Learning
• Supervised learning is the machine learning task of learning a
  function that maps an input to an output based on example
  input-output pairs.
• It infers a function from labeled training data consisting of a set
  of training examples.
• Each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called
the supervisory signal).
(Wikipedia)
151
Supervised Learning – Detailed Definition
In supervised learning we have input variables (X) and an output
variable (Y) and
we use an algorithm to learn the mapping function from the input
to the output. Y = f(X)
The goal is to approximate the mapping function so well that, when you
have new input data (X), you can predict the output variables (Y) for
that data.
It is called supervised learning because the process of an
algorithm learning from the training dataset can be thought of as
a teacher supervising the learning process.
We know the correct answers, the algorithm iteratively makes
predictions on the training data and is corrected by the teacher.
Learning stops when the algorithm achieves an acceptable level
of performance.
152
Supervised Learning
Supervised learning problems can be further grouped into
regression and classification problems.
Classification: A classification problem is when the output
variable is a category, such as “red” or “blue” or “disease”
and “no disease”.
Regression: A regression problem is when the output
variable is a real value, such as “dollars” or “weight”.
153
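A minimal sketch contrasting the two problem types, assuming scikit-learn is available; the toy data is illustrative only:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0], [1], [2], [3]]                  # input variables (X)

# Classification: the output variable is a category.
clf = DecisionTreeClassifier().fit(X, ["red", "red", "blue", "blue"])
print(clf.predict([[1.5]]))               # -> a category label

# Regression: the output variable is a real value.
reg = DecisionTreeRegressor().fit(X, [10.0, 12.5, 20.0, 24.0])
print(reg.predict([[1.5]]))               # -> a real value
```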
Catching tax-evasion
[Table: training records with attributes Tid, Refund, Marital Status,
Taxable Income, and the class label Cheat.]
155
What is classification (cont…)
The target function f is known as a classification model
156
Examples of Classification Tasks
Predicting tumor cells as benign or malignant
157
General approach to classification
Training set consists of records with known class labels
158
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Training Set -> Learning algorithm (Induction) -> Learn Model -> Model
Model -> Apply Model (Deduction) -> labels for the Test Set
159
Evaluation of classification models
Counts of test records that are correctly (or incorrectly) predicted
by the classification model

Confusion matrix:
                          Predicted Class
                          Class = 1   Class = 0
  Actual    Class = 1     f11         f10
  Class     Class = 0     f01         f00
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
162
Decision Trees
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
163
Example of a Decision Tree
[Figure: training data (Tid, Refund, Marital Status, Taxable Income,
Cheat) and the decision tree model built from it; internal nodes test
the splitting attributes, leaves carry the class labels.]
164
Another Example of Decision Tree
[Figure: an alternative tree for the same training data, splitting
first on MarSt (Married vs. Single, Divorced) and then on Refund and
Taxable Income. More than one tree can fit the same data.]
165
Decision Tree Classification Task
[Figure: the induction/deduction flow of slide 159, with the model now
a decision tree: Training Set -> Tree Induction algorithm -> Decision
Tree; Decision Tree + Test Set -> Apply Model -> predicted labels.]
166
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and route the record through the tests:

Refund?
  Yes -> leaf NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 <= 80K -> leaf NO
                                 > 80K  -> leaf YES
           Married          -> leaf NO

Refund = No sends the record to the MarSt test; Marital Status =
Married sends it to the leaf NO. Assign Cheat to "No".
167-172
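The traversal above can be read as nested conditionals. A minimal sketch of this particular tree (attribute names as in the slides):

```python
def classify(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"                          # leaf NO
    if marital_status == "Married":
        return "No"                          # MarSt = Married: leaf NO
    # Single or Divorced: test Taxable Income
    return "Yes" if taxable_income > 80_000 else "No"

print(classify("No", "Married", 80_000))     # -> "No" (Assign Cheat to "No")
```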
Decision Tree Induction
Greedy strategy:
Split the records based on an attribute test that optimizes a certain
criterion.
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
174
ID3 Algorithm
ID3 (Iterative Dichotomiser 3) is an algorithm
invented by Ross Quinlan used to generate a decision
tree from a dataset.
C4.5 is its successor.
These algorithms employ a top-down, greedy search
through the space of possible decision trees.
175
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is selecting
which attribute should be tested at the root of the tree and
then at each node in the tree
177
Entropy
Entropy of a data set D is denoted by H(D).
  The Ci are the possible classes.
  pi = fraction of records from D that have class Ci.

    H(D) = − Σ_{i=1}^{m} p_i log2(p_i)

The range of entropy is from 0 to log2(m), where m is the number of
classes. The maximum value is attained when all the classes have equal
proportion.
178
Entropy Examples
Example:
10 records have class A
20 records have class B
30 records have class C
40 records have class D
Entropy = −[(.1 log2 .1) + (.2 log2 .2) + (.3 log2 .3) + (.4 log2 .4)]
Entropy = 1.846
179
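A minimal sketch that reproduces the calculation above; `entropy` takes the list of per-class record counts:

```python
import math

def entropy(counts):
    """H(D) = -sum_i p_i log2(p_i), with 0*log(0) taken as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([10, 20, 30, 40]))   # -> 1.846... (the slide's example)
print(entropy([25, 25, 25, 25]))   # -> 2.0 = log2(4), the maximum for m = 4
```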
Splitting Criterion
Example:
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
Records with A=0: 48+, 2-
Records with A=1: 2+, 48-
Records with B=0: 26+, 24-
Records with B=1: 24+, 26-
Splitting on A is better than splitting on B
A does a good job of separating +s and -s
B does a poor job of separating +s and -s
180
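A small sketch quantifying this with information gain (defined formally on the next slide); the counts are those of the example above:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = [50, 50]                              # 50 +s and 50 -s
print(info_gain(parent, [[48, 2], [2, 48]]))   # split on A -> about 0.76
print(info_gain(parent, [[26, 24], [24, 26]])) # split on B -> about 0.001
```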
Which Attribute is the Best Classifier?
Information Gain
The expected information needed to classify a tuple in D is

    Info(D) = − Σ_{i=1}^{m} p_i log2(p_i) = Entropy = H(D)

How much more information would we still need (after partitioning at
attribute A) to arrive at an exact classification? This amount is
measured by

    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j) = H(D, A)

    InfoGain(D, A) = H(D) − H(D, A)

In general, we write Gain(D, A), where D is the collection of examples
and A is an attribute.
181
Entropy (for two classes, i.e. m = 2)
[Figure: H(D) = −p log2(p) − (1−p) log2(1−p) plotted against p; H is 0
when p = 0 or p = 1 and maximal (1 bit) at p = 0.5.]
182
Information Gain
Gain of an attribute split: compare the impurity of the parent node
with the weighted average impurity of the child nodes:

    InfoGain(D, A) = H(D) − H(D, A)
                   = H(parent) − Σ_{j=1}^{v} (|D_j| / |D|) × H(D_j)

Maximizing the gain ⇔ minimizing the weighted average impurity measure
of the child nodes.
183
DECISION TREES
Which Attribute is the Best Classifier?: Information Gain
184
DECISION TREES
185
DECISION TREES
Which Attribute is the Best Classifier?: Information Gain
The information gain obtained by separating the examples according to
the attribute Wind is calculated as:

    Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) × Entropy(S_v)
                  = Entropy(S) − (8/14) Entropy(S_Weak) − (6/14) Entropy(S_Strong)
186
DECISION TREES
Which Attribute is the Best Classifier?: Information Gain
We calculate the Info Gain for each attribute and select
the attribute having the highest Info Gain
187
DECISION TREES
Example
189
DECISION TREES
Example
191
DECISION TREES
Example
192
DECISION TREES
From Decision Trees to Rules
Next Step: Make rules from the decision tree
Example
A B A∧ B
F F F
T F F
F T F
T T T
194
Lecture 07
CS-13410
Introduction to Machine Learning
(Decision Trees – Ch # 3 by Tom Mitchell)
by
Mudasser Naseer
Measuring Node Impurity
p(i|t): fraction of records associated with node t belonging to class i

    Entropy(t) = − Σ_{i=1}^{c} p(i|t) log2 p(i|t)
      • Used in ID3 and C4.5

    Gini(t) = 1 − Σ_{i=1}^{c} [p(i|t)]^2

    Error(t) = 1 − max_i p(i|t)      (used in the example below)
196
Example
Node counts: C1 = 0, C2 = 6
P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
Gini = 1 − (P(C1))^2 − (P(C2))^2 = 1 − 0 − 1 = 0
Entropy = − 0 log2(0) − 1 log2(1) = − 0 − 0 = 0
Error = 1 − max(0, 1) = 1 − 1 = 0
198
Splitting Based on GINI
When a node p is split into k partitions (children), the quality of
the split is computed as

    GINI_split = Σ_{i=1}^{k} (n_i / n) × GINI(i)

where n_i is the number of records at child i and n is the number of
records at node p.
199
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighing partitions:
  – Larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500
Split on B?  ->  Node N1 (Yes): C1 = 5, C2 = 2;  Node N2 (No): C1 = 1, C2 = 4

Gini(N1) = 1 − (5/7)^2 − (2/7)^2 = 0.408
Gini(N2) = 1 − (1/5)^2 − (4/5)^2 = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
This is the quality of the split for variable B.
200
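A small sketch reproducing these numbers:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    n = sum(sum(ch) for ch in children)
    return sum(sum(ch) / n * gini(ch) for ch in children)

print(gini([6, 6]))                   # parent: 0.500
print(gini([5, 2]), gini([1, 4]))     # N1: 0.408..., N2: 0.320
print(gini_split([[5, 2], [1, 4]]))   # quality of the split on B: 0.371...
```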
Categorical Attributes
201
Continuous Attributes
Use binary decisions based on one value, e.g. Taxable Income <= v.
Considering every value of every record is computationally
inefficient: it repeats work.

Continuous Attributes
For efficient computation, for each attribute:
  Sort the attribute on values
  Linearly scan these values, each time updating the count matrix and
  computing the impurity
  Choose the split position that has the least impurity

[Table: class counts (Yes/No) on either side of each candidate split
of Taxable Income, with the resulting Gini values
0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420;
the best split (Gini = 0.300) falls between 95K and 100K.]
203
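A minimal sketch of the sort-and-scan procedure. The income values and Cheat labels below are assumptions chosen to be consistent with the Gini row above; the best candidate split lands at Gini = 0.300:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    data = sorted(zip(values, labels))
    classes = sorted(set(labels))
    below = {c: 0 for c in classes}
    above = {c: labels.count(c) for c in classes}
    n, best = len(data), (float("inf"), None)
    for i in range(len(data) - 1):
        v, c = data[i]
        below[c] += 1; above[c] -= 1              # update the count matrix
        split = (v + data[i + 1][0]) / 2          # midpoint candidate position
        k = i + 1
        g = (k / n) * gini(list(below.values())) \
            + ((n - k) / n) * gini(list(above.values()))
        best = min(best, (g, split))
    return best                                   # (least impurity, split value)

# Hypothetical income values with Cheat labels:
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(incomes, cheat))                 # -> about (0.300, 97.5)
```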
Splitting based on impurity
Impurity measures favor attributes with
large number of categories
204
Gain Ratio
The information gain measure tends to prefer attributes with large
numbers of possible categories.
Gain ratio: a modification of the information gain that reduces its
bias toward high-branch attributes.
Gain ratio should be
  Large when data is evenly spread
  Small when all data belong to one branch
Gain ratio takes the number and size of branches into account when
choosing an attribute.
It corrects the information gain by taking the intrinsic information
of a split into account:

    SplitInfo_A(D) = − Σ_{i=1}^{v} (|D_i| / |D|) × log2(|D_i| / |D|)

(or with S in place of D)

    GainRatio(D, A) = Gain(D, A) / SplitInfo_A(D)
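A small sketch of the gain ratio, shown on an ID-code-like attribute that puts every record in its own branch (a hypothetical 9+/5- data set):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_info(children):
    n = sum(sum(ch) for ch in children)
    return -sum(sum(ch) / n * math.log2(sum(ch) / n)
                for ch in children if sum(ch))

def gain_ratio(parent, children):
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    return gain / split_info(children)

parent = [9, 5]
ids = [[1, 0]] * 9 + [[0, 1]] * 5       # one branch per record
print(gain_ratio(parent, ids))          # gain is maximal (0.940), but
                                        # SplitINFO = log2(14) shrinks it to ~0.247
```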
Gain Ratio
Adjusts the information gain by the entropy of the partitioning
(SplitINFO). Higher-entropy partitioning (a large number of small
partitions) is penalized!
Used in C4.5
Designed to overcome the disadvantage of information gain
Example (Play tennis):
[Figure: gain ratio worked out on the Play-tennis attributes.]
More on the gain ratio
“Outlook” still comes out top
However: “ID code” has greater gain ratio
Standard fix: In particular applications we can
use an ad hoc test to prevent splitting on that
type of attribute
Problem with gain ratio: it may overcompensate
May choose an attribute just because its intrinsic
information is very low
Standard fix:
• First, only consider attributes with greater than average
information gain
• Then, compare them on gain ratio
207
Comparing Attribute Selection Measures
The three measures, in general, return good
results but
Information Gain
Biased towards multivalued attributes
Gain Ratio
Tends to prefer unbalanced splits in which one
partition is much smaller than the other
Gini Index
Biased towards multivalued attributes
Has difficulties when the number of classes is
large
Tends to favor tests that result in equal-sized
partitions and purity in both partitions
208
Stopping Criteria for Tree Induction
209
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets
210
Example: C4.5
Simple depth-first construction.
Uses Information Gain
Sorts Continuous Attributes at each
node.
Needs entire data to fit in memory.
Unsuitable for large datasets: needs out-of-core sorting.
Evaluation
212
Underfitting and Overfitting
[Figure: training and test error versus model complexity.]
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, it models the details of the
training set and fails on the test set.
213
Overfitting due to Noise
215
How to Address Overfitting:
Tree Pruning
Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features
  (e.g., using the χ2 test)
• Stop if expanding the current node does not improve impurity measures (e.g.,
  Gini or information gain) or the improvement falls below a threshold value.
Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the subset tuples.
Problem: difficult to choose an appropriate threshold.
216
How to Address Overfitting…
Post-pruning
Grow decision tree to its entirety
Trim the nodes of the decision tree in a
bottom-up fashion
If generalization error improves after
trimming, replace sub-tree by a leaf node.
Class label of leaf node is determined from
majority class of instances in the sub-tree
217
Prune the Tree OR Prune the Rule
In order to reduce the complexity of the decision procedure we have
two options:
(i) either we can prune the tree first and then develop the rules, or
(ii) we can develop the rules and then prune the rules.
Which is better?
(ii) is better. Why? Each rule can be pruned independently of the
others, whereas removing a node from the tree affects every rule that
passes through it.
218
CS-13410
Introduction to Machine Learning
Lecture # 08
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

a: TP (true positive)     b: FN (false negative)
c: FP (false positive)    d: TN (true negative)

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
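A minimal sketch of the accuracy computation; the counts are hypothetical:

```python
def accuracy(tp, fn, fp, tn):
    """(a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=50, fn=10, fp=5, tn=35))   # -> 0.85
```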
@data
Sunny, 85, 85, FALSE, no
Sunny, 80, 90, TRUE, no
Overcast, 83, 86, FALSE, yes
Rainy, 70, 96, FALSE, yes
...
Numeric attribute and Missing Value
@relation weather
@data
Sunny, 85, 85, FALSE, no
Sunny, 80, 90, TRUE, no
Overcast, 83, 86, FALSE, ?
Rainy, 70, 96, ?, yes
...
Explorer: building “classifiers”
Classifiers in WEKA are models for
predicting nominal or numeric quantities
Implemented learning schemes include:
Decision trees and lists, instance-based
classifiers, support vector machines, multi-
layer perceptrons, logistic regression, Bayes’
nets, …
Explorer: clustering data
WEKA contains “clusterers” for finding
groups of similar instances in a dataset
Implemented schemes are:
k-Means, EM, Cobweb, X-means,
FarthestFirst
Clusters can be visualized
Evaluation based on loglikelihood if
clustering scheme produces a probability
distribution
Explorer: finding associations
WEKA contains an implementation of the Apriori algorithm for learning
association rules
Works only with discrete data
Can identify statistical dependencies between groups of attributes:
  milk, butter -> bread, eggs (with confidence 0.9)
Apriori can compute all rules that have a given minimum support and
exceed a given confidence
Explorer: attribute selection
Panel that can be used to investigate
which (subsets of) attributes are the most
predictive ones
Attribute selection methods contain two
parts:
A search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
Very flexible: WEKA allows (almost)
arbitrary combinations of these two
Explorer: data visualization
Visualization very useful in practice: e.g.
helps to determine difficulty of the learning
problem
WEKA can visualize single attributes (1-d)
and pairs of attributes (2-d)
To do: rotating 3-d visualizations (Xgobi-style)
Color-coded class values
“Jitter” option to deal with nominal attributes
(and to detect “hidden” data points)
Performing experiments
Experimenter makes it easy to compare the performance of different
learning schemes
Evaluation options: cross-validation, learning curve
Resources:
WEKA is available at http://www.cs.waikato.ac.nz/ml/weka
Tutorial: http://prdownloads.sourceforge.net/weka/weka.ppt
CS-13410
Introduction to Machine Learning
(k-Nearest Neighbors)
Lecture # 10
by
Mudasser Naseer
WHY NEAREST NEIGHBOR?
Used to classify objects based on closest
training examples in the feature space
Feature space: raw data transformed into sample
vectors of fixed length using feature extraction
(Training Data)
Listed among the top 10 data mining algorithms (ICDM paper, December 2007)
Among the simplest of all data mining algorithms
A classification method
An implementation of a lazy learner
320
k NEAREST NEIGHBOR
[Figure: a query point "?" surrounded by square and triangle training
points.]
k = 1: belongs to square class
k = 3: belongs to triangle class
k = 7: belongs to square class
322
Distance formula
323
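A minimal sketch of the Euclidean distance used throughout this section:

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) over equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean([1, 2], [4, 6]))   # -> 5.0
```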
Incomparable ranges
324
Example revisited
325
Non-numeric data
Feature values are not always numbers
Example
Boolean values: Yes or no, presence or
absence of an attribute
Categories: Colors, educational attainment,
gender
How do these values factor into the
computation of distance?
326
Dealing with non-numeric data
Boolean values => convert to 0 or 1
Applies to yes-no/presence-absence attributes
Non-binary characterizations
Use natural progression when applicable; e.g.,
educational attainment: JS, HS, College, MS, PHD
=> 1,2,3,4,5
Assign arbitrary numbers but be careful about
distances; e.g., color: red, yellow, blue => 1,2,3
How about unavailable data?
(0 value not always the answer)
327
Preprocessing your dataset
Dataset may need to be preprocessed
to ensure more reliable data mining
results
Conversion of non-numeric data to
numeric data
Calibration of numeric data to reduce
effects of disparate ranges
Particularly when using the Euclidean
distance metric
328
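A small sketch of min-max rescaling, one common way to calibrate disparate ranges (the exact calibration method is an assumption here); the example values are illustrative:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale a numeric attribute onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

heights = [48, 60, 72]                 # inches: a range of about 2 feet
incomes = [10_000, 250_000, 500_000]   # dollars: a range of about $490k
print(min_max(heights))                # -> [0.0, 0.5, 1.0]
print(min_max(incomes))                # both attributes now weigh equally
```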
k-NN variations
Value of k
Larger k increases confidence in prediction
Note that if k is too large, decision may be
skewed
Weighted evaluation of nearest neighbors
Plain majority may unfairly skew decision
Revise algorithm so that closer neighbors have
greater “vote weight”
Other distance measures
329
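A minimal sketch combining these variations: k as a parameter and inverse-distance vote weights (one of several possible weighting schemes); the training points are illustrative:

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=3, weighted=True):
    """train: list of (feature_vector, label) pairs."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

train = [((0, 0), "square"), ((1, 0), "square"), ((0, 1), "triangle"),
         ((2, 2), "triangle"), ((3, 3), "triangle")]
print(knn_predict(train, (0.4, 0.4), k=3))   # closer squares outvote the triangle
```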
Types of Attributes
There are different types of attributes
Nominal (Categorical)
• Examples: ID numbers (Student-ID, Customer-ID), eye color,
zip codes
• Binary Attribute is a type of Nominal Attribute with only two
categories
Ordinal (meaningful order or ranking among them)
• Examples: rankings (e.g., taste of potato chips on a scale from
1-10), grades, height in {tall, medium, short}
Numeric
• Interval (measured on a scale of equal-size units)
– Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio-Scaled (with an inherent zero-point)
– Examples: temperature in Kelvin, length, time, counts 330
Properties of Attribute Values
The type of an attribute depends on which of the following properties
it possesses:
  Distinctness:    =  ≠
  Order:           <  >
  Addition:        +  −
  Multiplication:  *  /

Nominal: the values are just different names, i.e., nominal attributes
provide only enough information to distinguish one object from another
(=, ≠).
  Examples: zip codes, employee ID numbers, eye color, gender: {male, female}
  Statistics: mode, entropy, contingency correlation, χ2 test

Ordinal: the values provide enough information to order objects (<, >).
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Statistics: median, percentiles, rank correlation, run tests, sign tests

Interval: the differences between values are meaningful, i.e., a unit
of measurement exists (+, −).
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Statistics: mean, standard deviation, Pearson's correlation, t and F tests

Ratio: both differences and ratios are meaningful (*, /).
  Examples: temperature in Kelvin, monetary quantities, counts, age,
  mass, length, electrical current
  Statistics: geometric mean, harmonic mean, percent variation
332
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and
represented using a finite number of digits.
Continuous attributes are typically represented as floating-
point variables. 334
Non-Euclidean Distances
Proximity Measure for Nominal Attributes
Jaccard measure for binary vectors
Distance measure for Ordinal variables
Cosine measure = angle between vectors from
the origin to the points in question.
Dissimilarity Measure for Mixed type attributes.
Edit distance = number of inserts and deletes to
change one string into another.
335
Proximity Measure for Nominal Attributes
The dissimilarity between two objects i and j having nominal
attributes can be computed based on the ratio of mismatches:

    d(i, j) = (p − m) / p

where m is the number of matches and p is the total number of
attributes, and

    sim(i, j) = 1 − d(i, j) = m / p
Example:
336
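Since the slide's worked example is not reproduced here, a minimal sketch with hypothetical objects:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where m = matches and p = total attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

i = ("code-A", "red")
j = ("code-B", "red")
print(nominal_dissimilarity(i, j))        # -> 0.5
print(1 - nominal_dissimilarity(i, j))    # sim(i, j) = m / p = 0.5
```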
Jaccard Measure
A note about Binary variables first
Symmetric binary variable
• If both states are equally valuable and carry the same weight, that is,
there is no preference on which outcome should be coded as 0 or 1.
• Like “gender” having the states male and female
Asymmetric binary variable
• If the outcomes of the states are not equally important, such as the
positive and negative outcomes of a disease test.
• We should code the rarest one by 1 (e.g., HIV positive), and the other
by 0 (HIV negative).
• Given two asymmetric binary variables, the agreement of two 1s (a
positive match) is then considered more important than that of two 0s
(a negative match).
337
Jaccard Measure
A contingency table for binary data:

                      Object j
                      1       0       sum
  Object i     1      a       b       a + b
               0      c       d       c + d
               sum    a + c   b + d   p

Jaccard coefficient (similarity for asymmetric binary data):
sim(i, j) = a / (a + b + c)
Lecture # 11
by
Mudasser Naseer
Distance for Ordinal variables
The rank of the ordinal variable f for the ith object is r_if, where
variable f has M_f ordered states: r_if ∈ {1, …, M_f}.
Since each ordinal variable can have a different number of states, map
the range of each variable onto [0.0, 1.0] so that each variable has
equal weight. For each value r_if of ordinal variable f, replace it by

    z_if = (r_if − 1) / (M_f − 1)
342
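A minimal sketch of this mapping, using three hypothetical ordered states:

```python
def ordinal_to_numeric(value, states):
    """z_if = (r_if - 1) / (M_f - 1), mapping ranks onto [0.0, 1.0]."""
    r = states.index(value) + 1            # rank, 1..M_f
    return (r - 1) / (len(states) - 1)

states = ["fair", "good", "excellent"]     # M_f = 3 ordered states
print([ordinal_to_numeric(v, states) for v in states])   # -> [0.0, 0.5, 1.0]
```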
Dissimilarity for attributes of Mixed Type
Suppose the data set contains p attributes of mixed type. The
dissimilarity between objects i and j is

    d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)

where the indicator δ_ij^(f) is 1 when attribute f is present for both
objects, and 0 otherwise.
344
Attributes of Mixed Type - Example
Consider the data in the table (one nominal, one ordinal, and one
numeric attribute over four objects).
The dissimilarity matrices for the nominal and ordinal data, computed
using the methods discussed before, are:

    d^(1) (nominal):           d^(2) (ordinal):
    0                          0
    1    0                     1.0   0
    1    1    0                0.5   0.5   0
    0    1    1    0           0     1.0   0.5   0

To compute the dissimilarity matrix for the numeric attribute,
max_h x_h = 64 and min_h x_h = 22. Using the formula from the previous
slide, the dissimilarity matrix is obtained as:

    d^(3) (numeric):
    0
    0.55  0
    0.45  1.00  0
    0.40  0.14  0.86  0
345
Attributes of Mixed Type - Example
The three dissimilarity matrices can now be used to compute the
overall dissimilarity between two objects using the equation

    d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)

The indicator δ_ij^(f) = 1 for each of the three attributes f. We get:

    d:
    0
    0.85  0
    0.65  0.83  0
    0.13  0.71  0.79  0
346
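A minimal sketch reproducing one entry of the overall matrix, d(2, 1) = 0.85, from the three per-attribute dissimilarities:

```python
def mixed_dissimilarity(d_per_attr, indicators):
    """d(i, j) = sum_f delta^(f) d^(f) / sum_f delta^(f)."""
    num = sum(delta * d for delta, d in zip(indicators, d_per_attr))
    return num / sum(indicators)

# Per-attribute dissimilarities between objects 2 and 1
# (nominal, ordinal, numeric), all three attributes present (delta = 1):
print(mixed_dissimilarity([1.0, 1.0, 0.55], [1, 1, 1]))   # -> 0.85
```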
Cosine Measure (similarity)
Think of a point as a vector from the origin (0, 0, …, 0) to its
location. Two points' vectors make an angle, whose cosine is the
normalized dot-product of the vectors. Where x and y are column
vectors of length n:

    sim(x, y) = (x^T · y) / (‖x‖ ‖y‖)
              = Σ_{i=1}^{n} x_i y_i / ( sqrt(Σ_{i=1}^{n} x_i^2) × sqrt(Σ_{i=1}^{n} y_i^2) )

Example:
[Figure: a worked example finding the cosine similarity between two
example vectors x and y.]
347
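A minimal sketch; the example vectors here are illustrative assumptions:

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (2, 4, 5, 8)
y = (5, 2, 8, 7)
print(round(cosine_similarity(x, y), 3))   # -> 0.916
```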
Cosine Measure (similarity)
Cosine similarity is a measure of similarity that can be used to
compare documents. A document can be represented by thousands of
attributes, each recording the frequency of a particular word (such as
keywords) or phrase in the document, called a term-frequency vector.

Edit distance via the longest common subsequence (LCS): for strings x
(|x| = 5) and y (|y| = 6) with LCS(x, y) = bcde,

    D(x, y) = |x| + |y| − 2|LCS(x, y)| = 5 + 6 − 2 × 4 = 3
350
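A minimal sketch of the LCS-based distance; the strings are hypothetical ones whose LCS is "bcde", matching the arithmetic above:

```python
from functools import lru_cache

def lcs_distance(x, y):
    """D(x, y) = |x| + |y| - 2|LCS(x, y)|."""
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == len(x) or j == len(y):
            return 0
        if x[i] == y[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return len(x) + len(y) - 2 * lcs(0, 0)

print(lcs_distance("abcde", "bcdefg"))   # -> 5 + 6 - 2*4 = 3
```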
k-NN Time Complexity
Suppose there are m instances and n
features in the dataset
Nearest neighbor algorithm requires
computing m distances
Each distance computation involves
scanning through each feature value
Running time complexity is proportional to
m×n
351
k NEAREST NEIGHBOR
Accuracy of all NN-based classification, prediction, or
recommendation depends on the data model, no matter what specific NN
algorithm is used.
Scaling issues
  Attributes may have to be scaled to prevent distance measures from
  being dominated by one of the attributes.
  Examples
  • Height of a person may vary from 4' to 6'
  • Weight of a person may vary from 100 lbs to 300 lbs
  • Income of a person may vary from $10k to $500k
Nearest-neighbor classifiers are lazy learners:
  no pre-constructed models for classification
352
k NEAREST NEIGHBOR ADVANTAGES
Simple technique that is easily implemented
Building model is inexpensive
Extremely flexible classification scheme
does not involve preprocessing
Well suited for
Multi-modal classes (classes of multiple forms)
Records with multiple class labels
Asymptotic error rate at most twice the Bayes rate
  (Cover & Hart, 1967)
Can sometimes be the best method
  Michihiro Kuramochi and George Karypis, Gene Classification using
  Expression Profiles: A Feasibility Study, International Journal on
  Artificial Intelligence Tools, Vol. 14, No. 4, pp. 641-660, 2005
  k nearest neighbor outperformed SVM for protein function prediction
  using expression profiles
353
k NEAREST NEIGHBOR DISADVANTAGES
Classifying unknown records is relatively expensive
  Requires distance computation to the k nearest neighbors
  Computationally intensive, especially when the size of the training
  set grows
Accuracy can be severely degraded by the presence of noisy or
irrelevant features
NN classification expects the class conditional probability to be
locally constant
  and suffers from bias in high dimensions
354
How to select the optimal k value?
There are no pre-defined statistical methods to find the most
favorable value of k.
  Initialize a random k value and start computing.
  Choosing a small value of k leads to unstable decision boundaries.
  A substantial k value is better for classification, as it smooths
  the decision boundaries.
  Plot the error rate against k over a defined range, then choose the
  k value that has the minimum error rate.
Another simple approach is to set k = sqrt(n), where n is the number
of training examples.
355
Lecture 12
CS-13410
Introduction to Machine Learning
(Artificial Neural Networks – Ch # 4 by Tom
Mitchell)
by
Mudasser Naseer
• Brain
  • A marvelous piece of architecture and design.
  • In association with a nervous system, it controls the life
    patterns, communications, interactions, growth and development of
    hundreds of millions of life forms.
There are about 10^10 to 10^14 nerve cells (called neurons) in an
adult human brain. Neurons are highly connected with each other: each
nerve cell is connected to hundreds of thousands of other nerve cells.
Passage of information between neurons is slow (in comparison to
transistors in an IC). It takes place in the form of electrochemical
signals between two neurons in milliseconds.
Energy consumption per neuron is low (approximately 10^-6 watts).
Look more like some blobs of ink… aren't they!
[Figure: a biological neuron.]
The cell nucleus, or soma, processes the information received from
dendrites. The axon is a cable that is used by neurons to send
information. The synapse is the connection between an axon of one
neuron and the dendrites of another.
[Figure: a perceptron with inputs x1 … xn, weights w1 … wn, and a bias
input x0 = 1 with weight w0, feeding the weighted sum Σ_{i=0}^{n} w_i x_i.]

    O(x1, …, xn) = 1  if Σ_{i=0}^{n} w_i x_i > 0
                 = −1 otherwise
363
A Neuron (= a perceptron)
[Figure: inputs x0 … xn with weights w0 … wn are summed, a bias θ_j is
applied, and an activation function f maps the weighted sum to the
output y.]
366
Perceptron Training
A single perceptron can be used to represent many boolean functions.

    Output = 1 if Σ_{i=1}^{n} w_i x_i > t
           = 0 otherwise      (where t ≠ 0)

A linear threshold is used.
  w – weight value
  t – threshold value
367
Simple network
[Figure: a two-input network computing AND: inputs X and Y with
weights W1 = 1 and W2 = 1, a bias input of −1 with weight W0 = 1.5,
and threshold t = 0.0.]

    Output = 1 if Σ_{i=0}^{n} w_i x_i > t
           = 0 otherwise
368
OR Function Using A Perceptron
369
XOR Function Using A Perceptron
Notice that the fourth equation contradicts the second and the third
equations. The point is, there are no perceptron solutions for
non-linearly separable data. The key takeaway is that a single
perceptron cannot learn to separate data that are non-linear in nature.
370
Training Perceptron
[Figure: a perceptron with bias input −1 (weight W0 = ?), inputs x
(weight W1 = ?) and y (weight W2 = ?), and threshold t = 0.0.]

For AND:
  A B | Output
  0 0 |   0
  0 1 |   0
  1 0 |   0
  1 1 |   1
371
Training Perceptron
[Figure: the same network with initial weights W0 = 0.3, W1 = 0.5,
W2 = −0.4, and threshold t = 0.0.]

For AND:
  A B | Output
  0 0 |   0
  0 1 |   0
  1 0 |   0
  1 1 |   1

For practice, find the final values of w1 and w2 (see the sketch
below) using

    w_i ← w_i + Δw_i,   Δw_i = η (t − o) x_i,   taking η = 0.1

Also develop perceptrons for OR and NOT.
372
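A minimal training sketch for the AND exercise, starting from the slide's initial weights; x0 = −1 is the bias input:

```python
# Perceptron rule: w_i <- w_i + eta * (t - o) * x_i
def output(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# (x0, A, B) -> target, with x0 = -1 as the bias input
data = [((-1, 0, 0), 0), ((-1, 0, 1), 0), ((-1, 1, 0), 0), ((-1, 1, 1), 1)]
w, eta = [0.3, 0.5, -0.4], 0.1

for epoch in range(100):
    errors = 0
    for x, t in data:
        o = output(w, x)
        if o != t:
            errors += 1
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    if errors == 0:                 # an epoch with no errors: converged
        break
print(w)                            # final weights implementing AND
```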
Learning algorithm
Epoch : Presentation of the entire training set to the neural
network.
In the case of AND function an epoch consists of
four sets of inputs being presented to the network
(i.e. [0,0], [0,1], [1,0], [1,1])
Error: The error value is the amount by which the value
output by the network differs from the target
value. For example, if we required the network to
output 0 and it output a 1, then Error = -1
373
Learning algorithm
Target Value, T : When we are training a network we
not only present it with the input but also with a value
that we require the network to produce. For example,
if we present the network with [1,1] for the AND
function the training value will be 1
Output , O : The output value from the neuron
Ij : Inputs being presented to the neuron
Wj : Weight from input neuron (Ij) to the output neuron
LR : The learning rate. This dictates how quickly the
network converges. It is set by a matter of
experimentation. It is typically 0.1
374
Decision boundaries
375
Linear Separability
[Figure: points of classes A and B in the (X1, X2) plane, separated by
a linear decision boundary.]
376
Decision Surface of a Perceptron
[Figure: in the (x1, x2) plane, a linearly separable arrangement of +
and − points (separable by a single line) versus a non-linearly
separable, XOR-like arrangement.]
378
Different Non-Linearly Separable Problems
[Table: network structure vs. the types of decision regions it can
form, illustrated on the exclusive-OR problem, classes with meshed
regions, and the most general region shapes. A three-layer network can
form arbitrary decision regions, with complexity limited by the number
of nodes.]
379
Multilayer Perceptron (MLP)
[Figure: a feed-forward network with input units X1 … Xn, hidden units
Z1 … Zp, and output units Y1 … Ym producing the output values.
Adjustable weights w_ij connect input unit X_i to hidden unit Z_j, and
weights v_jk connect hidden unit Z_j to output unit Y_k; bias units
(constant 1) feed the hidden and output layers.]
Activation functions
• Transforms neuron’s input into output.
• Features of activation functions:
• A squashing effect is required
• Prevents accelerating growth of activation
levels through the network.
• Simple and easy to calculate
387
Standard activation functions
• The hard-limiting threshold function
– Corresponds to the biological paradigm
• either fires or not
There will be this sudden change in the decision (from 0 to 1) when z value
crosses the threshold (-w0). For most real-world applications we would
expect a smoother decision function which gradually changes from 0 to 1.
Standard activation functions
• Sigmoid functions: the output function is much smoother than the
  step function
  – A mathematical function with a characteristic 'S'-shaped curve,
    also called the sigmoid curve.
  – There are many functions that can do the job, some are shown below:
    [Figure: examples of sigmoid-shaped functions.]
389
Sigmoid function
• One of the simplest to work with is the logistic function.
392
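A minimal sketch of the logistic function and its gradual 0-to-1 transition:

```python
import math

def logistic(z):
    """sigma(z) = 1 / (1 + e^(-z)): a smooth, squashing 'S'-shaped function."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-6, -2, 0, 2, 6):
    print(z, round(logistic(z), 3))   # changes gradually from ~0 to ~1 around z = 0
```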
Back-Propagation
• A training procedure which allows multi-layer
feedforward Neural Networks to be trained;
• Can theoretically perform “any” input-output
mapping;
• Can learn to solve linearly inseparable
problems.
393
Activation functions and training
394
Lecture 13
CS-13410
Introduction to Machine Learning
(Artificial Neural Networks-
Backpropagation)
Basics of an Artificial Neural
Network
An artificial neural network is an information
processing system that has certain performance
characteristics in common with biological neural
networks.
An ANN can be characterized by:
1. Architecture: The pattern of connections between
different neurons.
2. Training or Learning Algorithms: The method of
determining weights on the connections.
3. Activation Function: The nature of function used by a
neuron to become activated.
Back-Propagation
• A training procedure which allows multi-layer
feedforward Neural Networks to be trained;
• Can theoretically perform “any” input-output
mapping;
• Can learn to solve linearly inseparable
problems.
397
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop
and test computational analogues of neurons
A neural network: A set of connected input/output units
where each connection has a weight associated with it
During the learning phase, the network learns by
adjusting the weights so as to be able to predict the
correct class label of the input tuples
Also referred to as connectionist learning due to the
connections between units
398
A Multi-Layer Feed-Forward Neural Network
404
Example (Cont…)
405
Terminating condition:
Training stops when
All Δwij in the previous epoch are so small as
to be below some specified threshold, or
The percentage of tuples misclassified in the
previous epoch is below some threshold, or
A prespecified number of epochs has expired.
In practice, several hundreds of thousands of
epochs may be required before the weights
will converge.
406
Backpropagation and Interpretability
Efficiency of backpropagation: Each epoch (one iteration through the
training set) takes O(|D| * w), with |D| tuples and w weights, but # of
epochs can be exponential in n, the number of inputs, in the worst
case
Rule extraction from networks: network pruning
Simplify the network structure by removing weighted links that
have the least effect on the trained network
Then perform link, unit, or activation value clustering
The set of input and activation values are studied to derive rules
describing the relationship between the input and hidden unit
layers
Sensitivity analysis: assess the impact that a given input variable has
on a network output. The knowledge gained from this analysis can be
represented in rules
407
Neural Network as a Classifier
Weakness
Long training time
Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure."
Poor interpretability: Difficult to interpret the symbolic meaning
behind the learned weights and of “hidden units" in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of
rules from trained neural networks
408
Lecture 13
CS-13410
Introduction to Machine Learning
(Gradient Descent and ANN-
Backpropagation)
Gradient Descent Method
[Figure: a one-dimensional error curve over x, showing an inflection
point, a local minimum, and the global minimum.]
Gradient Descent Learning Rule
Consider a linear unit without threshold and with continuous output o
(not just −1, 1):

    o = w0 + w1 x1 + … + wn xn

Train the wi's such that they minimize the squared error

    E[w1, …, wn] = ½ Σ_{d∈D} (t_d − o_d)^2

where D is the set of training examples.
412
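A minimal sketch of batch gradient descent for such a linear unit, run on the four training examples D listed on the next slide; for this XOR-like data the best linear fit is the constant 0, so the weights shrink toward zero:

```python
# Gradient descent on E = 1/2 * sum_d (t_d - o_d)^2 for o = w0 + w1*x1 + w2*x2.
D = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
w = [0.2, -0.3, 0.5]          # arbitrary starting weights (w0, w1, w2)
eta = 0.05

for step in range(200):
    grad = [0.0, 0.0, 0.0]
    for (x1, x2), t in D:
        o = w[0] + w[1] * x1 + w[2] * x2
        err = t - o
        # dE/dw_i = -sum_d (t_d - o_d) * x_i, with x_0 = 1 for the bias
        for i, xi in enumerate((1, x1, x2)):
            grad[i] += -err * xi
    w = [wi - eta * gi for wi, gi in zip(w, grad)]
print(w)                       # approaches the least-squares weights (~0, ~0, ~0)
```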
Gradient Descent (used in Delta Rule)
Training examples:
    D = {<(1, 1), 1>, <(−1, −1), 1>, <(1, −1), −1>, <(−1, 1), −1>}
[Figure: the error surface E over the weight space (w1, w2).]
Gradient:
    ∇E[w] = [∂E/∂w0, …, ∂E/∂wn]
415
Gradient Descent Algorithm and
Its Variants
416
Comparison of the Perceptron and Gradient Descent Rules
The perceptron learning rule is guaranteed to succeed if
  the training examples are linearly separable, and
  the learning rate is sufficiently small.
417
Expressive Capabilities of ANN
Boolean functions
Every boolean function can be represented by
network with single hidden layer
But might require exponential (in number of inputs)
hidden units
Continuous functions
Every bounded continuous function can be
approximated with arbitrarily small error, by network
with one hidden layer [Cybenko 1989, Hornik 1989]
Any function can be approximated to arbitrary
accuracy by a network with two hidden layers
[Cybenko 1988]
418
Applications
• The properties of neural networks define
where they are useful.
– Can learn complex mappings from inputs to
outputs, based solely on samples
– Difficult to analyse: firm predictions about neural network
  behaviour are difficult;
  • Unsuitable for safety-critical applications.
– Require limited understanding from trainer, who
can be guided by heuristics.
419
Neural network for OCR
[Figure: a feedforward network trained using back-propagation, with an
input layer of pixel values, a hidden layer, and an output layer with
one unit per character (A, B, C, D, E).]
420
OCR for 8x10 characters
[Figure: characters drawn on an 8 x 10 pixel grid.]
NNs are able to generalise
  learning involves generating a partitioning of the input space
  for a single-layer network the input space must be linearly separable
  what is the dimension of this input space?
  how many points are in the input space?
  this network is binary (uses binary values)
  networks may also be continuous
421
Engine management
422
ALVINN
Drives 70 mph on a public highway
[Figure: the ALVINN network: a 30x32 pixel image as input, 4 hidden
units (30x32 weights into each hidden unit), and 30 output units for
steering.]
423
Signature recognition
424
Sonar target recognition
425
Stock market prediction
426
Mortgage assessment
427
Neural Network Problems
428
Parameter setting
• Number of layers
• Number of neurons
  • too many neurons require more training time
• Learning rate
  • from experience, the value should be small, ~0.1
• Momentum term
429
Over-fitting
430
Training time
431
Literature & Resources
A very good book:
”Neural Networks for Pattern Recognition”, Bishop, C.M.,
1996
Software:
Neural Networks for Face Recognition
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html
SNNS Stuttgart Neural Networks Simulator
http://www-ra.informatik.uni-tuebingen.de/SNNS
Neural Networks at your fingertips
http://www.geocities.com/CapeCanaveral/1624/
http://www.stats.gla.ac.uk/~ernest/files/NeuralAppl.html
432