All Into One ML
4
Learning outcomes
On completion of the course students will be expected to:
Have a good understanding of the fundamental issues
and challenges of machine learning: data collection and
pre-processing, model selection, model complexity, etc.
Have an understanding of the strengths and weaknesses
of many popular machine learning approaches.
Appreciate the underlying mathematical relationships
within and across machine learning algorithms and the
paradigms of supervised and unsupervised learning.
Be able to design and implement various machine
learning algorithms for a range of real-world applications.
5
Resources
Lecture slides
Textbooks:
Machine Learning by Tom M. Mitchell
Introduction to Machine Learning by Ethem Alpaydin, 2nd Edition
Reference book:
An Introduction to Machine Learning by Miroslav Kubat, 2nd Edition
6
Administrative details
Assessment:
Final numeric score will be a weighted combination of the course components (weighting table omitted).
8
Collaboration on Assignments, etc.
Preparing assignments/homework independently is a key
ingredient for understanding the material (and,
consequently, passing the exam :-). So it is highly
recommended that you make a serious effort to solve the
problems on your own.
You may discuss, consult, and collaborate with other students
on the problem sets, but your solutions must be written up
independently, by you.
You are welcome to consult online and offline sources for
your solutions, but you are expected to (a) cite your
references clearly, and (b) produce a write-up of your own.
Recall that Google (or any alternative search engine) is a
double-edged sword.
9
Collaboration on Assignments, etc.
Detected cases of plagiarism will be dealt with severely,
for example by reducing your grade for the whole course
(not just the relevant assignment), canceling the
assignment, and/or reporting the incident to the
appropriate university authority.
If X is suspected of having copied from Y, both will be
regarded as cheaters.
10
Course Structure
We will start from the basic concepts of machine
learning and move towards advanced topics, using
multiple examples from each topic.
This should expose you to some of the
beautiful topics and ideas in the field.
Naturally, we will get into some of the topics in
great depth, together with their corresponding applications.
Each topic will be accompanied by some real-world
examples (given by me) and tasks (to be done by you).
11
Course overview
What is this course, Introduction to Machine Learning, about?
It covers basic concepts of ML, focusing in particular on the core
concepts of supervised and unsupervised learning.
In supervised learning we will discuss algorithms which are
trained on input data labelled with a desired output.
Unsupervised learning aims to discover latent structure in an
input signal where no output labels are available.
Students will learn the algorithms which underpin many popular
machine learning techniques, as well as develop an
understanding of the theoretical relationships between these
algorithms.
12
How to succeed?
Put the effort in
Make good use of the available resources
Attend lectures
Read books and online materials
Lots of practice
ASK!!!
Aim to fully understand new concepts
13
HOW TO FAIL?
Relax too much
Don't assimilate the material given in lectures
Skip lectures
Don't seek help
"Borrow" from "others": they can't help you in the learning, the exam, or the class tests!!!!
14
Why “Learn” ?
Machine learning is programming computers to optimize
a performance criterion using example data or past
experience.
There is no need to “learn” to calculate payroll
Learning is used when:
Human expertise does not exist (navigating on Mars),
Humans are unable to explain their expertise (speech
recognition)
Solution changes in time (routing on a computer network)
Solution needs to be adapted to particular cases (user
biometrics)
15
What We Talk About When We
Talk About“Learning”
Learning general models from data of particular
examples
Data is cheap and abundant (data warehouses, data
marts); knowledge is expensive and scarce.
Example in retail: Customer transactions to consumer
behavior:
People who bought “Blink” also bought “Outliers”
(www.amazon.com)
Build a model that is a good and useful approximation to
the data.
16
A Few Quotes
“A breakthrough in machine learning would be worth
ten Microsofts” (Bill Gates, Chairman, Microsoft)
“Machine learning is the next Internet”
(Tony Tether, Director, DARPA)
“Machine learning is the hot new thing”
(John Hennessy, President, Stanford University)
“Web rankings today are mostly a matter of machine
learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
“Machine learning is going to result in a real revolution”
(Greg Papadopoulos, CTO, Sun)
“Machine learning is today’s discontinuity”
(Jerry Yang, CEO, Yahoo)
17
So What Is Machine Learning?
Automating automation
Getting computers to program themselves
Writing software is the bottleneck
Let the data do the work instead!
18
Traditional Programming:
Data + Program → Computer → Output
Machine Learning:
Data + Output → Computer → Program
19
Magic?
No, more like gardening
Seeds = Algorithms
Nutrients = Data
Gardener = You
Plants = Programs
20
Sample Applications
Web search
Computational biology
Finance
E-commerce
Space exploration
Robotics
Information extraction
Social networks
Debugging
[Your favorite area]
21
ML in a Nutshell
Tens of thousands of machine learning
algorithms exist
Hundreds of new ones every year
Every machine learning algorithm has three
components:
Representation
Evaluation
Optimization
22
Representation
Decision trees
Sets of rules / Logic programs
Instances
Graphical models (Bayes/Markov nets)
Neural networks
Support vector machines
Model ensembles
Etc.
23
Evaluation
Accuracy
Precision and recall
F-measure
Squared error
Likelihood
Posterior probability
Cost / Utility
Margin
Entropy
K-L divergence (Kullback–Leibler divergence)
Etc.
24
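To make a few of these measures concrete, here is a minimal Python sketch (not from the slides; the label vectors are invented) that computes accuracy, precision, recall, and F-measure for a binary classifier:

```python
# Toy evaluation of a binary classifier: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # ground-truth labels (invented)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier's predictions (invented)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy  = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_measure)   # 0.75 0.75 0.75 0.75
```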
Optimization
Combinatorial optimization
E.g.: Greedy search
Convex optimization
E.g.: Gradient descent
Constrained optimization
E.g.: Linear programming
25
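As a minimal illustration of the convex case, the following sketch runs gradient descent on the toy one-dimensional function f(w) = (w - 3)²; the starting point and learning rate are arbitrary choices:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)        # derivative of (w - 3)^2

w, eta = 0.0, 0.1             # initial guess and learning rate (assumed values)
for _ in range(100):
    w -= eta * grad(w)        # step against the gradient
print(w)                      # approaches the minimizer w = 3
```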
Types of Learning
Supervised (inductive) learning
Training data includes desired outputs
goal to make predictions based on examples e.g. email
classification with already labeled emails.
Unsupervised learning
Training data does not include desired outputs
goal to reveal structure in the observed data e.g. cluster similar
documents based on text
(Figure: typical learning tasks include classification and regression (supervised),
clustering and anomaly detection (unsupervised), sequence labeling, …)
26
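To contrast the two paradigms concretely, here is a small sketch using scikit-learn (an assumed library choice; the toy data is invented). The classifier is given labels; the clustering algorithm must discover the groups on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # supervised
from sklearn.cluster import KMeans                   # unsupervised

# Toy data: two well-separated groups in a 2-D feature space.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])                 # desired outputs are available

clf = LogisticRegression().fit(X, y)             # classification: learns from (input, label) pairs
print(clf.predict([[0.1, 0.1], [5.0, 5.0]]))     # -> [0 1]

km = KMeans(n_clusters=2, n_init=10).fit(X)      # clustering: the labels y are never used
print(km.labels_)                                # groups recovered from structure alone
```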
Types of Learning
Semi-supervised learning (SSL)
Training data includes only a few desired outputs
Tries to improve predictions based on examples by
also making use of the additional "unlabeled" examples
e.g. speech analysis, a classic example of the value of
semi-supervised learning models
Reinforcement learning (RL)
Delayed and partial feedback, no explicit
guidance
Rewards come from a sequence of actions
e.g. using RL, AlphaGo Zero was able to learn the
game of Go from scratch by playing against itself.
After 40 days of self-training, AlphaGo Zero was
able to outperform the version of AlphaGo known
as Master, which had defeated world number one Ke Jie.
27
Inductive Learning
Given examples of a function (X, F(X))
Predict function F(X) for new examples X
Discrete F(X): Classification
Continuous F(X): Regression
F(X) = Probability(X): Probability estimation
28
What is Machine Learning?
Optimize a performance criterion using example
data or past experience.
Role of Statistics: Inference from a sample
Role of Computer Science: Efficient algorithms to
solve the optimization problem
represent and evaluate the model for inference
29
ML in Practice
Understanding domain, prior knowledge, and
goals
Data integration, selection, cleaning, pre-
processing, etc.
Learning models
Interpreting results
Consolidating and deploying discovered
knowledge
Loop: the process iterates over these stages.
30
Applications
Association
Supervised Learning
Classification
Regression
Decision tree induction
Rule induction
Instance-based learning
Bayesian learning
Neural networks
Support vector machines
Model ensembles
Learning theory
Unsupervised Learning
Clustering
Dimensionality reduction
Reinforcement Learning
31
Learning Associations
Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y,
where X and Y are products/services.
32
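A minimal sketch of estimating P(Y | X) from transaction data; the baskets and item names below are invented for illustration:

```python
# Each basket is the set of products bought in one transaction (invented data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def p_y_given_x(baskets, x, y):
    """Fraction of baskets containing x that also contain y."""
    with_x = [b for b in baskets if x in b]
    return sum(1 for b in with_x if y in b) / len(with_x)

print(p_y_given_x(baskets, "bread", "butter"))   # 2 of 3 bread baskets -> 0.667
```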
Classification
Example: Credit
scoring
Differentiating
between low-risk
and high-risk
customers from their
income and savings
33
Classification: Applications
Aka Pattern recognition
Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style
Character recognition: Different handwriting styles.
Speech recognition: Temporal dependency.
Medical diagnosis: From symptoms to illnesses
Biometrics: Recognition/authentication using physical and/or behavioral
characteristics: Face, iris, signature, etc
...
34
Face Recognition
Training examples of a person
Test images
ORL dataset,
AT&T Laboratories, Cambridge UK
35
Regression
Example: Price of a used car
x: car attributes
y: price
y = wx + w0
More generally, y = g(x | θ),
where g(·) is the model and θ its parameters.
36
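A minimal numpy sketch of fitting the line y = wx + w0 by least squares; the car data below is invented:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # car attribute, e.g. age in years (invented)
y = np.array([9.0, 7.9, 7.1, 6.2, 4.8])   # observed prices (invented)

w, w0 = np.polyfit(x, y, deg=1)            # least-squares fit of a degree-1 polynomial
print(w, w0)                               # learned parameters of g(x | theta)
print(w * 2.5 + w0)                        # predicted price for a new car with x = 2.5
```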
Supervised Learning: Uses
Prediction of future cases: Use the rule to predict the output for future inputs
Knowledge extraction: The rule is easy to understand
Compression: The rule is simpler than the data it explains
Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
37
Unsupervised Learning
Learning “what normally happens”
No output
Clustering: Grouping similar instances
Example applications
Customer segmentation in CRM software
Image compression: Color quantization
Bioinformatics: Learning motifs
38
Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability, ...
39
Resources: Datasets
UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
Statlib: http://lib.stat.cmu.edu/
Delve: http://www.cs.utoronto.ca/~delve/
40
Resources: Journals
Journal of Machine Learning Research www.jmlr.org
Machine Learning
Neural Computation
Neural Networks
IEEE Transactions on Neural Networks
IEEE Transactions on Pattern Analysis and Machine Intelligence
Annals of Statistics
Journal of the American Statistical Association
...
41
Resources: Conferences
International Conference on Machine Learning (ICML)
European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Uncertainty in Artificial Intelligence (UAI)
Computational Learning Theory (COLT)
International Conference on Artificial Neural Networks (ICANN)
International Conference on AI & Statistics (AISTATS)
International Conference on Pattern Recognition (ICPR)
...
42
CS-13410
Introduction to Machine Learning
by
Mudasser Naseer
Machine Learning
Machine learning is a field of computer science that uses
statistical techniques to give computer systems the ability to
"learn" (e.g., progressively improve performance on a
specific task) with data, without being explicitly
programmed. (Wikipedia)
Previously we defined machine learning as "optimizing a
performance criterion using example data or past
experience".
Now we will define the learning problem in terms of a
computer program.
44
Well-Posed Learning Problems
“A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with
experience E”.
To have a well-defined learning problem, we must identify
these three features: the class of tasks T, the measure of
performance to be improved P, and the source of experience E.
So we can define learning as:
Learning = improving with experience at some task
Improve over task T,
with respect to performance measure P,
based on experience E
45
Learning Problems-Examples
For example, a computer program that learns to play
checkers might improve its performance as measured by
its ability to win at the class of tasks involving playing
checkers games, through experience obtained by playing
games against itself.
T : Play checkers
P : % of games won against opponents
E : playing practice games against itself
46
More Learning Problems
A handwriting recognition learning problem:
Task T:
recognizing and classifying handwritten letters/words
within images
Performance measure P:
percent of letters/words correctly classified
Training experience E:
a database of handwritten letters/words with given
classifications
47
A robot driving learning problem:
Task T:
driving on public four-lane highways using vision sensors
Performance measure P:
average distance traveled before an error (as judged by
human overseer)
Training experience E:
a sequence of images and steering commands recorded
while observing a human driver
48
Designing a Learning System
To illustrate some of the basic design issues and approaches to
machine learning, let us consider designing a program to learn
to play checkers, with the goal of entering it in the world
checkers tournament.
50
Type of Training Experience
Direct or indirect feedback regarding the choices made by the system
Direct: training examples consists of individual checkers board states and
the correct move for each.
Indirect: information consisting of the move sequences and final
outcomes of various games played. Here the learner faces an additional
problem of credit assignment, (determining the degree to which each
move in the sequence deserves credit or blame for the final outcome).
Teacher or not
Teacher: The learner rely on the teacher to select informative board
states and to provide the correct move for each.
No teacher or assisted by teacher: the learner might itself propose some
confusing board states and ask the teacher for the correct move. Or the
learner may have complete control over both the board states and
(indirect) training classifications i.e. playing against itself with no teacher.
51
Type of Training Experience
A problem: is the training experience representative of the performance goal?
In general, learning is most reliable when the training examples follow a
distribution similar to that of future test examples.
Note: most current theories of machine learning rest on the crucial assumption
that the distribution of training examples is identical to the distribution of test
examples. However, this assumption is often violated in practice.
52
A checkers learning problem:
To proceed with our design, let us decide that our system
will train by playing games against itself.
Advantage: No external trainer needed and the system is
allowed to generate as much training data as time permits.
We now have a fully specified learning task.
Task T: playing checkers
Performance measure P: percent of games won in the world
tournament
Training experience E: games played against itself
53
A checkers learning problem:
In order to complete the design we must now choose
1. What exact type of knowledge should be learned?
2. How shall this knowledge be represented?
3. What specific algorithm should be used to learn it (the
learning mechanism)?
54
1. What exact type of knowledge should be learned?
(Choosing the Target Function)
Problem: what type of knowledge needs to be learned, and how
will it be used by the performance program?
Assume our checkers-playing program can generate all the legal
moves from any board state.
The program needs only to learn how to choose the best move from
among these legal moves.
This learning task is representative of a large class of tasks for which
the legal moves that define some large search space are known a priori,
but the best search strategy is not known.
Many optimization problems fall into this class, such as the problems of
scheduling and controlling manufacturing processes, where the
available manufacturing steps are well understood but the best
strategy for sequencing them is not.
55
Choosing the Target Function
Now the problem is transformed into learning a program or
function that chooses the best move for any given board
state. Say
ChooseMove : B → M
where B is the set of legal board states and
M is the set of legal moves.
Thus we have reduced the problem of improving
performance P at task T to the problem of learning some
particular target function such as ChooseMove, which
accepts as input any board from the set of legal
board states B and produces as output some move from
the set of legal moves M.
56
Choosing the Target Function
Given the kind of indirect training experience available here, this
function will turn out to be very difficult to learn.
An alternative, easier target function would be an evaluation
function that assigns a numerical score to any given board state.
Let us call this target function V, such that
V : B → ℜ, where ℜ is the set of real numbers.
This target function is intended to assign higher scores to better board
states. After learning this target function, the system can easily use it
to select the best move from any current board position. This is
accomplished by generating the successor board state produced
by every legal move, then using V to choose the best successor
state and therefore the best legal move.
57
Choosing the Target Function
What exactly should be the value of the target function V for any
given board state?
Any evaluation function that assigns higher scores to better board
states will do the trick. One such target function among many
defines the target value V(b) for an arbitrary board state b in B as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b′), where b′ is
the best final board state that can be achieved starting from b and
playing optimally until the end of the game (assuming the opponent
plays optimally as well).
58
Choosing the Target Function
While this recursive definition specifies a value V(b) for every
board state b, the definition is nonoperational, except in the
trivial cases (cases 1-3) in which the game has already ended.
Determining the value of V(b) for a particular non-final board
state (case 4) requires searching ahead for the optimal line of play,
all the way to the end of the game.
The above definition is therefore not efficiently computable;
we say it is a nonoperational definition.
The goal of learning in this case is to discover an
operational description of V; that is, a description that can be
used by the checkers-playing program to evaluate states and
select moves within realistic time bounds.
59
A checkers learning problem:
In order to complete the design we must now choose
1. What exact type of knowledge should be learned?
2. How shall this knowledge be represented?
3. What specific algorithm should be used to learn it (the
learning mechanism)?
60
Choosing the Target Function
Thus, we have reduced the learning task in this case to
the problem of discovering an operational description of
the ideal target function V.
It is very difficult in general to learn such an operational
form of V perfectly.
We often expect learning algorithms to acquire only some
approximation to the target function (hence the term
function approximation); we call this approximation V̂.
61
2. How shall this knowledge be represented?
(Choosing a Representation for the Target Function)
To represent V̂ we have many choices, such as a table of features, a collection
of rules against features of the board state, a quadratic polynomial, etc.
Let us choose a simple representation: a linear combination of the following board
features:
x1: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be captured on
red's next turn)
x6: the number of red pieces threatened by black
Thus, our learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where w0 through w6 are numerical coefficients, or weights, to be chosen by the
learning algorithm. These weights determine the relative importance of the various
board features in determining the value of the board.
62
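As a concrete rendering of this representation, the following sketch evaluates V̂(b) for a given feature vector. The weight values are placeholders, since choosing them is exactly the learning problem:

```python
def v_hat(board_features, weights):
    """Linear evaluation: weights = (w0, w1, ..., w6), board_features = (x1, ..., x6)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], board_features))

features = (3, 0, 1, 0, 0, 0)                        # x1..x6 for some board state
weights  = (0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5)    # arbitrary illustrative values
print(v_hat(features, weights))                      # the board's estimated value
```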
Partial design of a checkers learning program
Task T: playing checkers
Performance measure P: percent of games won in the world tournament
Training experience E: games played against itself
Target function: V : B → ℜ
Target function representation:
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
The first three items correspond to the specification of the learning task, whereas the
final two items constitute design choices for the implementation of the learning program.
Now the problem of learning a checkers strategy is reduced to the problem of learning
values for the coefficients w0 through w6.
63
CS-13410
Introduction to Machine Learning
by
Mudasser Naseer
A checkers learning problem:
In order to complete the design we must now choose
1. What exact type of knowledge should be learned?
2. How shall this knowledge be represented?
3. What specific algorithm should be used to learn it (the
learning mechanism)?
68
3. What specific algorithm should be used to learn it?
(Choosing a Function Approximation Algorithm)
To learn the target function V̂ we require a set of training
examples, each describing a specific board state b and the
training value Vtrain(b) for b. Thus, each training example is
an ordered pair of the form ⟨b, Vtrain(b)⟩.
The following training example describes a board state b in
which black has won the game (note that x2 = 0 indicates that red
has no remaining pieces), and the target function value is
therefore +100:
⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩
On the next slides we describe a procedure that first derives such
training examples from the indirect training experience, then
adjusts the weights wi to best fit these training examples.
69
3.1 Estimating Training Values
According to our formulation, the only training information
available to our learner is whether the game was
eventually won or lost, whereas we require training
examples that assign specific scores to specific board
states.
It is easy to assign a value to board states that correspond
to the end of the game.
Assigning training values to the more numerous
intermediate board states is a more difficult task: not all
intermediate states of a lost game are bad, and not all
intermediate states of a won game are good.
70
A simple and surprisingly successful approach is to assign
V̂(Successor(b)) to Vtrain(b), i.e., the rule for estimating
training values is
Vtrain(b) ← V̂(Successor(b))
where V̂ is the learner's current approximation to V and
Successor(b) is the board state following the program's
move and the opponent's response.
Notice that we are using the estimated value of
Successor(b) to estimate the value of board state b.
71
3.2 Adjusting the weights
All that remains is to choose the weights wi that best fit
the set of training examples ⟨b, Vtrain(b)⟩.
One common approach is to define the best hypothesis, or
set of weights, as the one that minimizes the squared error E
between the training values and the values predicted by the
hypothesis V̂, i.e., minimize
E ≡ Σ⟨b, Vtrain(b)⟩ ∈ training examples (Vtrain(b) − V̂(b))²
We also need an algorithm that incrementally refines the
weights as new training examples become available.
Several such algorithms exist; Least Mean Squares (LMS) is
one of them.
72
LMS weight update rule
For each observed training example, the LMS rule adjusts the
weights a small amount in the direction that reduces the error on
that training example.
It performs a stochastic gradient-descent search through
the space of possible hypotheses (weight values) to
minimize the squared error E.
For each training example ⟨b, Vtrain(b)⟩:
Use the current weights to calculate V̂(b)
For each weight wi, update it as
wi ← wi + η (Vtrain(b) − V̂(b)) xi
Here η is a small constant (e.g., 0.1) that moderates the
size of the weight update.
73
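A minimal Python sketch of the LMS training loop above; the training pairs are invented, and the bias weight w0 is updated using an implicit constant feature x0 = 1:

```python
def v_hat(x, w):                          # x = (x1..x6), w = (w0..w6)
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def lms_update(w, x, v_train, eta=0.1):
    """One LMS step: wi <- wi + eta * (Vtrain(b) - V̂(b)) * xi."""
    error = v_train - v_hat(x, w)
    w[0] += eta * error                   # bias weight: constant feature x0 = 1
    for i, xi in enumerate(x):
        w[i + 1] += eta * error * xi

w = [0.0] * 7                             # start with all weights at zero
examples = [((3, 0, 1, 0, 0, 0), 100.0),  # a board where black has won  -> +100
            ((0, 3, 0, 1, 0, 0), -100.0)] # a board where black has lost -> -100
for _ in range(50):                       # repeatedly sweep over the examples
    for x, v_train in examples:
        lms_update(w, x, v_train)
print([round(wi, 2) for wi in w])         # weights now fit both training values
```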
The Final Design
The final design of our checkers learning system is
described by the following four modules, which represent
the central components of many learning systems:
The Performance System
The Critic
The Generalizer
The Experiment Generator
74
The Performance System
The module that must solve the given performance task,
in this case playing checkers, by using the learned target
function(s).
It takes an instance of a new problem (a new game) as input
and produces a trace of its solution (the game history) as
output.
We expect its performance to improve as the evaluation
function V̂ becomes increasingly accurate.
75
The Critic
It takes as input the history or trace of the game and
produces as output a set of training examples of the
target function.
In our example, the Critic corresponds to the training rule
Vtrain(b) ← V̂(Successor(b))
76
Generalizer
It takes as input the training examples and produces an
output hypothesis that is its estimate of the target
function.
It generalizes from the specific training examples,
hypothesizing a general function that covers these
examples and cases beyond them.
In our example, the LMS algorithm is the Generalizer, and
the output hypothesis is the function V̂ described by the
learned weights w0, …, w6.
77
The Experiment Generator
It takes as input the current hypothesis (the currently
learned function) and outputs a new problem (i.e., an
initial board state).
Its main role is to pick new practice problems that will
maximize the learning rate of the overall system.
It may follow a very simple strategy, such as always
proposing the same initial game board to begin a new game;
more sophisticated strategies may be used to explore
particular regions of the state space.
78
Final design of the checkers learning program
80
Some concluding discussion
In the design choices above we have constrained the
learning task in a number of ways:
The type of knowledge is restricted to a single linear
evaluation function.
The evaluation function depends only on the six specific
board features provided.
If we improve the representation, our program has a better
chance of learning a good approximation to the true V
function.
81
Some real examples
IBM 701: the first major commercial computer
In 1949, Arthur Samuel (1901-1990) joined IBM's Poughkeepsie lab,
and in 1952 he programmed IBM's first major computer, the IBM
701, to play checkers. It was the first game-playing program to run
on a computer, and the first checkers program.
By 1955, Samuel had done something groundbreaking: he had
created a program that could learn, something that no one had
done before, and this was demonstrated by playing checkers on
television on February 24, 1956.
He published a seminal paper in 1959, titled "Some Studies in
Machine Learning Using the Game of Checkers", where he described
how a machine could look ahead "by evaluating the resulting
board positions much as a human player might do". The computer
started out losing to Samuel and eventually beat him.
Samuel's program was based on Shannon's minimax strategy for
finding the best move from a given current position.
82
Solving Checkers (2007)
After Samuel's work on checkers, there was a false
impression that checkers was a "solved" game. As a result,
researchers moved on to chess and mostly ignored
checkers until Jonathan Schaeffer began working on
Chinook in 1989. Schaeffer's goal was to develop a program
capable of defeating the best checkers player.
In 2007 Schaeffer and his team published a paper in the
journal Science titled "Checkers Is Solved": the program
could no longer be defeated by anyone, human or
otherwise.
83
Solving Checkers (2007)
Marion Franklin Tinsley (February 3, 1927 - April 3, 1995) was an
American mathematician and checkers player, considered the
greatest checkers player who ever lived. Tinsley was world
champion 1955-1958 and 1975-1991, never lost a world
championship match, and lost only seven games (two of them to
the Chinook computer program) in his 45-year career.
Checkers is extremely complex: it has roughly 500 billion billion
(5 x 10²⁰) possible positions, compared to roughly 10⁴⁷ for chess
and 10²⁵⁰ for Go. Even though checkers is simpler than those
games, it is still complex enough that a brute-force-only approach
is impractical.
Checkers is the largest game that has been solved to date, with a
search space of 5 x 10²⁰. "The number of calculations involved
was 10¹⁴, which were done over a period of 18 years. The process
involved about 200 desktop computers at its peak, down to around 50."
84
Backgammon
Tesauro (1992, 1995) reports a design, similar to our final design of
the checkers learning program, for a program that learns to play the
game of Backgammon by learning a very similar evaluation
function over states of the game.
The program represents the learned evaluation function using an
artificial neural network that considers the complete description
of the board state rather than a subset of board features.
After training on over one million self-generated training games,
his program was able to play very competitively against
top-ranked human Backgammon players.
85
A classic example of computer brilliance is Google's "AlphaGo", which can
beat professional players at the Chinese game Go. In recent years AlphaGo
went on a spree, defeating Go masters anonymously. In October 2015, AlphaGo
became the first computer Go program to beat a human professional Go
player without handicaps on a full-sized 19×19 board.
At the beginning of 2017, it was revealed to the public, as well as to the
masters it had defeated, that this mysterious Go player was an AI all along.
Originally, the problem with solving Go was the sheer number of possible
moves (approximately 10¹⁷⁰, which is more than the total number of atoms
in the universe).
A computer cannot brute-force a winning strategy, as a human will have
the upper hand thanks to intuition.
86
Therefore, the Google team took a different approach and developed a system
that could almost be considered a form of synthetic intuition.
The system runs different tasks, such as identifying important moves, and
strikes a balance between tree search, a policy network, and a value network.
This balancing is similar to how the human brain makes decisions, where
different aspects of a situation are considered. For example:
A person playing chess does not always think of every possible move and plan
according to immediate consequences.
The player may make moves that seem bizarre to a computer, such as a self-sacrifice
which in the short term appears to make the player lose but in the long term could be
vital for a victory.
Other techniques include confusion, whereby a losing player may make "crazy"
moves that confuse other players and keep their true strategy hidden.
87
AlphaGo uses a Monte Carlo tree search algorithm to find
its moves, guided by knowledge previously "learned"
by machine learning, specifically by an artificial neural
network (a deep learning method) trained extensively
on both human and computer play.
88
AlphaZero
AlphaZero is a computer program developed by the artificial
intelligence research company DeepMind to master the games
of chess, shogi, and Go. The algorithm uses an approach similar
to AlphaGo Zero.
On December 5, 2017, the DeepMind team released
a preprint introducing AlphaZero, which within 24 hours of
training achieved a superhuman level of play in these three
games by defeating world-champion programs Stockfish, elmo,
and the 3-day version of AlphaGo Zero.
89
AlphaZero
AlphaZero was trained solely via "self-play" using 5,000 first-
generation Tensor Processing Units (TPUs) to generate the
games and 64 second-generation TPUs to train the neural
networks, all in parallel, with no access to opening
books or endgame tables.
After four hours of training, DeepMind estimated AlphaZero
was playing at a higher Elo rating than Stockfish 8; after 9
hours of training, the algorithm defeated Stockfish 8 in a time-
controlled 100-game tournament (28 wins, 0 losses, and 72
draws).
The trained algorithm played on a single machine with four
TPUs.
91
CS-13410
Introduction to Machine Learning
(Concept Learning Ch # 2 by Tom Mitchell)
by
Mudasser Naseer
Concept Learning
• Inducing general functions from specific training examples is a
main issue of machine learning.
• Concept Learning: Acquiring the definition of a general category
from given sample positive and negative training examples of the
category.
• Concept Learning can be seen as a problem of searching
through a predefined space of potential hypotheses for the
hypothesis that best fits the training examples.
• The hypothesis space has a general-to-specific ordering of
hypotheses, and the search can be efficiently organized by taking
advantage of a naturally occurring structure over the hypothesis
space.
93
Concept Learning
• A Formal Definition for Concept Learning:
Inferring a boolean-valued function from training examples of
its input and output.
• An example of concept learning is learning the concept "bird" from
given examples of birds (positive examples) and non-birds (negative
examples).
• We are trying to learn the definition of a concept from given examples.
• Here we have a domain, e.g. animals,
• and the concept is a subset of the domain, e.g. birds ⊆ animals.
• Our task is to acquire an intensional concept description from training
examples.
• Generally we can't look at all objects in the domain.
94
A Concept Learning Task – Enjoy Sport Training Examples
• A set of example days, each described by six attributes.
• The task is to learn to predict the value of EnjoySport for an
arbitrary day, based on the values of its attributes.

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes
         (ATTRIBUTES)                                       (CONCEPT)

• Examples: "days on which my friend Aldo enjoys his favorite water sport"
• Result: a classifier for days = a description of Aldo's behavior
95
EnjoySport – Hypothesis Representation
Many possible representations exists, one is given below
• Each hypothesis h consists of a conjunction of
constraints on the instance attributes.
• Here h will be a vector of six constraints, specifying the
values of the six attributes
(Sky, AirTemp, Humidity, Wind, Water, and Forecast).
96
Hypothesis Representation
• For example, a hypothesis:
Sky AirTemp Humidity Wind Water Forecast
< Sunny, ?, ?, Strong, ?, Same >
• The most general hypothesis, that every day is a positive example:
<?, ?, ?, ?, ?, ?>
• The most specific hypothesis, that no day is a positive example:
< ∅, ∅, ∅, ∅, ∅, ∅ >
• The EnjoySport concept learning task requires learning the set of days for
which EnjoySport = Yes, describing this set by a conjunction of
constraints over the instance attributes.
• We write h(x) = 1 for a day x if x satisfies the description.
• Note that much more expressive languages exist.
97
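One possible Python encoding of this hypothesis representation (an illustrative sketch; the string "0" stands in for the empty constraint ∅):

```python
def satisfies(h, x):
    """h(x) = 1 iff every constraint in h is '?' or matches x; ∅ rejects everything."""
    return all(c == "?" or c == v for c, v in zip(h, x)) and "0" not in h

h = ("Sunny", "?", "?", "Strong", "?", "Same")
x = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
print(satisfies(h, x))   # True: day x is classified as a positive example
```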
Prototypical Concept Learning Task
Given
– Instances X : set of all possible days, each described by the attributes
• Sky – (values: Sunny, Cloudy, Rainy)
• AirTemp – (values: Warm, Cold)
• Humidity – (values: Normal, High)
• Wind – (values: Strong, Weak)
• Water – (values: Warm, Cold)
• Forecast – (values: Same, Change)
– Target Concept (Function) c : EnjoySport : X → {0, 1}
– Hypotheses H : Each hypothesis is described by a conjunction of constraints on
the attributes, e.g. (?, Cold, High, ?, ?, ?).
– Training Examples D : positive and negative examples of the target function:
⟨x1, c(x1)⟩, …, ⟨xm, c(xm)⟩
Determine
– A hypothesis h in H such that h(x) = c(x) for all x in D.
98
The Inductive Learning Hypothesis
• Although the learning task is to determine a hypothesis h
identical to the target concept c over the entire set of instances X,
the only information available about c is its value over the
training examples.
– Inductive learning algorithms can at best guarantee that the output hypothesis fits the
target concept over the training data.
– Lacking any further information, our assumption is that the best hypothesis
regarding unseen instances is the hypothesis that best fits the observed training
data. This is the fundamental assumption of inductive learning.
100
Enjoy Sport - Hypothesis Space
• Assume that Sky has 3 possible values, and the other 5 attributes have 2
possible values each.
• There are 96 (= 3·2·2·2·2·2) distinct instances in X (the instance space).
• There are 5120 (= 5·4·4·4·4·4) syntactically distinct hypotheses in H.
– Two more values per attribute: ? and ∅
• Every hypothesis containing one or more ∅ symbols represents the
empty set of instances; that is, it classifies every instance as negative.
• There are 973 (= 1 + 4·3·3·3·3·3) semantically distinct hypotheses in H.
– Only one more value per attribute (?), plus one hypothesis representing the
empty set of instances.
• Although EnjoySport has a small, finite hypothesis space, most learning
tasks have much larger (even infinite) hypothesis spaces.
– We need efficient search algorithms over the hypothesis space.
101
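The counts above can be checked with a few lines of Python:

```python
values = [3, 2, 2, 2, 2, 2]        # number of possible values per attribute

instances = syntactic = semantic = 1
for v in values:
    instances *= v                 # 3*2*2*2*2*2          = 96 distinct instances
    syntactic *= v + 2             # add "?" and the empty symbol: 5*4^5 = 5120
    semantic  *= v + 1             # add only "?": 4*3^5  = 972 ...
semantic += 1                      # ... plus one all-empty hypothesis = 973

print(instances, syntactic, semantic)   # 96 5120 973
```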
General-to-Specific Ordering of Hypotheses
• Many algorithms for concept learning organize the search through
the hypothesis space by relying on a general-to-specific ordering of
hypotheses.
• By taking advantage of this naturally occurring structure over the
hypothesis space, we can design learning algorithms that exhaustively
search even infinite hypothesis spaces without explicitly enumerating
every hypothesis.
• Consider two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
• Now consider the sets of instances that are classified positive by h1 and
by h2.
– Because h2 imposes fewer constraints on the instance, it classifies more
instances as positive.
– In fact, any instance classified positive by h1 will also be classified positive by h2.
– Therefore, we say that h2 is more general than h1.
102
More-General-Than Relation
• For any instance x in X and hypothesis h in H, we say that x satisfies h
if and only if h(x) = 1.
• More-General-Than-Or-Equal Relation:
Let h1 and h2 be two boolean-valued functions defined over X.
Then h1 is more-general-than-or-equal-to h2 (written h1 ≥ h2) if
and only if any instance that satisfies h2 also satisfies h1.
103
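For this conjunctive representation the relation can be tested attribute by attribute, as in the following sketch (it ignores the vacuous case of hypotheses containing ∅, which cover no instances):

```python
def more_general_or_equal(h1, h2):
    """h1 >= h2: every instance satisfying h2 also satisfies h1."""
    def covers(c1, c2):            # constraint c1 at least as general as constraint c2
        return c1 == "?" or (c1 == c2 and c2 != "?")
    return all(covers(c1, c2) for c1, c2 in zip(h1, h2))

h1 = ("Sunny", "?", "?", "?", "?", "?")
h2 = ("Sunny", "?", "?", "Strong", "?", "?")
print(more_general_or_equal(h1, h2))   # True:  h1 imposes fewer constraints
print(more_general_or_equal(h2, h1))   # False: h2 is strictly more specific
```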
More-General-Relation
105
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
– For each attribute constraint ai in h
• If the constraint ai is satisfied by x, then do nothing
• Else replace ai in h by the next more general constraint that is
satisfied by x
3. Output hypothesis h
106
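A direct Python transcription of FIND-S, run on the EnjoySport training set from the slides (the string "0" again stands for the empty constraint ∅):

```python
def find_s(examples):
    h = ["0"] * 6                        # most specific hypothesis in H
    for x, label in examples:
        if label != "Yes":               # negative examples are simply ignored
            continue
        for i, (c, v) in enumerate(zip(h, x)):
            if c == "0":
                h[i] = v                 # first positive example: copy its values
            elif c != v:
                h[i] = "?"               # generalize any mismatching constraint
    return h

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(examples))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```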
FIND-S Algorithm
Initialize h to the most specific hypothesis in H:
h ← (∅, ∅, ∅, ∅, ∅, ∅)
Upon observing the first training example, it becomes clear that our
hypothesis is too specific. None of the "∅" constraints is satisfied by
this example, so each is replaced by the next more general
constraint that fits the example, i.e.
h ← (Sunny, Warm, Normal, Strong, Warm, Same)
This h is still very specific. The second training example (also positive)
forces the algorithm to further generalize h, replacing with "?" any
attribute value in h that is not satisfied by the new example. The
refined hypothesis is
h ← (Sunny, Warm, ?, Strong, Warm, Same)
After the fourth (positive) example our hypothesis becomes
h ← (Sunny, Warm, ?, Strong, ?, ?)
107
FIND-S Algorithm - Example
(Figure: instances X and hypotheses H, ordered from specific to general.)
x1 = <Sunny Warm Normal Strong Warm Same>, +    h0 = <∅ ∅ ∅ ∅ ∅ ∅>
x2 = <Sunny Warm High Strong Warm Same>, +      h1 = <Sunny Warm Normal Strong Warm Same>
x3 = <Rainy Cold High Strong Warm Change>, -    h2 = h3 = <Sunny Warm ? Strong Warm Same>
x4 = <Sunny Warm High Strong Cool Change>, +    h4 = <Sunny Warm ? Strong ? ?>
Now classify the following examples as Y or N:
A. (Sunny Warm Normal Strong Cool Change) ?
B. (Rainy Cool Normal Light Warm Same) ?
C. (Sunny Warm Normal Light Warm Same) ?
D. (Sunny Cold Normal Strong Warm Same) ?
108
The Role of Negative Examples
Basically, the negative examples are simply ignored!
But consider a negative example like
(Sunny, Warm, High, Strong, Freezing, Change)
This example is incorrectly classified as positive by the
learned hypothesis h = <Sunny, Warm, ?, Strong, ?, ?>.
110
Unanswered Questions by FIND-S Algorithm
• Are the training examples consistent?
– In most practical learning problems there is some chance that the training
examples will contain at least some errors or noise.
– Such inconsistent sets of training examples can severely mislead FIND-S, given
the fact that it ignores negative examples.
– We would prefer an algorithm that could at least detect when the training
data is inconsistent and, preferably, accommodate such errors.
Consistent(h, D) ≡ (∀⟨x, c(x)⟩ ∈ D) h(x) = c(x)
114
Version Spaces
• The Candidate-Elimination algorithm represents the set of
all hypotheses consistent with the observed training examples.
• This subset of all hypotheses is called the version space with
respect to the hypothesis space H and the training examples D,
because it contains all plausible versions of the target concept.
115
List-Then-Eliminate Algorithm
• List-Then-Eliminate algorithm initializes the version space to contain
all hypotheses in H, then eliminates any hypothesis found inconsistent
with any training example.
• The version space of candidate hypotheses thus shrinks as more
examples are observed, until ideally just one hypothesis remains that
is consistent with all the observed examples.
– Presumably, this is the desired target concept.
– If insufficient data is available to narrow the version space to a single hypothesis,
then the algorithm can output the entire set of hypotheses consistent with the
observed data.
• List-Then-Eliminate algorithm can be applied whenever the
hypothesis space H is finite.
– It has many advantages, including the fact that it is guaranteed to output all
hypotheses consistent with the training data.
– Unfortunately, it requires exhaustively enumerating all hypotheses in H - an
unrealistic requirement for all but the most trivial hypothesis spaces.
116
List-Then-Eliminate Algorithm
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example ⟨x, c(x)⟩: remove from VersionSpace
any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
117
Compact Representation of Version Spaces
• A version space can be represented with its general and specific
boundary sets.
• The Candidate-Elimination algorithm represents the version space
by storing only its most general members G and its most specific
members S.
• Given only these two sets S and G, it is possible to enumerate all
members of a version space by generating hypotheses that lie between
these two sets in general-to-specific partial ordering over hypotheses.
• Every member of the version space lies between these boundaries
VS_{H,D} = { h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥ h ≥ s) }
118
Example Version Space
S: { <Sunny, Warm, ?, Strong, ?, ?> }
<Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: { <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?> }
119
Representing Version Spaces
S: { <Sunny, Warm, ?, Strong, ?, ?> }
<Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: { <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?> }
1. The General boundary, G, of version space VS_{H,D} is the set of its maximally
general members that are consistent with the given training set.
2. The Specific boundary, S, of version space VS_{H,D} is the set of its maximally
specific members that are consistent with the given training set.
3. Every member of the version space lies between these boundaries:
VS_{H,D} = { h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥ h ≥ s) }
where x ≥ y means x is more general than or equal to y.
120
Candidate-Elimination Algorithm
• The Candidate-Elimination algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training
examples.
• It begins by initializing the version space to the set of all hypotheses in H; that
is, by initializing the G boundary set to contain the most general hypothesis in
H
G0 ← { <?, ?, ?, ?, ?, ?> }
and initializing the S boundary set to contain the most specific
hypothesis S0 ← { <∅, ∅, ∅, ∅, ∅, ∅> }
• These two boundary sets delimit the entire hypothesis space, because every
other hypothesis in H is both more general than S0 and more specific than
G0.
• As each training example is considered, the S and G boundary sets are
generalized and specialized, respectively, to eliminate from the version space any
hypotheses found inconsistent with the new training example.
• After all the examples have been processed, the computed version space
contains all the hypotheses consistent with these examples, and only these
hypotheses.
121
Candidate-Elimination Algorithm
Input: training set
Output:
• G = maximally general hypotheses in H
• S = maximally specific hypotheses in H
• Algorithm:
• Initialize G to the set of maximally general hypotheses in H
• Initialize S to the set of maximally specific hypotheses in H
• For each training example d, do
– If d is a positive example (we tend to generalize specific hypothesis)
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
– Remove s from S
– Add to S all minimal generalizations h of s such that
» h is consistent with d, and some member of G is more general than h
– Remove from S any hypothesis that is more general than another hypothesis in S
122
Candidate-Elimination Algorithm
– If d is a negative example (we tend to make general hypothesis more
specific)
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
– Remove g from G
– Add to G all minimal specializations h of g such that
» h is consistent with d, and some member of S is more specific than h
– Remove from G any hypothesis that is less general than another hypothesis in G
123
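The following compact Python sketch implements the two cases above for conjunctive hypotheses over finite attribute domains. It is illustrative rather than definitive (a full implementation would handle a few more pruning edge cases), but it reproduces the EnjoySport trace shown on the next slides:

```python
def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general(h1, h2):                 # h1 >= h2, checked attribute by attribute
    return all(a == "?" or b == "0" or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):                 # minimal generalization of s that covers x
    return tuple(v if c == "0" else (c if c == v else "?") for c, v in zip(s, x))

def min_specializations(g, domains, x):   # minimal specializations of g that exclude x
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == "?"
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {("0",) * n}, {("?",) * n}     # "0" plays the role of the empty constraint
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}
            S = {min_generalize(s, x) for s in S}
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not covers(s, x)}
            G = {h for g in G
                 for h in (min_specializations(g, domains, x) if covers(g, x) else [g])
                 if any(more_general(h, s) for s in S)}
            G = {g for g in G if not any(h != g and more_general(h, g) for h in G)}
    return S, G

domains = (("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"))
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(examples, domains)
print(S)   # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print(G)   # {('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')}
```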
Example Trace
S0: {<∅, ∅, ∅, ∅, ∅, ∅>}
S1: {<Sunny, Warm, Normal, Strong, Warm, Same>}
S2: {<Sunny, Warm, ?, Strong, Warm, Same>}
G0 = G1 = G2: {<?, ?, ?, ?, ?, ?>}
Training examples:
1. <Sunny, Warm, Normal, Strong, Warm, Same>, EnjoySport = Yes
2. <Sunny, Warm, High, Strong, Warm, Same>, EnjoySport = Yes
125
Example Trace
S2 = S3: {<Sunny, Warm, ?, Strong, Warm, Same>}
G2: {<?, ?, ?, ?, ?, ?>}
G3: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
Training example:
3. <Rainy, Cold, High, Strong, Warm, Change>, EnjoySport = No
126
Candidate-Elimination Algorithm - Example
• Given that there are six attributes that could be specified to specialize
G2, why are there only three new hypotheses in G3?
• For example, the hypothesis h = <?, ?, Normal, ?, ?, ?> is a minimal
specialization of G2 that correctly labels the new example as a negative
example, but it is not included in G3.
– The reason this hypothesis is excluded is that it is inconsistent with S2.
– The algorithm determines this simply by noting that h is not more general than the
current specific boundary, S2.
• In fact, the S boundary of the version space forms a summary of the
previously encountered positive examples that can be used to determine
whether any given hypothesis is consistent with these examples.
• The G boundary summarizes the information from previously
encountered negative examples. Any hypothesis more specific than G is
assured to be consistent with past negative examples
127
Example Trace
S3: {<Sunny, Warm, ?, Strong, Warm, Same>}
S4: {<Sunny, Warm, ?, Strong, ?, ?>}
G3: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
G4: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
Training example:
4. <Sunny, Warm, High, Strong, Cool, Change>, EnjoySport = Yes
128
Candidate-Elimination Algorithm - Example
• The fourth training example further generalizes the S boundary
of the version space.
• It also results in removing one member of the G boundary,
because this member fails to cover the new positive example.
– To understand the rationale for this step, it is useful to consider why
the offending hypothesis must be removed from G.
– Notice it cannot be specialized, because specializing it would not make it
cover the new example.
– It also cannot be generalized, because by the definition of G, any more
general hypothesis will cover at least one negative training example.
– Therefore, the hypothesis must be dropped from the G boundary, thereby
removing an entire branch of the partial ordering from the version space of
hypotheses remaining under consideration
129
Candidate-Elimination Algorithm – Example
Final Version Space
• After processing these four examples, the boundary sets S4 and
G4 delimit the version space of all hypotheses consistent with
the set of incrementally observed training examples.
• This learned version space is independent of the sequence in
which the training examples are presented (because in the end it
contains all hypotheses consistent with the set of examples).
• As further training data is encountered, the S and G boundaries
will move monotonically closer to each other, delimiting a
smaller and smaller version space of candidate hypotheses.
130
Properties of the two Sets
S can be seen as the summary of the positive examples
Any hypothesis more general than S covers all positive
examples
Other hypotheses fail to cover at least one positive example
G can be seen as the summary of the negative examples
Any hypothesis more specific than G covers no previous
negative example
Other hypotheses cover at least one negative example
131
Resulting Version Space
S: {<Sunny, Warm, ?, Strong, ?, ?>}
<Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
132
Properties
• If there is a consistent hypothesis then the algorithm will
converge to S = G = {h} when enough examples are provided
• False examples may cause the removal of the correct h
• If the examples are inconsistent, S and G become empty
• This can also happen, when the concept to be learned is not in H
133
What Next Training Example?
S: {<Sunny, Warm, ?, Strong, ?, ?>}
<Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
How should a new instance be classified while the version space
still contains several hypotheses? Options:
– Rejection, or
– Majority vote (used in the following)
Instance A is classified as a positive instance by every hypothesis in the
current version space.
Because the hypotheses in the version space unanimously agree that this is
a positive instance, the learner can classify instance A as positive with the
same confidence it would have if it had already converged to the single,
correct target concept.
Regardless of which hypothesis in the version space is eventually found to
be the correct target concept, it is already clear that it will classify instance
A as a positive example.
136
How Can Partially Learned Concepts Be Used?
Notice furthermore that we need not enumerate every hypothesis in the version
space in order to test whether each classifies the instance as positive.
– This condition will be met if and only if the instance satisfies every member of S.
– The reason is that every other hypothesis in the version space is at least as general as
some member of S.
– By our definition of more-general-than, if the new instance satisfies all members
of S it must also satisfy each of these more general hypotheses.
Instance B is classified as a negative instance by every hypothesis in the
version space.
– This instance can therefore be safely classified as negative, given the partially
learned concept.
– An efficient test for this condition is that the instance satisfies none of the members of
G.
137
How Can Partially Learned Concepts Be Used?
Instance C: Half of the version space hypotheses classify instance C as
positive and half classify it as negative.
– Thus, the learner cannot classify this example with confidence until further
training examples are available.
Instance D is classified as positive by two of the version space hypotheses
and negative by the other four hypotheses.
– In this case we have less confidence in the classification than in the
unambiguous cases of instances A and B.
– Still, the vote is in favor of a negative classification, and one approach we
could take would be to output the majority vote, perhaps with a confidence
rating indicating how close the vote was.
138
Example
Use candidate-elimination algorithm to build Specific and
General boundaries for the following data set.
139
Example-cont…
S0: { <∅, ∅, ∅, ∅, ∅> }
S1: { <Japan, Honda, Blue, 1980, Economy> }
S2: { <Japan, Honda, Blue, 1980, Economy> }
S3: { <Japan, ?, Blue, ?, Economy> }
S4: { <Japan, ?, Blue, ?, Economy> }
S5: { <Japan, ?, ?, ?, Economy> }
S6: { <Japan, ?, ?, ?, Economy> }
G6: { <Japan, ?, ?, ?, Economy> }
G5: { <Japan, ?, ?, ?, Economy> }
G4: { <?, ?, Blue, ?, ?>, <Japan, ?, ?, ?, Economy>, <?, ?, Blue, ?, Economy> } (remove
from G any hypothesis that is less general than another hypothesis in G)
G3: { <?, ?, Blue, ?, ?>, <?, ?, ?, ?, Economy> }
G2: { <?, Honda, ?, ?, ?>, <?, ?, Blue, ?, ?>, <?, ?, ?, 1980, ?>, <?, ?, ?, ?, Economy> }
G1: { <?, ?, ?, ?, ?> }
G0: { <?, ?, ?, ?, ?> }
140
Example-Cont…
What will happen if we have another example in our data?
141
Inductive Bias - Fundamental Questions
for Inductive Inference
The Candidate-Elimination Algorithm will converge toward the true target
concept provided it is given accurate training examples and provided its
initial hypothesis space contains the target concept.
142
Inductive Bias - A Biased Hypothesis Space
In the EnjoySport example, we restricted the hypothesis space to include
only conjunctions of attribute values.
– Because of this restriction, the hypothesis space is unable to represent even simple
disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy."
• From the first two examples, S2 : <?, Warm, Normal, Strong, Cool, Change>
• This is inconsistent with the third example, and there are no hypotheses
consistent with all three examples.
PROBLEM: We have biased the learner to consider only conjunctive hypotheses.
We require a more expressive hypothesis space.
143
Inductive Bias - An Unbiased Learner
The obvious solution to the problem of assuring that the target concept is in the
hypothesis space H is to provide a hypothesis space capable of representing
every teachable concept.
– That is, every possible subset of the instances X: the power set of X.
144
Inductive Bias - An Unbiased Learner :
Problem
• Let the hypothesis space H be the power set of X.
– A hypothesis can then be represented with disjunctions, conjunctions, and negations of our
earlier hypotheses.
– The target concept "Sky = Sunny or Sky = Cloudy" could then be described as
<Sunny, ?, ?, ?, ?, ?> ∨ <Cloudy, ?, ?, ?, ?, ?>
145
Inductive Bias –
Fundamental Property of Inductive Inference
• A learner that makes no a priori assumptions regarding the identity
of the target concept has no rational basis for classifying any unseen
instances.
146
Inductive Bias – Three Learning Algorithms
ROTE-LEARNER: Learning corresponds simply to storing each observed training
example in memory. Subsequent instances are classified by looking them up in
memory. If the instance is found in memory, the stored classification is returned.
Otherwise, the system refuses to classify the new instance.
Inductive Bias: No inductive bias
CANDIDATE-ELIMINATION: New instances are classified only in the case where all
members of the current version space agree on the classification. Otherwise, the
system refuses to classify the new instance.
Inductive Bias: the target concept can be represented in its hypothesis space.
FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent
with the training examples. It then uses this hypothesis to classify all subsequent
instances.
Inductive Bias: the target concept can be represented in its hypothesis space, and all
instances are negative instances unless the opposite is entailed by its other knowledge.
147
Concept Learning - Summary
Concept learning can be seen as a problem of searching through a
large predefined space of potential hypotheses H.
The general-to-specific partial ordering of hypotheses provides a
useful structure for organizing the search through the hypothesis
space H.
The FIND-S algorithm utilizes this general-to-specific ordering,
performing a specific-to-general search through the hypothesis space
along one branch of the partial ordering, to find the most specific
hypothesis consistent with the training examples.
The CANDIDATE-ELIMINATION algorithm utilizes this general-
to-specific ordering to compute the version space (the set of all
hypotheses consistent with the training data) by incrementally
computing the sets of maximally specific (S) and maximally general
(G) hypotheses.
148
Concept Learning - Summary
Because the S and G sets delimit the entire set of hypotheses
consistent with the data, they provide the learner with a description of
its uncertainty regarding the exact identity of the target concept.
The CANDIDATE-ELIMINATION algorithm is not robust to noisy
data or to situations in which the unknown target concept is not
expressible in the provided hypothesis space.
Inductive learning algorithms are able to classify unseen examples
only because of their implicit inductive bias for selecting one
consistent hypothesis over another.
If the hypothesis space is enriched to the point where there is a
hypothesis corresponding to every possible subset of instances (the
power set of the instances), this will remove any inductive bias from
the CANDIDATE-ELIMINATION algorithm. Unfortunately, this also
removes the ability to classify any instance beyond the observed
training examples. An unbiased learner cannot make inductive leaps to
classify unseen examples. 149
CS-13410
Introduction to Machine Learning
(Supervised Learning – Decision Trees)
by
Mudasser Naseer
Supervised Learning
• Supervised learning is the machine learning task of learning a
  function that maps an input to an output based on example
  input-output pairs.
• It infers a function from labeled training data consisting of a set
  of training examples.
• Each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called
the supervisory signal).
(Wikipedia)
151
Supervised Learning – Detailed Definition
In supervised learning we have input variables (X) and an output
variable (Y) and
we use an algorithm to learn the mapping function from the input
to the output. Y = f(X)
The goal is to approximate the mapping function so well that, when you
have new input data (X), you can predict the output variables (Y) for
that data.
It is called supervised learning because the process of an
algorithm learning from the training dataset can be thought of as
a teacher supervising the learning process.
We know the correct answers, the algorithm iteratively makes
predictions on the training data and is corrected by the teacher.
Learning stops when the algorithm achieves an acceptable level
of performance.
152
Supervised Learning
Supervised learning problems can be further grouped into
regression and classification problems.
Classification: A classification problem is when the output
variable is a category, such as “red” or “blue” or “disease”
and “no disease”.
Regression: A regression problem is when the output
variable is a real value, such as “dollars” or “weight”.
153
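A minimal sketch contrasting the two problem types, assuming scikit-learn is available; the toy data is illustrative only:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0], [1], [2], [3]]                  # input variables (X)

# Classification: the output variable is a category.
clf = DecisionTreeClassifier().fit(X, ["red", "red", "blue", "blue"])
print(clf.predict([[1.5]]))               # -> a category label

# Regression: the output variable is a real value.
reg = DecisionTreeRegressor().fit(X, [10.0, 12.5, 20.0, 24.0])
print(reg.predict([[1.5]]))               # -> a real value
```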
Catching tax-evasion
[Table: training records with attributes Tid, Refund, Marital Status,
Taxable Income, and the class label Cheat.]
155
What is classification (cont…)
The target function f is known as a classification model
156
Examples of Classification Tasks
Predicting tumor cells as benign or malignant
157
General approach to classification
Training set consists of records with known class labels
158
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Training Set -> Learning algorithm (Induction) -> Learn Model -> Model
Model -> Apply Model (Deduction) -> labels for the Test Set
159
Evaluation of classification models
Counts of test records that are correctly (or incorrectly) predicted
by the classification model

Confusion matrix:
                          Predicted Class
                          Class = 1   Class = 0
  Actual    Class = 1     f11         f10
  Class     Class = 0     f01         f00
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
162
Decision Trees
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
163
Example of a Decision Tree
[Figure: training data (Tid, Refund, Marital Status, Taxable Income,
Cheat) and the decision tree model built from it; internal nodes test
the splitting attributes, leaves carry the class labels.]
164
Another Example of Decision Tree
[Figure: an alternative tree for the same training data, splitting
first on MarSt (Married vs. Single, Divorced) and then on Refund and
Taxable Income. More than one tree can fit the same data.]
165
Decision Tree Classification Task
[Figure: the induction/deduction flow of slide 159, with the model now
a decision tree: Training Set -> Tree Induction algorithm -> Decision
Tree; Decision Tree + Test Set -> Apply Model -> predicted labels.]
166
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and route the record through the tests:

Refund?
  Yes -> leaf NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 <= 80K -> leaf NO
                                 > 80K  -> leaf YES
           Married          -> leaf NO

Refund = No sends the record to the MarSt test; Marital Status =
Married sends it to the leaf NO. Assign Cheat to "No".
167-172
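The traversal above can be read as nested conditionals. A minimal sketch of this particular tree (attribute names as in the slides):

```python
def classify(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"                          # leaf NO
    if marital_status == "Married":
        return "No"                          # MarSt = Married: leaf NO
    # Single or Divorced: test Taxable Income
    return "Yes" if taxable_income > 80_000 else "No"

print(classify("No", "Married", 80_000))     # -> "No" (Assign Cheat to "No")
```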
Decision Tree Induction
Greedy strategy:
Split the records based on an attribute test that optimizes a certain
criterion.
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
174
ID3 Algorithm
ID3 (Iterative Dichotomiser 3) is an algorithm
invented by Ross Quinlan used to generate a decision
tree from a dataset.
C4.5 is its successor.
These algorithms employ a top-down, greedy search
through the space of possible decision trees.
175
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is selecting
which attribute should be tested at the root of the tree and
then at each node in the tree
177
Entropy
Entropy of a data set D is denoted by H(D).
  The Ci are the possible classes.
  pi = fraction of records from D that have class Ci.

    H(D) = − Σ_{i=1}^{m} p_i log2(p_i)

The range of entropy is from 0 to log2(m), where m is the number of
classes. The maximum value is attained when all the classes have equal
proportion.
178
Entropy Examples
Example:
10 records have class A
20 records have class B
30 records have class C
40 records have class D
Entropy = −[(.1 log2 .1) + (.2 log2 .2) + (.3 log2 .3) + (.4 log2 .4)]
Entropy = 1.846
179
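A minimal sketch that reproduces the calculation above; `entropy` takes the list of per-class record counts:

```python
import math

def entropy(counts):
    """H(D) = -sum_i p_i log2(p_i), with 0*log(0) taken as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([10, 20, 30, 40]))   # -> 1.846... (the slide's example)
print(entropy([25, 25, 25, 25]))   # -> 2.0 = log2(4), the maximum for m = 4
```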
Splitting Criterion
Example:
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
Records with A=0: 48+, 2-
Records with A=1: 2+, 48-
Records with B=0: 26+, 24-
Records with B=1: 24+, 26-
Splitting on A is better than splitting on B
A does a good job of separating +s and -s
B does a poor job of separating +s and -s
180
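A small sketch quantifying this with information gain (defined formally on the next slide); the counts are those of the example above:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = [50, 50]                              # 50 +s and 50 -s
print(info_gain(parent, [[48, 2], [2, 48]]))   # split on A -> about 0.76
print(info_gain(parent, [[26, 24], [24, 26]])) # split on B -> about 0.001
```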
Which Attribute is the Best Classifier?
Information Gain
The expected information needed to classify a tuple in D is

    Info(D) = − Σ_{i=1}^{m} p_i log2(p_i) = Entropy = H(D)

How much more information would we still need (after partitioning at
attribute A) to arrive at an exact classification? This amount is
measured by

    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j) = H(D, A)

    InfoGain(D, A) = H(D) − H(D, A)

In general, we write Gain(D, A), where D is the collection of examples
and A is an attribute.
181
Entropy (for two classes, i.e. m = 2)
[Figure: H(D) = −p log2(p) − (1−p) log2(1−p) plotted against p; H is 0
when p = 0 or p = 1 and maximal (1 bit) at p = 0.5.]
182
Information Gain
Gain of an attribute split: compare the impurity of the parent node
with the weighted average impurity of the child nodes:

    InfoGain(D, A) = H(D) − H(D, A)
                   = H(parent) − Σ_{j=1}^{v} (|D_j| / |D|) × H(D_j)

Maximizing the gain ⇔ minimizing the weighted average impurity measure
of the child nodes.
183
DECISION TREES
Which Attribute is the Best Classifier?: Information Gain
184
DECISION TREES
185
DECISION TREES
Which Attribute is the Best Classifier?: Information Gain
The information gain obtained by separating the examples according to
the attribute Wind is calculated as:

    Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) × Entropy(S_v)
                  = Entropy(S) − (8/14) Entropy(S_Weak) − (6/14) Entropy(S_Strong)
186
DECISION TREES
Which Attribute is the Best Classifier?: Information Gain
We calculate the Info Gain for each attribute and select
the attribute having the highest Info Gain
187
DECISION TREES
Example
189
DECISION TREES
Example
191
DECISION TREES
Example
192
DECISION TREES
From Decision Trees to Rules
Next Step: Make rules from the decision tree
Example
A B A∧ B
F F F
T F F
F T F
T T T
194
Lecture 07
CS-13410
Introduction to Machine Learning
(Decision Trees – Ch # 3 by Tom Mitchell)
by
Mudasser Naseer
Measuring Node Impurity
p(i|t): fraction of records associated with node t belonging to class i

    Entropy(t) = − Σ_{i=1}^{c} p(i|t) log2 p(i|t)
      • Used in ID3 and C4.5

    Gini(t) = 1 − Σ_{i=1}^{c} [p(i|t)]^2

    Error(t) = 1 − max_i p(i|t)      (used in the example below)
196
Example
Node counts: C1 = 0, C2 = 6
P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
Gini = 1 − (P(C1))^2 − (P(C2))^2 = 1 − 0 − 1 = 0
Entropy = − 0 log2(0) − 1 log2(1) = − 0 − 0 = 0
Error = 1 − max(0, 1) = 1 − 1 = 0
198
Splitting Based on GINI
When a node p is split into k partitions (children), the quality of
the split is computed as

    GINI_split = Σ_{i=1}^{k} (n_i / n) × GINI(i)

where n_i is the number of records at child i and n is the number of
records at node p.
199
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighing partitions:
  – Larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500
Split on B?  ->  Node N1 (Yes): C1 = 5, C2 = 2;  Node N2 (No): C1 = 1, C2 = 4

Gini(N1) = 1 − (5/7)^2 − (2/7)^2 = 0.408
Gini(N2) = 1 − (1/5)^2 − (4/5)^2 = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
This is the quality of the split for variable B.
200
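A small sketch reproducing these numbers:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    n = sum(sum(ch) for ch in children)
    return sum(sum(ch) / n * gini(ch) for ch in children)

print(gini([6, 6]))                   # parent: 0.500
print(gini([5, 2]), gini([1, 4]))     # N1: 0.408..., N2: 0.320
print(gini_split([[5, 2], [1, 4]]))   # quality of the split on B: 0.371...
```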
Categorical Attributes
201
Continuous Attributes
Use binary decisions based on one value, e.g. Taxable Income <= v.
Considering every value of every record is computationally
inefficient: it repeats work.

Continuous Attributes
For efficient computation, for each attribute:
  Sort the attribute on values
  Linearly scan these values, each time updating the count matrix and
  computing the impurity
  Choose the split position that has the least impurity

[Table: class counts (Yes/No) on either side of each candidate split
of Taxable Income, with the resulting Gini values
0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420;
the best split (Gini = 0.300) falls between 95K and 100K.]
203
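A minimal sketch of the sort-and-scan procedure. The income values and Cheat labels below are assumptions chosen to be consistent with the Gini row above; the best candidate split lands at Gini = 0.300:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    data = sorted(zip(values, labels))
    classes = sorted(set(labels))
    below = {c: 0 for c in classes}
    above = {c: labels.count(c) for c in classes}
    n, best = len(data), (float("inf"), None)
    for i in range(len(data) - 1):
        v, c = data[i]
        below[c] += 1; above[c] -= 1              # update the count matrix
        split = (v + data[i + 1][0]) / 2          # midpoint candidate position
        k = i + 1
        g = (k / n) * gini(list(below.values())) \
            + ((n - k) / n) * gini(list(above.values()))
        best = min(best, (g, split))
    return best                                   # (least impurity, split value)

# Hypothetical income values with Cheat labels:
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(incomes, cheat))                 # -> about (0.300, 97.5)
```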
Splitting based on impurity
Impurity measures favor attributes with
large number of categories
204
Gain Ratio
The information gain measure tends to prefer attributes with large
numbers of possible categories.
Gain ratio: a modification of the information gain that reduces its
bias toward high-branch attributes.
Gain ratio should be
  Large when data is evenly spread
  Small when all data belong to one branch
Gain ratio takes the number and size of branches into account when
choosing an attribute.
It corrects the information gain by taking the intrinsic information
of a split into account:

    SplitInfo_A(D) = − Σ_{i=1}^{v} (|D_i| / |D|) × log2(|D_i| / |D|)

(or with S in place of D)

    GainRatio(D, A) = Gain(D, A) / SplitInfo_A(D)
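A small sketch of the gain ratio, shown on an ID-code-like attribute that puts every record in its own branch (a hypothetical 9+/5- data set):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_info(children):
    n = sum(sum(ch) for ch in children)
    return -sum(sum(ch) / n * math.log2(sum(ch) / n)
                for ch in children if sum(ch))

def gain_ratio(parent, children):
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    return gain / split_info(children)

parent = [9, 5]
ids = [[1, 0]] * 9 + [[0, 1]] * 5       # one branch per record
print(gain_ratio(parent, ids))          # gain is maximal (0.940), but
                                        # SplitINFO = log2(14) shrinks it to ~0.247
```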
Gain Ratio
Adjusts the information gain by the entropy of the partitioning
(SplitINFO). Higher-entropy partitioning (a large number of small
partitions) is penalized!
Used in C4.5
Designed to overcome the disadvantage of information gain
Example (Play tennis):
[Figure: gain ratio worked out on the Play-tennis attributes.]
More on the gain ratio
“Outlook” still comes out top
However: “ID code” has greater gain ratio
Standard fix: In particular applications we can
use an ad hoc test to prevent splitting on that
type of attribute
Problem with gain ratio: it may overcompensate
May choose an attribute just because its intrinsic
information is very low
Standard fix:
• First, only consider attributes with greater than average
information gain
• Then, compare them on gain ratio
207
Comparing Attribute Selection Measures
The three measures, in general, return good
results but
Information Gain
Biased towards multivalued attributes
Gain Ratio
Tends to prefer unbalanced splits in which one
partition is much smaller than the other
Gini Index
Biased towards multivalued attributes
Has difficulties when the number of classes is
large
Tends to favor tests that result in equal-sized
partitions and purity in both partitions
208
Stopping Criteria for Tree Induction
209
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets
210
Example: C4.5
Simple depth-first construction.
Uses Information Gain
Sorts Continuous Attributes at each
node.
Needs entire data to fit in memory.
Unsuitable for large datasets: needs out-of-core sorting.
Evaluation
212
Underfitting and Overfitting
[Figure: training and test error versus model complexity.]
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, it models the details of the
training set and fails on the test set.
213
Overfitting due to Noise
215
How to Address Overfitting:
Tree Pruning
Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features
  (e.g., using the χ2 test)
• Stop if expanding the current node does not improve impurity measures (e.g.,
  Gini or information gain) or the improvement falls below a threshold value.
Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the subset tuples.
Problem: difficult to choose an appropriate threshold.
216
How to Address Overfitting…
Post-pruning
Grow decision tree to its entirety
Trim the nodes of the decision tree in a
bottom-up fashion
If generalization error improves after
trimming, replace sub-tree by a leaf node.
Class label of leaf node is determined from
majority class of instances in the sub-tree
217
Prune the Tree OR Prune the Rule
In order to reduce the complexity of the decision procedure we have
two options:
(i) either we can prune the tree first and then develop the rules, or
(ii) we can develop the rules and then prune the rules.
Which is better?
(ii) is better. Why? Each rule can be pruned independently of the
others, whereas removing a node from the tree affects every rule that
passes through it.
218
CS-13410
Introduction to Machine Learning
Lecture # 08
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

a: TP (true positive)     b: FN (false negative)
c: FP (false positive)    d: TN (true negative)

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
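A minimal sketch of the accuracy computation; the counts are hypothetical:

```python
def accuracy(tp, fn, fp, tn):
    """(a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=50, fn=10, fp=5, tn=35))   # -> 0.85
```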
@data
Sunny, 85, 85, FALSE, no
Sunny, 80, 90, TRUE, no
Overcast, 83, 86, FALSE, yes
Rainy, 70, 96, FALSE, yes
...
Numeric attribute and Missing Value
@relation weather
@data
Sunny, 85, 85, FALSE, no
Sunny, 80, 90, TRUE, no
Overcast, 83, 86, FALSE, ?
Rainy, 70, 96, ?, yes
...
Explorer: building “classifiers”
Classifiers in WEKA are models for
predicting nominal or numeric quantities
Implemented learning schemes include:
Decision trees and lists, instance-based
classifiers, support vector machines, multi-
layer perceptrons, logistic regression, Bayes’
nets, …
Explorer: clustering data
WEKA contains “clusterers” for finding
groups of similar instances in a dataset
Implemented schemes are:
k-Means, EM, Cobweb, X-means,
FarthestFirst
Clusters can be visualized
Evaluation based on loglikelihood if
clustering scheme produces a probability
distribution
Explorer: finding associations
WEKA contains an implementation of the Apriori algorithm for learning
association rules
Works only with discrete data
Can identify statistical dependencies between groups of attributes:
  milk, butter -> bread, eggs (with confidence 0.9)
Apriori can compute all rules that have a given minimum support and
exceed a given confidence
Explorer: attribute selection
Panel that can be used to investigate
which (subsets of) attributes are the most
predictive ones
Attribute selection methods contain two
parts:
A search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
Very flexible: WEKA allows (almost)
arbitrary combinations of these two
Explorer: data visualization
Visualization very useful in practice: e.g.
helps to determine difficulty of the learning
problem
WEKA can visualize single attributes (1-d)
and pairs of attributes (2-d)
To do: rotating 3-d visualizations (Xgobi-style)
Color-coded class values
“Jitter” option to deal with nominal attributes
(and to detect “hidden” data points)
Performing experiments
Experimenter makes it easy to compare the performance of different
learning schemes
Evaluation options: cross-validation, learning curve
Resources:
WEKA is available at http://www.cs.waikato.ac.nz/ml/weka
Tutorial: http://prdownloads.sourceforge.net/weka/weka.ppt
CS-13410
Introduction to Machine Learning
(k-Nearest Neighbors)
Lecture # 10
by
Mudasser Naseer
WHY NEAREST NEIGHBOR?
Used to classify objects based on closest
training examples in the feature space
Feature space: raw data transformed into sample
vectors of fixed length using feature extraction
(Training Data)
Listed among the top 10 data mining algorithms (ICDM paper, December 2007)
Among the simplest of all data mining algorithms
A classification method
An implementation of a lazy learner
320
k NEAREST NEIGHBOR
[Figure: a query point "?" surrounded by square and triangle training
points.]
k = 1: belongs to square class
k = 3: belongs to triangle class
k = 7: belongs to square class
322
Distance formula
323
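A minimal sketch of the Euclidean distance used throughout this section:

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) over equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean([1, 2], [4, 6]))   # -> 5.0
```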
Incomparable ranges
324
Example revisited
325
Non-numeric data
Feature values are not always numbers
Example
Boolean values: Yes or no, presence or
absence of an attribute
Categories: Colors, educational attainment,
gender
How do these values factor into the
computation of distance?
326
Dealing with non-numeric data
Boolean values => convert to 0 or 1
Applies to yes-no/presence-absence attributes
Non-binary characterizations
Use natural progression when applicable; e.g.,
educational attainment: JS, HS, College, MS, PHD
=> 1,2,3,4,5
Assign arbitrary numbers but be careful about
distances; e.g., color: red, yellow, blue => 1,2,3
How about unavailable data?
(0 value not always the answer)
327
Preprocessing your dataset
Dataset may need to be preprocessed
to ensure more reliable data mining
results
Conversion of non-numeric data to
numeric data
Calibration of numeric data to reduce
effects of disparate ranges
Particularly when using the Euclidean
distance metric
328
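A small sketch of min-max rescaling, one common way to calibrate disparate ranges (the exact calibration method is an assumption here); the example values are illustrative:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale a numeric attribute onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

heights = [48, 60, 72]                 # inches: a range of about 2 feet
incomes = [10_000, 250_000, 500_000]   # dollars: a range of about $490k
print(min_max(heights))                # -> [0.0, 0.5, 1.0]
print(min_max(incomes))                # both attributes now weigh equally
```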
k-NN variations
Value of k
Larger k increases confidence in prediction
Note that if k is too large, decision may be
skewed
Weighted evaluation of nearest neighbors
Plain majority may unfairly skew decision
Revise algorithm so that closer neighbors have
greater “vote weight”
Other distance measures
329
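A minimal sketch combining these variations: k as a parameter and inverse-distance vote weights (one of several possible weighting schemes); the training points are illustrative:

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=3, weighted=True):
    """train: list of (feature_vector, label) pairs."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

train = [((0, 0), "square"), ((1, 0), "square"), ((0, 1), "triangle"),
         ((2, 2), "triangle"), ((3, 3), "triangle")]
print(knn_predict(train, (0.4, 0.4), k=3))   # closer squares outvote the triangle
```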
Types of Attributes
There are different types of attributes
Nominal (Categorical)
• Examples: ID numbers (Student-ID, Customer-ID), eye color,
zip codes
• Binary Attribute is a type of Nominal Attribute with only two
categories
Ordinal (meaningful order or ranking among them)
• Examples: rankings (e.g., taste of potato chips on a scale from
1-10), grades, height in {tall, medium, short}
Numeric
• Interval (measured on a scale of equal-size units)
– Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio-Scaled (with an inherent zero-point)
– Examples: temperature in Kelvin, length, time, counts 330
Properties of Attribute Values
The type of an attribute depends on which of the following properties
it possesses:
  Distinctness:    =  ≠
  Order:           <  >
  Addition:        +  −
  Multiplication:  *  /

Nominal: the values are just different names, i.e., nominal attributes
provide only enough information to distinguish one object from another
(=, ≠).
  Examples: zip codes, employee ID numbers, eye color, gender: {male, female}
  Statistics: mode, entropy, contingency correlation, χ2 test

Ordinal: the values provide enough information to order objects (<, >).
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Statistics: median, percentiles, rank correlation, run tests, sign tests

Interval: the differences between values are meaningful, i.e., a unit
of measurement exists (+, −).
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Statistics: mean, standard deviation, Pearson's correlation, t and F tests

Ratio: both differences and ratios are meaningful (*, /).
  Examples: temperature in Kelvin, monetary quantities, counts, age,
  mass, length, electrical current
  Statistics: geometric mean, harmonic mean, percent variation
332
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and
represented using a finite number of digits.
Continuous attributes are typically represented as floating-
point variables. 334
Non-Euclidean Distances
Proximity Measure for Nominal Attributes
Jaccard measure for binary vectors
Distance measure for Ordinal variables
Cosine measure = angle between vectors from
the origin to the points in question.
Dissimilarity Measure for Mixed type attributes.
Edit distance = number of inserts and deletes to
change one string into another.
335
Proximity Measure for Nominal Attributes
The dissimilarity between two objects i and j having nominal
attributes can be computed based on the ratio of mismatches:

    d(i, j) = (p − m) / p

where m is the number of matches and p is the total number of
attributes, and

    sim(i, j) = 1 − d(i, j) = m / p
Example:
336
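Since the slide's worked example is not reproduced here, a minimal sketch with hypothetical objects:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where m = matches and p = total attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

i = ("code-A", "red")
j = ("code-B", "red")
print(nominal_dissimilarity(i, j))        # -> 0.5
print(1 - nominal_dissimilarity(i, j))    # sim(i, j) = m / p = 0.5
```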
Jaccard Measure
A note about Binary variables first
Symmetric binary variable
• If both states are equally valuable and carry the same weight, that is,
there is no preference on which outcome should be coded as 0 or 1.
• Like “gender” having the states male and female
Asymmetric binary variable
• If the outcomes of the states are not equally important, such as the
positive and negative outcomes of a disease test.
• We should code the rarest one by 1 (e.g., HIV positive), and the other
by 0 (HIV negative).
• Given two asymmetric binary variables, the agreement of two 1s (a
positive match) is then considered more important than that of two 0s
(a negative match).
337
Jaccard Measure
A contingency table for binary data:

                      Object j
                      1       0       sum
  Object i     1      a       b       a + b
               0      c       d       c + d
               sum    a + c   b + d   p

Jaccard coefficient (similarity for asymmetric binary data):
sim(i, j) = a / (a + b + c)
Lecture # 11
by
Mudasser Naseer
Distance for Ordinal variables
The rank of the ordinal variable f for the ith object is r_if, where
variable f has M_f ordered states: r_if ∈ {1, …, M_f}.
Since each ordinal variable can have a different number of states, map
the range of each variable onto [0.0, 1.0] so that each variable has
equal weight. For each value r_if of ordinal variable f, replace it by

    z_if = (r_if − 1) / (M_f − 1)
342
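A minimal sketch of this mapping, using three hypothetical ordered states:

```python
def ordinal_to_numeric(value, states):
    """z_if = (r_if - 1) / (M_f - 1), mapping ranks onto [0.0, 1.0]."""
    r = states.index(value) + 1            # rank, 1..M_f
    return (r - 1) / (len(states) - 1)

states = ["fair", "good", "excellent"]     # M_f = 3 ordered states
print([ordinal_to_numeric(v, states) for v in states])   # -> [0.0, 0.5, 1.0]
```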
Dissimilarity for attributes of Mixed Type
Suppose the data set contains p attributes of mixed type. The
dissimilarity between objects i and j is

    d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)

where the indicator δ_ij^(f) is 1 when attribute f is present for both
objects, and 0 otherwise.
344
Attributes of Mixed Type - Example
Consider the data in the table (one nominal, one ordinal, and one
numeric attribute over four objects).
The dissimilarity matrices for the nominal and ordinal data, computed
using the methods discussed before, are:

    d^(1) (nominal):           d^(2) (ordinal):
    0                          0
    1    0                     1.0   0
    1    1    0                0.5   0.5   0
    0    1    1    0           0     1.0   0.5   0

To compute the dissimilarity matrix for the numeric attribute,
max_h x_h = 64 and min_h x_h = 22. Using the formula from the previous
slide, the dissimilarity matrix is obtained as:

    d^(3) (numeric):
    0
    0.55  0
    0.45  1.00  0
    0.40  0.14  0.86  0
345
Attributes of Mixed Type - Example
The three dissimilarity matrices can now be used to compute the
overall dissimilarity between two objects using the equation

    d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)

The indicator δ_ij^(f) = 1 for each of the three attributes f. We get:

    d:
    0
    0.85  0
    0.65  0.83  0
    0.13  0.71  0.79  0
346
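A minimal sketch reproducing one entry of the overall matrix, d(2, 1) = 0.85, from the three per-attribute dissimilarities:

```python
def mixed_dissimilarity(d_per_attr, indicators):
    """d(i, j) = sum_f delta^(f) d^(f) / sum_f delta^(f)."""
    num = sum(delta * d for delta, d in zip(indicators, d_per_attr))
    return num / sum(indicators)

# Per-attribute dissimilarities between objects 2 and 1
# (nominal, ordinal, numeric), all three attributes present (delta = 1):
print(mixed_dissimilarity([1.0, 1.0, 0.55], [1, 1, 1]))   # -> 0.85
```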
Cosine Measure (similarity)
Think of a point as a vector from the origin (0, 0, …, 0) to its
location. Two points' vectors make an angle, whose cosine is the
normalized dot-product of the vectors. Where x and y are column
vectors of length n:

    sim(x, y) = (x^T · y) / (‖x‖ ‖y‖)
              = Σ_{i=1}^{n} x_i y_i / ( sqrt(Σ_{i=1}^{n} x_i^2) × sqrt(Σ_{i=1}^{n} y_i^2) )

Example:
[Figure: a worked example finding the cosine similarity between two
example vectors x and y.]
347
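A minimal sketch; the example vectors here are illustrative assumptions:

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (2, 4, 5, 8)
y = (5, 2, 8, 7)
print(round(cosine_similarity(x, y), 3))   # -> 0.916
```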
Cosine Measure (similarity)
Cosine similarity is a measure of similarity that can be used to
compare documents. A document can be represented by thousands of
attributes, each recording the frequency of a particular word (such as
keywords) or phrase in the document, called a term-frequency vector.

Edit distance via the longest common subsequence (LCS): for strings x
(|x| = 5) and y (|y| = 6) with LCS(x, y) = bcde,

    D(x, y) = |x| + |y| − 2|LCS(x, y)| = 5 + 6 − 2 × 4 = 3
350
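A minimal sketch of the LCS-based distance; the strings are hypothetical ones whose LCS is "bcde", matching the arithmetic above:

```python
from functools import lru_cache

def lcs_distance(x, y):
    """D(x, y) = |x| + |y| - 2|LCS(x, y)|."""
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == len(x) or j == len(y):
            return 0
        if x[i] == y[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return len(x) + len(y) - 2 * lcs(0, 0)

print(lcs_distance("abcde", "bcdefg"))   # -> 5 + 6 - 2*4 = 3
```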
k-NN Time Complexity
Suppose there are m instances and n
features in the dataset
Nearest neighbor algorithm requires
computing m distances
Each distance computation involves
scanning through each feature value
Running time complexity is proportional to
m×n
351
k NEAREST NEIGHBOR
Accuracy of all NN-based classification, prediction, or
recommendation depends on the data model, no matter what specific NN
algorithm is used.
Scaling issues
  Attributes may have to be scaled to prevent distance measures from
  being dominated by one of the attributes.
  Examples
  • Height of a person may vary from 4' to 6'
  • Weight of a person may vary from 100 lbs to 300 lbs
  • Income of a person may vary from $10k to $500k
Nearest-neighbor classifiers are lazy learners:
  no pre-constructed models for classification
352
k NEAREST NEIGHBOR ADVANTAGES
Simple technique that is easily implemented
Building model is inexpensive
Extremely flexible classification scheme
does not involve preprocessing
Well suited for
Multi-modal classes (classes of multiple forms)
Records with multiple class labels
Asymptotic error rate at most twice the Bayes rate
  (Cover & Hart, 1967)
Can sometimes be the best method
  Michihiro Kuramochi and George Karypis, Gene Classification using
  Expression Profiles: A Feasibility Study, International Journal on
  Artificial Intelligence Tools, Vol. 14, No. 4, pp. 641-660, 2005
  k nearest neighbor outperformed SVM for protein function prediction
  using expression profiles
353
k NEAREST NEIGHBOR DISADVANTAGES
Classifying unknown records is relatively expensive
  Requires distance computation to the k nearest neighbors
  Computationally intensive, especially when the size of the training
  set grows
Accuracy can be severely degraded by the presence of noisy or
irrelevant features
NN classification expects the class conditional probability to be
locally constant
  and suffers from bias in high dimensions
354
How to select the optimal k value?
There are no pre-defined statistical methods to find the most
favorable value of k.
  Initialize a random k value and start computing.
  Choosing a small value of k leads to unstable decision boundaries.
  A substantial k value is better for classification, as it smooths
  the decision boundaries.
  Plot the error rate against k over a defined range, then choose the
  k value that has the minimum error rate.
Another simple approach is to set k = sqrt(n), where n is the number
of training examples.
355
Lecture 12
CS-13410
Introduction to Machine Learning
(Artificial Neural Networks – Ch # 4 by Tom
Mitchell)
by
Mudasser Naseer
• Brain
  • A marvelous piece of architecture and design.
  • In association with a nervous system, it controls the life
    patterns, communications, interactions, growth and development of
    hundreds of millions of life forms.
There are about 10^10 to 10^14 nerve cells (called neurons) in an
adult human brain. Neurons are highly connected with each other: each
nerve cell is connected to hundreds of thousands of other nerve cells.
Passage of information between neurons is slow (in comparison to
transistors in an IC). It takes place in the form of electrochemical
signals between two neurons in milliseconds.
Energy consumption per neuron is low (approximately 10^-6 watts).
Look more like some blobs of ink… aren't they!
[Figure: a biological neuron.]
The cell nucleus, or soma, processes the information received from
dendrites. The axon is a cable that is used by neurons to send
information. The synapse is the connection between an axon of one
neuron and the dendrites of another.
[Figure: a perceptron with inputs x1 … xn, weights w1 … wn, and a bias
input x0 = 1 with weight w0, feeding the weighted sum Σ_{i=0}^{n} w_i x_i.]

    O(x1, …, xn) = 1  if Σ_{i=0}^{n} w_i x_i > 0
                 = −1 otherwise
363
A Neuron (= a perceptron)
[Figure: inputs x0 … xn with weights w0 … wn are summed, a bias θ_j is
applied, and an activation function f maps the weighted sum to the
output y.]
366
Perceptron Training
A single perceptron can be used to represent many boolean functions.

    Output = 1 if Σ_{i=1}^{n} w_i x_i > t
           = 0 otherwise      (where t ≠ 0)

A linear threshold is used.
  w – weight value
  t – threshold value
367
Simple network
[Figure: a two-input network computing AND: inputs X and Y with
weights W1 = 1 and W2 = 1, a bias input of −1 with weight W0 = 1.5,
and threshold t = 0.0.]

    Output = 1 if Σ_{i=0}^{n} w_i x_i > t
           = 0 otherwise
368
OR Function Using A Perceptron
369
XOR Function Using A Perceptron
Notice that the fourth equation contradicts the second and the third
equations. The point is, there are no perceptron solutions for
non-linearly separable data. The key takeaway is that a single
perceptron cannot learn to separate data that are non-linear in nature.
370
Training Perceptron
[Figure: a perceptron with bias input −1 (weight W0 = ?), inputs x
(weight W1 = ?) and y (weight W2 = ?), and threshold t = 0.0.]

For AND:
  A B | Output
  0 0 |   0
  0 1 |   0
  1 0 |   0
  1 1 |   1
371
Training Perceptron
[Figure: the same network with initial weights W0 = 0.3, W1 = 0.5,
W2 = −0.4, and threshold t = 0.0.]

For AND:
  A B | Output
  0 0 |   0
  0 1 |   0
  1 0 |   0
  1 1 |   1

For practice, find the final values of w1 and w2 (see the sketch
below) using

    w_i ← w_i + Δw_i,   Δw_i = η (t − o) x_i,   taking η = 0.1

Also develop perceptrons for OR and NOT.
372
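A minimal training sketch for the AND exercise, starting from the slide's initial weights; x0 = −1 is the bias input:

```python
# Perceptron rule: w_i <- w_i + eta * (t - o) * x_i
def output(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# (x0, A, B) -> target, with x0 = -1 as the bias input
data = [((-1, 0, 0), 0), ((-1, 0, 1), 0), ((-1, 1, 0), 0), ((-1, 1, 1), 1)]
w, eta = [0.3, 0.5, -0.4], 0.1

for epoch in range(100):
    errors = 0
    for x, t in data:
        o = output(w, x)
        if o != t:
            errors += 1
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    if errors == 0:                 # an epoch with no errors: converged
        break
print(w)                            # final weights implementing AND
```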
Learning algorithm
Epoch : Presentation of the entire training set to the neural
network.
In the case of AND function an epoch consists of
four sets of inputs being presented to the network
(i.e. [0,0], [0,1], [1,0], [1,1])
Error: The error value is the amount by which the value
output by the network differs from the target
value. For example, if we required the network to
output 0 and it output a 1, then Error = -1
373
Learning algorithm
Target Value, T : When we are training a network we
not only present it with the input but also with a value
that we require the network to produce. For example,
if we present the network with [1,1] for the AND
function the training value will be 1
Output , O : The output value from the neuron
Ij : Inputs being presented to the neuron
Wj : Weight from input neuron (Ij) to the output neuron
LR : The learning rate. This dictates how quickly the
network converges. It is set by a matter of
experimentation. It is typically 0.1
374
Decision boundaries
375
Linear Separability
[Figure: points of classes A and B in the (X1, X2) plane, separated by
a linear decision boundary.]
376
Decision Surface of a Perceptron
[Figure: in the (x1, x2) plane, a linearly separable arrangement of +
and − points (separable by a single line) versus a non-linearly
separable, XOR-like arrangement.]
378
Different Non-Linearly Separable Problems
[Table: network structure vs. the types of decision regions it can
form, illustrated on the exclusive-OR problem, classes with meshed
regions, and the most general region shapes. A three-layer network can
form arbitrary decision regions, with complexity limited by the number
of nodes.]
379
Multilayer Perceptron (MLP)
[Figure: a feed-forward network with input units X1 … Xn, hidden units
Z1 … Zp, and output units Y1 … Ym producing the output values.
Adjustable weights w_ij connect input unit X_i to hidden unit Z_j, and
weights v_jk connect hidden unit Z_j to output unit Y_k; bias units
(constant 1) feed the hidden and output layers.]
Activation functions
• Transforms neuron’s input into output.
• Features of activation functions:
• A squashing effect is required
• Prevents accelerating growth of activation
levels through the network.
• Simple and easy to calculate
387
Standard activation functions
• The hard-limiting threshold function
– Corresponds to the biological paradigm
• either fires or not
There will be this sudden change in the decision (from 0 to 1) when z value
crosses the threshold (-w0). For most real-world applications we would
expect a smoother decision function which gradually changes from 0 to 1.
Standard activation functions
• Sigmoid functions: the output function is much smoother than the
  step function
  – A mathematical function with a characteristic 'S'-shaped curve,
    also called the sigmoid curve.
  – There are many functions that can do the job, some are shown below:
    [Figure: examples of sigmoid-shaped functions.]
389
Sigmoid function
• One of the simplest to work with is the logistic function.
392
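A minimal sketch of the logistic function and its gradual 0-to-1 transition:

```python
import math

def logistic(z):
    """sigma(z) = 1 / (1 + e^(-z)): a smooth, squashing 'S'-shaped function."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-6, -2, 0, 2, 6):
    print(z, round(logistic(z), 3))   # changes gradually from ~0 to ~1 around z = 0
```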
Back-Propagation
• A training procedure which allows multi-layer
feedforward Neural Networks to be trained;
• Can theoretically perform “any” input-output
mapping;
• Can learn to solve linearly inseparable
problems.
393
Activation functions and training
394
Lecture 13
CS-13410
Introduction to Machine Learning
(Artificial Neural Networks-
Backpropagation)
Basics of an Artificial Neural
Network
An artificial neural network is an information
processing system that has certain performance
characteristics in common with biological neural
networks.
An ANN can be characterized by:
1. Architecture: The pattern of connections between
different neurons.
2. Training or Learning Algorithms: The method of
determining weights on the connections.
3. Activation Function: The nature of function used by a
neuron to become activated.
Back-Propagation
• A training procedure which allows multi-layer
feedforward Neural Networks to be trained;
• Can theoretically perform “any” input-output
mapping;
• Can learn to solve linearly inseparable
problems.
397
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop
and test computational analogues of neurons
A neural network: A set of connected input/output units
where each connection has a weight associated with it
During the learning phase, the network learns by
adjusting the weights so as to be able to predict the
correct class label of the input tuples
Also referred to as connectionist learning due to the
connections between units
398
A Multi-Layer Feed-Forward Neural Network
404
Example (Cont…)
405
Terminating condition:
Training stops when
All Δwij in the previous epoch are so small as
to be below some specified threshold, or
The percentage of tuples misclassified in the
previous epoch is below some threshold, or
A prespecified number of epochs has expired.
In practice, several hundreds of thousands of
epochs may be required before the weights
will converge.
406
Backpropagation and Interpretability
Efficiency of backpropagation: Each epoch (one iteration through the
training set) takes O(|D| * w), with |D| tuples and w weights, but # of
epochs can be exponential in n, the number of inputs, in the worst
case
Rule extraction from networks: network pruning
Simplify the network structure by removing weighted links that
have the least effect on the trained network
Then perform link, unit, or activation value clustering
The set of input and activation values are studied to derive rules
describing the relationship between the input and hidden unit
layers
Sensitivity analysis: assess the impact that a given input variable has
on a network output. The knowledge gained from this analysis can be
represented in rules
407
Neural Network as a Classifier
Weakness
Long training time
Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure."
Poor interpretability: Difficult to interpret the symbolic meaning
behind the learned weights and of “hidden units" in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of
rules from trained neural networks
408
Lecture 13
CS-13410
Introduction to Machine Learning
(Gradient Descent and ANN-
Backpropagation)
Gradient Descent Method
[Figure: a one-dimensional error curve over x, showing an inflection
point, a local minimum, and the global minimum.]
Gradient Descent Learning Rule
Consider a linear unit without threshold and with continuous output o
(not just −1, 1):

    o = w0 + w1 x1 + … + wn xn

Train the wi's such that they minimize the squared error

    E[w1, …, wn] = ½ Σ_{d∈D} (t_d − o_d)^2

where D is the set of training examples.
412
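A minimal sketch of batch gradient descent for such a linear unit, run on the four training examples D listed on the next slide; for this XOR-like data the best linear fit is the constant 0, so the weights shrink toward zero:

```python
# Gradient descent on E = 1/2 * sum_d (t_d - o_d)^2 for o = w0 + w1*x1 + w2*x2.
D = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
w = [0.2, -0.3, 0.5]          # arbitrary starting weights (w0, w1, w2)
eta = 0.05

for step in range(200):
    grad = [0.0, 0.0, 0.0]
    for (x1, x2), t in D:
        o = w[0] + w[1] * x1 + w[2] * x2
        err = t - o
        # dE/dw_i = -sum_d (t_d - o_d) * x_i, with x_0 = 1 for the bias
        for i, xi in enumerate((1, x1, x2)):
            grad[i] += -err * xi
    w = [wi - eta * gi for wi, gi in zip(w, grad)]
print(w)                       # approaches the least-squares weights (~0, ~0, ~0)
```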
Gradient Descent (used in Delta Rule)
Training examples:
    D = {<(1, 1), 1>, <(−1, −1), 1>, <(1, −1), −1>, <(−1, 1), −1>}
[Figure: the error surface E over the weight space (w1, w2).]
Gradient:
    ∇E[w] = [∂E/∂w0, …, ∂E/∂wn]
415
Gradient Descent Algorithm and
Its Variants
416
Comparison of the Perceptron and Gradient Descent Rules
The perceptron learning rule is guaranteed to succeed if
  the training examples are linearly separable, and
  the learning rate is sufficiently small.
417
Expressive Capabilities of ANN
Boolean functions
Every boolean function can be represented by
network with single hidden layer
But might require exponential (in number of inputs)
hidden units
Continuous functions
Every bounded continuous function can be
approximated with arbitrarily small error, by network
with one hidden layer [Cybenko 1989, Hornik 1989]
Any function can be approximated to arbitrary
accuracy by a network with two hidden layers
[Cybenko 1988]
418
Applications
• The properties of neural networks define
where they are useful.
– Can learn complex mappings from inputs to
outputs, based solely on samples
– Difficult to analyse: firm predictions about neural network
  behaviour are difficult;
  • Unsuitable for safety-critical applications.
– Require limited understanding from trainer, who
can be guided by heuristics.
419
Neural network for OCR
[Figure: a feedforward network trained using back-propagation, with an
input layer of pixel values, a hidden layer, and an output layer with
one unit per character (A, B, C, D, E).]
420
OCR for 8x10 characters
[Figure: characters drawn on an 8 x 10 pixel grid.]
NNs are able to generalise
  learning involves generating a partitioning of the input space
  for a single-layer network the input space must be linearly separable
  what is the dimension of this input space?
  how many points are in the input space?
  this network is binary (uses binary values)
  networks may also be continuous
421
Engine management
422
ALVINN
Drives 70 mph on a public highway
[Figure: the ALVINN network: a 30x32 pixel image as input, 4 hidden
units (30x32 weights into each hidden unit), and 30 output units for
steering.]
423
Signature recognition
424
Sonar target recognition
425
Stock market prediction
426
Mortgage assessment
427
Neural Network Problems
428
Parameter setting
• Number of layers
• Number of neurons
  • too many neurons require more training time
• Learning rate
  • from experience, the value should be small, ~0.1
• Momentum term
429
Over-fitting
430
Training time
431
Literature & Resources
A very good book:
”Neural Networks for Pattern Recognition”, Bishop, C.M.,
1996
Software:
Neural Networks for Face Recognition
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html
SNNS Stuttgart Neural Networks Simulator
http://www-ra.informatik.uni-tuebingen.de/SNNS
Neural Networks at your fingertips
http://www.geocities.com/CapeCanaveral/1624/
http://www.stats.gla.ac.uk/~ernest/files/NeuralAppl.html
432