Lecture 1
Machine Learning
Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square, x4-8845
http://www.cs.pitt.edu/~milos/courses/cs2750/
Administration
Study material
• Handouts, your notes and course readings
• Primary textbook:
– Duda, Hart, Stork. Pattern Classification, 2nd edition. J. Wiley and Sons, 2000.
• Recommended book:
– Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2001.
• Other books:
– C. Bishop. Neural Networks for Pattern Recognition. Oxford U. Press, 1996.
– T. Mitchell. Machine Learning. McGraw Hill, 1997.
– J. Han, M. Kamber. Data Mining. Morgan Kaufmann, 2001.
Administration
• Lectures:
– Random short quizzes testing the understanding of
basic concepts from previous lectures
• Homeworks: weekly
– Programming tool: Matlab (CSSD machines and labs)
– Matlab Tutorial: next week
• Exams:
– Midterm (March)
• Final project:
– Proposals (early March)
– Written report + Oral presentation
(end of the semester)
Tentative topics
• Concept learning.
• Density estimation.
• Linear models for regression and classification.
• Multi-layer neural networks.
• Support vector machines. Kernel methods.
• Learning Bayesian networks.
• Clustering. Latent variable models.
• Dimensionality reduction. Feature extraction.
• Ensemble methods. Mixture models. Bagging and boosting.
• Hidden Markov models.
• Reinforcement learning.
Machine Learning
• The field of machine learning studies the design of computer
programs (agents) capable of learning from past experience or
adapting to changes in the environment
Learning
Learning process:
Learner (a computer program) processes data D representing
past experiences and tries either to develop an appropriate
response to future data, or to describe the data seen in some
meaningful way
Example:
Learner sees a set of patient cases (patient records) with
corresponding diagnoses. It can try either:
– to predict the presence of a disease for future patients
– to describe the dependencies between diseases and symptoms
Types of learning
• Supervised learning
– Learning a mapping between input x and desired output y
– A teacher gives us the y’s for learning purposes
• Unsupervised learning
– Learning relations between data components
– No specific outputs given by a teacher
• Reinforcement learning
– Learning a mapping between input x and desired output y
– A critic does not give us the y’s, but instead a signal
(reinforcement) telling us how good our answer was
• Other types of learning:
– explanation-based learning, etc.
Supervised learning
Data: D = {d_1, d_2, …, d_n} is a set of n examples, where
d_i = <x_i, y_i>
x_i is the input vector and y_i is the desired output (given by a teacher)
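A minimal Matlab sketch of this representation (the sizes, the matrix X holding one input vector per row, and the linear "teacher" are illustrative assumptions, not part of the lecture):

    % n examples: inputs X (one x_i per row), desired outputs y
    n = 100; d = 3;                    % illustrative sizes
    X = randn(n, d);                   % input vectors x_i
    w = [1; -2; 0.5];                  % hypothetical "teacher"
    y = X * w + 0.1 * randn(n, 1);     % desired outputs y_i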
Supervised learning examples
• Regression: Y is continuous
[Figure: inputs such as debt/equity and earnings mapped to outputs such as a company stock price or future product orders]
• Classification: Y is discrete
[Figure: a handwritten digit image mapped to the label “3”]
Unsupervised learning
• Data: D = {d_1, d_2, …, d_n}, where
d_i = x_i is a vector of values
No target value (output) y
• Objective:
– learn relations between samples or components of samples
Types of problems:
• Clustering
– Group together “similar” examples, e.g. patient cases
• Density estimation
– Model the population of samples probabilistically
Unsupervised learning example.
• Density estimation. We want to build a probability model of
the population from which we draw samples d_i = x_i
[Figure: scatter plot of two-dimensional samples drawn from the population]
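A minimal Matlab sketch of the idea in one dimension, fitting a Gaussian density model by estimating its mean and standard deviation (the sample itself is illustrative):

    % fit a 1-D Gaussian density model to samples x
    x = 0.5 + 1.5 * randn(200, 1);       % illustrative sample
    mu = mean(x);                        % estimated mean
    sigma = std(x);                      % estimated standard deviation
    t = linspace(min(x), max(x), 100);   % evaluation grid
    p = exp(-(t - mu).^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi));
    plot(t, p)                           % the fitted density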
Reinforcement learning
• We want to learn: f : X → Y
• We see samples of x but not y
• Instead of y, we get feedback (reinforcement) from a critic
about how good our output was
[Figure: learner receiving a reinforcement signal from a critic]
Learning
• Assume we see examples of pairs (x , y) and we want to learn
the mapping f : X → Y to predict future y’s for new values of x
• We get the data. What should we do?
[Figure: scatter plot of the observed (x, y) pairs]
Learning bias
• Problem: many possible functions f : X → Y exist for
representing the mapping between x and y
• Which one to choose? Many examples still unseen!
[Figure: several candidate functions fitting the same data points]
Learning bias
• The problem is easier when we make an assumption about the
model, say, f(x) = ax + b + ε, where
ε ~ N(0, σ) is random (normally distributed) noise
• Restriction to a linear model is an example of learning bias
[Figure: linear functions fit to the data points]
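A minimal Matlab sketch of sampling from this assumed model (the parameter values are illustrative):

    % generate data from y = a*x + b + noise
    a = 3; b = 1; sigma = 1;                  % illustrative parameters
    x = linspace(-2.5, 2, 50)';               % inputs
    y = a * x + b + sigma * randn(size(x));   % noisy outputs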
Learning bias
• Bias provides the learner with some basis for choosing among
possible representations of the function.
• Forms of bias: constraints, restrictions, model preferences
• Important: There is no learning without a bias!
[Figure: data points and candidate model fits]
Learning bias
• Choosing a parametric model or a set of models is not enough.
There are still too many functions f(x) = ax + b + ε, ε ~ N(0, σ)
– One for every pair of parameters a, b
[Figure: a family of lines, one for each parameter pair (a, b)]
Fitting the data to the model
• We are interested in finding the best set of model parameters
Objective: Find the set of parameters that:
• reduces the misfit between the model and observed data
• or, in other words, explains the data best
Error function:
Measure of misfit between the data and the model
• Examples of error functions:
– Average squared error: (1/n) ∑_{i=1..n} (y_i − f(x_i))²
[Figure: data points and a candidate model fit]
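In Matlab, the average squared error of a candidate model f on data (x, y) is one line; the data and the candidate model below are illustrative:

    x = linspace(-2.5, 2, 50)';          % inputs
    y = 3 * x + 1 + randn(size(x));      % observed outputs
    f = @(x) 3 * x + 1;                  % a candidate model
    err = mean((y - f(x)).^2);           % average squared error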
Typical learning
Three basic steps:
• Select a model or a set of models (with parameters)
E.g. y = ax + b + ε, ε ~ N(0, σ)
• Select the error function to be optimized
E.g. the average squared error (1/n) ∑_{i=1..n} (y_i − f(x_i))²
• Find the set of parameters optimizing the error function
– The model and parameters with the smallest error represent
the best fit of the model to the data
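For the linear model above, the three steps can be sketched in Matlab with the built-in polyfit, which finds the least-squares (smallest squared error) coefficients; the training data is illustrative:

    % 1. model: y = a*x + b (a degree-1 polynomial)
    % 2. error: average squared error
    % 3. optimization: polyfit returns the least-squares coefficients
    x = linspace(-2.5, 2, 50)';
    y = 3 * x + 1 + randn(size(x));            % illustrative training data
    p = polyfit(x, y, 1);                      % p(1) estimates a, p(2) estimates b
    trainErr = mean((y - polyval(p, x)).^2);   % error of the best fit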
Learning
Problem
• We fit the model based on past experience (past examples seen)
• But ultimately we are interested in learning the mapping that
performs well on the whole population of examples
Training data: Data used to fit the parameters of the model
Training error: (1/n) ∑_{i=1..n} (y_i − f(x_i))²
Overfitting
• Assume we have a set of 10 points and we consider
polynomial functions as our possible models
[Figure: the 10 data points to be modeled]
Overfitting
• Fitting a linear function with the squared error criterion
• The error is nonzero
[Figure: linear fit to the 10 data points]
Overfitting
• Linear vs. cubic polynomial
• A higher-order polynomial leads to a better fit and a smaller error
[Figure: linear and cubic polynomial fits to the same points]
Overfitting
[Figure: higher-order polynomial fit to the same points]
Overfitting
• For 10 data points, a degree-9 polynomial gives a perfect fit
(Lagrange interpolation). The training error is zero.
• Is it always good to minimize the training error? NO!!
• More important: how do we perform on unseen data?
[Figure: degree-9 polynomial passing exactly through all 10 points]
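A minimal Matlab sketch of this effect on 10 illustrative noisy points: the degree-9 fit drives the training error to (numerically) zero, while the linear fit does not:

    x = linspace(-2, 2, 10)';
    y = x + randn(10, 1);                      % illustrative data
    p1 = polyfit(x, y, 1);                     % linear fit
    p9 = polyfit(x, y, 9);                     % interpolates all 10 points
    err1 = mean((y - polyval(p1, x)).^2);      % nonzero training error
    err9 = mean((y - polyval(p9, x)).^2);      % (numerically) zero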
Overfitting
A situation in which the training error is low but the generalization
error is high. Causes of the phenomenon:
• A model with a large number of parameters (degrees of freedom)
• A small data size (compared to the complexity of the model)
[Figure: an overfitted model on the training points]
How to evaluate the learner’s performance?
• Generalization error is the true error on the population of
examples, the quantity we would like to optimize:
E_{(x,y)}[(y − f(x))²]
• Sample mean only approximates it
• How to measure the generalization error?
• Two ways:
– Theoretical: Law of large numbers
• statistical bounds on the difference between true and
sample mean errors
– Practical: Use a separate data set with m data samples to
test the model
• (Mean) test error: (1/m) ∑_{j=1..m} (y_j − f(x_j))²
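A minimal Matlab sketch of the practical approach, assuming separate training and test samples (all data here is illustrative):

    xTrain = linspace(-2, 2, 50)'; yTrain = xTrain + randn(50, 1);
    xTest = linspace(-2, 2, 20)';  yTest = xTest + randn(20, 1);
    p = polyfit(xTrain, yTrain, 1);                  % fit on training data only
    testErr = mean((yTest - polyval(p, xTest)).^2);  % mean test error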
Solutions for overfitting
How to make the learner avoid overfitting?
• Assure a sufficient number of samples in the training set
– May not be possible
• Hold some data out of the training set = validation set
– Train (fit) on the training set (without the held-out data)
– Check the generalization error on the validation set and
choose the model based on the validation-set error
(cross-validation techniques; see the sketch after this list)
• Regularization (Occam’s Razor)
– Penalize model complexity (number of parameters)
– Explicit preference towards simple models
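A minimal Matlab sketch of the hold-out approach, choosing a polynomial degree by its validation-set error (the data, the split sizes, and the candidate degrees are illustrative):

    x = linspace(-2, 2, 40)'; y = x + randn(40, 1);  % illustrative data
    idx = randperm(40);                              % random split
    xTr = x(idx(1:30));   yTr = y(idx(1:30));        % training set
    xVal = x(idx(31:40)); yVal = y(idx(31:40));      % validation set
    best = Inf;
    for deg = 1:5
        p = polyfit(xTr, yTr, deg);
        e = mean((yVal - polyval(p, xVal)).^2);      % validation error
        if e < best, best = e; bestDeg = deg; end
    end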
Design of a learning system
[Diagram: Data → Model selection → Learning → Application or Testing]
1. Data: D = {d_1, d_2, …, d_n}
2. Model selection:
• Select a model or a set of models (with parameters)
E.g. y = ax + b + ε, ε ~ N(0, σ)
• Select the error function to be optimized
E.g. (1/n) ∑_{i=1..n} (y_i − f(x_i))²
3. Learning:
• Find the set of parameters optimizing the error function
– The model and parameters with the smallest error
4. Application:
• Apply the learned model
– E.g. predict y’s for new inputs x using the learned f(x)
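The four steps, sketched end-to-end in Matlab (the data and the linear model are illustrative, as before):

    x = linspace(-2, 2, 50)'; y = 2 * x - 1 + randn(50, 1);  % 1. data
    % 2. model selection: y = a*x + b with the average squared error
    p = polyfit(x, y, 1);                                    % 3. learning
    xNew = 3; yPred = polyval(p, xNew);                      % 4. application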