Class1 PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

CS 2750 Machine Learning

Lecture 1

Machine Learning

Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square, x4-8845

http://www.cs.pitt.edu/~milos/courses/cs2750/

CS 2750 Machine Learning

Administration
Study material
• Handouts, your notes and course readings
• Primary textbook:
– Duda, Hart, Stork. Pattern classification. 2nd edition. J
Wiley and Sons, 2000.
• Recommended book:
– Friedman, Hastie, Tibshirani. Elements of statistical
learning. Springer, 2001.
• Other books:
– C. Bishop. Neural networks for pattern recognition. Oxford
U. Press, 1996.
– T. Mitchell. Machine Learning. McGraw Hill, 1997
– J. Han, M. Kamber. Data Mining. Morgan Kauffman, 2001.

CS 2750 Machine Learning

1
Administration
• Lectures:
– Random short quizzes testing the understanding of
basic concepts from previous lectures
• Homeworks: weekly
– Programming tool: Matlab (CSSD machines and labs)
– Matlab Tutorial: next week
• Exams:
– Midterm (March)
• Final project:
– Proposals (early March)
– Written report + Oral presentation
(end of the semester)
CS 2750 Machine Learning

Tentative topics
• Concept learning.
• Density estimation.
• Linear models for regression and classification.
• Multi-layer neural networks.
• Support vector machines. Kernel methods.
• Learning Bayesian networks.
• Clustering. Latent variable models.
• Dimensionality reduction. Feature extraction.
• Ensemble methods. Mixture models. Bagging and boosting.
• Hidden Markov models.
• Reinforcement learning

CS 2750 Machine Learning

2
Machine Learning
• The field of machine learning studies the design of computer
programs (agents) capable of learning from past experience or
adapting to changes in the environment

• The need for building agents capable of learning is everywhere


– predictions in medicine,
– text and web page classification,
– speech recognition,
– image/text retrieval,
– commercial software

CS 2750 Machine Learning

Learning

Learning process:
Learner (a computer program) processes data D representing
past experiences and tries to either develop an appropriate
response to future data, or describe in some meaningful way
the data seen

Example:
Learner sees a set of patient cases (patient records) with
corresponding diagnoses. It can either try:
– to predict the presence of a disease for future patients
– describe the dependencies between diseases, symptoms

CS 2750 Machine Learning

3
Types of learning
• Supervised learning
– Learning mapping between input x and desired output y
– Teacher gives me y’s for the learning purposes
• Unsupervised learning
– Learning relations between data components
– No specific outputs given by a teacher
• Reinforcement learning
– Learning mapping between input x and desired output y
– Critic does not give me y’s but instead a signal
(reinforcement) of how good my answer was
• Other types of learning:
– explanation-based learning, etc.

CS 2750 Machine Learning

Supervised learning
Data: D = {d1 , d 2 ,.., d n } a set of n examples
d i =< x i , yi >
x i is input vector, and y is desired output (given by a teacher)

Objective: learn the mapping f : X → Y


s.t. y i ≈ f ( x i ) for all i = 1,.., n
Two types of problems:
• Regression: X discrete or continuous
Y is continuous
• Classification: X discrete or continuous
Y is discrete

CS 2750 Machine Learning

4
Supervised learning examples
• Regression: Y is continuous

Debt/equity
Earnings company stock price
Future product orders

• Classification: Y is discrete

Label “3”

Handwritten digit (array of 0,1s)


CS 2750 Machine Learning

Unsupervised learning
• Data: D = {d1 , d 2 ,.., d n }
d i = x i vector of values
No target value (output) y

• Objective:
– learn relations between samples, components of samples

Types of problems:
• Clustering
Group together “similar” examples, e.g. patient cases
• Density estimation
– Model probabilistically the population of samples

CS 2750 Machine Learning

5
Unsupervised learning example.
• Density estimation. We want to build the probability model of
a population from which we draw samples d i = x i

2.5

1.5

0.5

-0.5

-1
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

CS 2750 Machine Learning

Unsupervised learning. Density estimation


• A probability density of a point in the two dimensional space
– Model used here: Mixture of Gaussians

CS 2750 Machine Learning

6
Reinforcement learning
• We want to learn: f : X → Y
• We see samples of x but not y
• Instead of y we get a feedback (reinforcement) from a critic
about how good our output was

input sample output


Learner

reinforcement

Critic

• The goal is to select outputs that lead to the best reinforcement


CS 2750 Machine Learning

Learning
• Assume we see examples of pairs (x , y) and we want to learn
the mapping f : X → Y to predict future ys for values of x
• We get the data what should we do?

10

y 6

-2

-4

-6

-8

-10
-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x
CS 2750 Machine Learning

7
Learning bias
• Problem: many possible functions f : X → Y exists for
representing the mapping between x and y
• Which one to choose? Many examples still unseen!

10

y 8

-2

-4

-6

-8

-10
-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x
CS 2750 Machine Learning

Learning bias
• Problem is easier when we make an assumption about the
model, say, f ( x ) = ax + b + ε
ε = N (0, σ ) - random (normally distributed) noise
• Restriction to a linear model is an example of learning bias
10

y 8

-2

-4

-6

-8

-10
-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x
CS 2750 Machine Learning

8
Learning bias
• Bias provides the learner with some basis for choosing among
possible representations of the function.
• Forms of bias: constraints, restrictions, model preferences
• Important: There is no learning without a bias!
10

y 8

-2

-4

-6

-8

-10
-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x
CS 2750 Machine Learning

Learning bias
• Choosing a parametric model or a set of models is not enough
Still too many functions f ( x ) = ax + b + ε ε = N (0, σ )
– One for every pair of parameters a, b

y
10

-2

-4

-6

-8

-10
-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x
CS 2750 Machine Learning

9
Fitting the data to the model
• We are interested in finding the best set of model parameters
Objective: Find the set of parameters that:
• reduces the misfit between the model and observed data
• Or, (in other words) that explain the data the best
Error function:
Measure of misfit between the data and the model
• Examples of error functions:
– Average square error 1 n

n i =1
( y i − f ( x i )) 2

– Average misclassification error 1 n


∑ 1 y ≠ f ( xi )
n i =1 i
Average # of misclassified cases

CS 2750 Machine Learning

Fitting the data to the model


• Linear regression
– Least squares fit with the linear model
1 n
– minimizes ∑
n i =1
( y i − f ( x i )) 2
10

y 8

-2

-4

-6

-8

-10
-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x
CS 2750 Machine Learning

10
Typical learning
Three basic steps:
• Select a model or a set of models (with parameters)
E.g. y = ax + b + ε ε = N (0, σ )
• Select the error function to be optimized
E.g. 1 n
∑ ( y i − f ( xi )) 2
n i =1
• Find the set of parameters optimizing the error function
– The model and parameters with the smallest error represent
the best fit of the model to the data

But there are problems one must be careful about …

CS 2750 Machine Learning

Learning
Problem
• We fit the model based on past experience (past examples seen)
• But ultimately we are interested in learning the mapping that
performs well on the whole population of examples
Training data: Data used to fit the parameters of the model
Training error: 1 n

n i =1
( y i − f ( x i )) 2

True (generalization) error (over the whole unknown


population):
E ( x , y ) [( y − f ( x )) 2 ] Mean squared error
Training error tries to approximate the true error !!!!
Does a good training error imply a good generalization error ?

CS 2750 Machine Learning

11
Overfitting
• Assume we have a set of 10 points and we consider
polynomial functions as our possible models

10

-2

-4

-6

-8
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

CS 2750 Machine Learning

Overfitting
• Fitting a linear function with the square error
• Error is nonzero
12

10

-2

-4

-6

-8
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

CS 2750 Machine Learning

12
Overfitting
• Linear vs. cubic polynomial
• Higher order polynomial leads to a better fit, smaller error
12

10

-2

-4

-6

-8
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

CS 2750 Machine Learning

Overfitting

• Is it always good to minimize the error of the observed data?

12

10

-2

-4

-6

-8
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

CS 2750 Machine Learning

13
Overfitting
• For 10 data points, the degree 9 polynomial gives a perfect fit
(Lagrange interpolation). Error is zero.
• Is it always good to minimize the training error?

10

-2

-4

-6

-8

-1.5 -1 -0.5 0 0.5 1 1.5

CS 2750 Machine Learning

Overfitting
• For 10 data points, degree 9 polynomial gives a perfect fit
(Lagrange interpolation). Error is zero.
• Is it always good to minimize the training error? NO !!
• More important: How do we perform on the unseen data?

10

-2

-4

-6

-8

-1.5 -1 -0.5 0 0.5 1 1.5

CS 2750 Machine Learning

14
Overfitting
Situation when the training error is low and the generalization
error is high. Causes of the phenomenon:
• Model with a large number of parameters (degrees of freedom)
• Small data size (as compared to the complexity of the model)

10

-2

-4

-6

-8

-1.5 -1 -0.5 0 0.5 1 1.5

CS 2750 Machine Learning

How to evaluate the learner’s performance?


• Generalization error is the true error for the population of
examples we would like to optimize
E ( x , y ) [( y − f ( x )) 2 ]
• But it cannot be computed exactly
• Sample mean only approximates the true mean

• Optimizing (mean) training error can lead to the overfit,


i.e. training error may not reflect properly the generalization
error
1
∑ ( y i − f ( xi )) 2
n i =1,.. n
• So how to test the generalization error?

CS 2750 Machine Learning

15
How to evaluate the learner’s performance?
• Generalization error is the true error for the population of
examples we would like to optimize
E ( x , y ) [( y − f ( x )) 2 ]
• Sample mean only approximates it
• How to measure the generalization error?
• Two ways:
– Theoretical: Law of large numbers
• statistical bounds on the difference between true and
sample mean errors
– Practical: Use a separate data set with m data samples to
test
1
• (Mean) test error ∑
m j =1,.. m
( y j − f ( x j )) 2

CS 2750 Machine Learning

Basic experimental setup to test the learner’s


performance
1. Take a dataset D and divide it into:
• Training data set
• Testing data set

2. Use the training set and your favorite ML algorithm to train


the learner

3. Test (evaluate) the learner on the testing data set

• The results on the testing set can be used to compare different


learners powered with different models and learning algorithms

CS 2750 Machine Learning

16
Solutions for overfitting
How to make the learner avoid overfitting?
• Assure sufficient number of samples in the training set
– May not be possible
• Hold some data out of the training set = validation set
– Train (fit) on the training set (w/o data held out);
– Check for the generalization error on the validation set,
choose the model based on the validation set error
(cross-validation techniques)
• Regularization (Occam’s Razor)
– Penalize for the model complexity (number of parameters)
– Explicit preference towards simple models

CS 2750 Machine Learning

Design of a learning system (first view)

Data

Model selection

Learning

Application
or Testing

CS 2750 Machine Learning

17
Design of a learning system.
1. Data: D = {d1 , d 2 ,.., d n }
2. Model selection:
• Select a model or a set of models (with parameters)
E.g. y = ax + b + ε ε = N (0, σ )
• Select the error function to be optimized
E.g. 1 n

n i =1
( y i − f ( x i )) 2

3. Learning:
• Find the set of parameters optimizing the error function
– The model and parameters with the smallest error
4. Application:
• Apply the learned model
– E.g. predict ys for new inputs x using learned f (x )
CS 2750 Machine Learning

18

You might also like