Lect3 Supervised1

Supervised
learning:
Classification and
Regression
CCCS 416 Applied machine learning
Mostly from Applied machine learning in Python-University of Michigan-

Dr. Kevyn Collins-Thompson
Recall
https://link.springer.com/chapter/10.1007%2F978-1-4842-6513-0_2
Outline
qSupervised leaning
qClassification vs regression
qGeneralization, overfitting, and underfitting
APPLIED MACHINE
LEARNING IN PYTHON
Supervised Machine Learning
Supervised machine learning: learn to predict target values from labelled

data.
• Classification (target values are discrete classes)
• Regression (target values are continuous values)
Classification vs regression
https://www.slideshare.net/EdurekaIN/linear-regression-vs-logistic-regression-edureka
Classification vs regression
https://in.springboard.com/blog/regression-vs-classification-in-machine-learning/
Supervised Learning
APPLIED MACHINE
LEARNING IN PYTHON
(classification example)
Training set Future sample
X Y
Sample Target Value (Label)
Classifier
𝑥𝑥1 Apple 𝑦𝑦1 f : X →Y
𝑥𝑥2 Lemon 𝑦𝑦2 f f
𝑥𝑥3 Apple 𝑦𝑦3 At training time, the

classifier uses labelled Label: Orange
examples to learn rules for After training, at prediction
𝑥𝑥4 recognizing each fruit type.
Orange 𝑦𝑦4 time, the trained model is
used to predict the fruit type
for new instances using the
learned rules.
APPLIED MACHINE
LEARNING IN PYTHON
Examples of explicit and implicit label sources

Explicit labels Implicit labels
"cat"
Task "dog" Human judges/

requester annotators
"cat"
"house"
"cat"
Clicking and reading the "Mackinac Island" result can be
an implicit label for the search engine to learn that
"dog" "Mackinac Island" is especially relevant for the query
[vacations in michigan] for that specific user.
Crowdsourcing platform
APPLIED MACHINE
LEARNING IN PYTHON
A Basic Machine Learning Workflow
Representation Evaluation Optimization
Choose: Choose: Choose:

• A feature representation • What criterion • How to search for the
• Type of classifier to use distinguishes good vs. bad settings/parameters that
e.g. image pixels, with
classifiers? give the best classifier
k-nearest neighbor classifier for this evaluation criterion
e.g. % correct predictions on test set
e.g. try a range of values for "k" parameter
in k-nearest neighbor classifier
APPLIED MACHINE
LEARNING IN PYTHON
Feature Representations
Feature Count Feature representation
to 1
To: Chris Brooks chris 2
From: Daniel Romero brooks 1
Email Subject: Next course offering
Hi Daniel,
from 1 A list of words with
daniel 2
Could you please send the outline for the romero 1
their frequency counts
next course offering? Thanks! -- Chris the 2
...
Picture A matrix of color

values (pixels)
Feature Value
DorsalFin Yes
Sea Creatures MainColor
Stripes
Orange
Yes A set of attribute values
StripeColor1 White
StripeColor2 Black
Length 4.3 cm
APPLIED MACHINE
LEARNING IN PYTHON
Representing a piece of fruit as an array of features (plus label information)
width 1. Feature representation

Label information
(available in training data only) Feature representation
height
2. Learning model Classifier

mass
color
Predicted class
(apple)
APPLIED MACHINE
LEARNING IN PYTHON
Represent / Train / Evaluate / Refine Cycle

Representation:
Extract and
select object
features
APPLIED MACHINE
LEARNING IN PYTHON

Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features
APPLIED MACHINE
LEARNING IN PYTHON

Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features
Evaluation
APPLIED MACHINE
LEARNING IN PYTHON

Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features
Feature and
model Evaluation
refinement
APPLIED MACHINE
LEARNING IN PYTHON
The fruit dataset
lemon
height
apple
orange
mass mandarin orange

color_score
fruit_data_with_colors.txt
Credit: Original version of the fruit dataset created by Dr. Iain Murray, Univ. of Edinburgh
APPLIED MACHINE
LEARNING IN PYTHON
The input data as a table
Each row corresponds to a

single data instance (sample)
The fruit_label column contains

the label for each data instance
(sample)
APPLIED MACHINE
LEARNING IN PYTHON
The scale for the (simplistic) color_score feature

used in the fruit dataset
0.00 0.25 0.50 0.75 1.00

Color category color_score
Red 0.85 - 1.00

Orange 0.75 - 0.85
Yellow 0.65 - 0.75
Green 0.45 - 0.65
APPLIED MACHINE
LEARNING IN PYTHON
Creating Training and Testing Sets

X y X_train y_train X_test y_test
…
X_train, X_test, y_train, y_test
= train_test_split(X, y)
Original data set Training set Test set

APPLIED MACHINE
LEARNING IN PYTHON
Some reasons why looking at the data initially is
important
Examples of incorrect or missing feature values
• Inspecting feature values may help identify

what cleaning or preprocessing still needs to
be done once you can see the range or
distribution of values that is typical for each
attribute.
• You might notice missing or noisy data, or
inconsistencies such as the wrong data type
being used for a column, incorrect units of
measurements for a particular column, or that
there aren’t enough examples of a particular
class.
• You may realize that your problem is
actually solvable without machine
learning.
APPLIED MACHINE
LEARNING IN PYTHON
Coding Example
• Training and test sets
• Model/Estimator
– Model fitting
produces a 'trained
model'.
– Training is the
process of estimating
model parameters.
• Evaluation method
APPLIED MACHINE
LEARNING IN PYTHON
Generalization, Overfitting, and Underfitting

• Generalization ability refers to an algorithm's ability to give accurate predictions
for new, previously unseen data.
• Assumptions:
– Future unseen data (test set) will have the same properties as the current training sets.
– Thus, models that are accurate on the training set are expected to be accurate on the test
set.
– But that may not happen if the trained model is tuned too specifically to the training set.
• Models that are too complex for the amount of training data available are said to
overfit and are not likely to generalize well to new examples.
• Models that are too simple, that don't even do well on the training data, are
said to underfit and also not likely to generalize well.
APPLIED MACHINE
LEARNING IN PYTHON
Overfitting in regression
Target variable
Input variable
APPLIED MACHINE
LEARNING IN PYTHON
Overfitting in classification
Feature 2
Feature 1
Summary of today’s lesson
• Supervised machine learning learn to predict target values from labelled data.
• Classification is the task for predicting discrete values while regression is for
predicting continuous values.
• Represent, train, evaluate, refine cycle.
• Models that are too complex (overfit) or too simple (underfit) decreases the
algorithm ability to generalize

Lect3 Supervised1

Uploaded by

Copyright:

Available Formats

Lect3 Supervised1

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lect3 Supervised1

Uploaded by

Copyright:

Available Formats

Supervised

Mostly from Applied machine learning in Python-University of Michigan-

Supervised Machine Learning

Supervised machine learning: learn to predict target values from labelled

𝑥𝑥2 Lemon 𝑦𝑦2 f f

𝑥𝑥3 Apple 𝑦𝑦3 At training time, the

Examples of explicit and implicit label sources

Task "dog" Human judges/

A Basic Machine Learning Workflow

Representation Evaluation Optimization

Choose: Choose: Choose:

Picture A matrix of color

Representing a piece of fruit as an array of features (plus label information)

width 1. Feature representation

2. Learning model Classifier

Represent / Train / Evaluate / Refine Cycle

Represent / Train / Evaluate / Refine Cycle

Represent / Train / Evaluate / Refine Cycle

Represent / Train / Evaluate / Refine Cycle

The fruit dataset

mass mandarin orange

The input data as a table

Each row corresponds to a

The fruit_label column contains

The scale for the (simplistic) color_score feature

0.00 0.25 0.50 0.75 1.00

Red 0.85 - 1.00

Creating Training and Testing Sets

Original data set Training set Test set

• Inspecting feature values may help identify

Generalization, Overfitting, and Underfitting

You might also like