Lect3 Supervised1

Download as pdf or txt
Download as pdf or txt
You are on page 1of 25

Supervised

learning:
Classification and
Regression
CCCS 416 Applied machine learning

Mostly from Applied machine learning in Python-University of Michigan-


Dr. Kevyn Collins-Thompson
Recall

https://link.springer.com/chapter/10.1007%2F978-1-4842-6513-0_2
Outline

qSupervised leaning
qClassification vs regression
qGeneralization, overfitting, and underfitting
APPLIED MACHINE
LEARNING IN PYTHON

Supervised Machine Learning

Supervised machine learning: learn to predict target values from labelled


data.
• Classification (target values are discrete classes)
• Regression (target values are continuous values)
Classification vs regression

https://www.slideshare.net/EdurekaIN/linear-regression-vs-logistic-regression-edureka
Classification vs regression

https://in.springboard.com/blog/regression-vs-classification-in-machine-learning/
Supervised Learning
APPLIED MACHINE
LEARNING IN PYTHON

(classification example)
Training set Future sample
X Y
Sample Target Value (Label)
Classifier
𝑥𝑥1 Apple 𝑦𝑦1 f : X →Y

𝑥𝑥2 Lemon 𝑦𝑦2 f f

𝑥𝑥3 Apple 𝑦𝑦3 At training time, the


classifier uses labelled Label: Orange
examples to learn rules for After training, at prediction
𝑥𝑥4 recognizing each fruit type.
Orange 𝑦𝑦4 time, the trained model is
used to predict the fruit type
for new instances using the
learned rules.
APPLIED MACHINE
LEARNING IN PYTHON

Examples of explicit and implicit label sources


Explicit labels Implicit labels

"cat"

Task "dog" Human judges/


requester annotators
"cat"
"house"

"cat"
Clicking and reading the "Mackinac Island" result can be
an implicit label for the search engine to learn that
"dog" "Mackinac Island" is especially relevant for the query
[vacations in michigan] for that specific user.
Crowdsourcing platform
APPLIED MACHINE
LEARNING IN PYTHON

A Basic Machine Learning Workflow

Representation Evaluation Optimization

Choose: Choose: Choose:


• A feature representation • What criterion • How to search for the
• Type of classifier to use distinguishes good vs. bad settings/parameters that
e.g. image pixels, with
classifiers? give the best classifier
k-nearest neighbor classifier for this evaluation criterion
e.g. % correct predictions on test set
e.g. try a range of values for "k" parameter
in k-nearest neighbor classifier
APPLIED MACHINE
LEARNING IN PYTHON

Feature Representations
Feature Count Feature representation
to 1
To: Chris Brooks chris 2
From: Daniel Romero brooks 1
Email Subject: Next course offering
Hi Daniel,
from 1 A list of words with
daniel 2
Could you please send the outline for the romero 1
their frequency counts
next course offering? Thanks! -- Chris the 2
...

Picture A matrix of color


values (pixels)

Feature Value

DorsalFin Yes
Sea Creatures MainColor
Stripes
Orange
Yes A set of attribute values
StripeColor1 White
StripeColor2 Black
Length 4.3 cm
APPLIED MACHINE
LEARNING IN PYTHON

Representing a piece of fruit as an array of features (plus label information)

width 1. Feature representation


Label information
(available in training data only) Feature representation
height

2. Learning model Classifier


mass
color

Predicted class
(apple)
APPLIED MACHINE
LEARNING IN PYTHON

Represent / Train / Evaluate / Refine Cycle


Representation:
Extract and
select object
features
APPLIED MACHINE
LEARNING IN PYTHON

Represent / Train / Evaluate / Refine Cycle


Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features
APPLIED MACHINE
LEARNING IN PYTHON

Represent / Train / Evaluate / Refine Cycle


Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features

Evaluation
APPLIED MACHINE
LEARNING IN PYTHON

Represent / Train / Evaluate / Refine Cycle


Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features

Feature and
model Evaluation
refinement
APPLIED MACHINE
LEARNING IN PYTHON

The fruit dataset

lemon
height

apple
orange

mass mandarin orange


color_score
fruit_data_with_colors.txt

Credit: Original version of the fruit dataset created by Dr. Iain Murray, Univ. of Edinburgh
APPLIED MACHINE
LEARNING IN PYTHON

The input data as a table

Each row corresponds to a


single data instance (sample)

The fruit_label column contains


the label for each data instance
(sample)
APPLIED MACHINE
LEARNING IN PYTHON

The scale for the (simplistic) color_score feature


used in the fruit dataset

0.00 0.25 0.50 0.75 1.00


Color category color_score

Red 0.85 - 1.00


Orange 0.75 - 0.85
Yellow 0.65 - 0.75
Green 0.45 - 0.65
APPLIED MACHINE
LEARNING IN PYTHON

Creating Training and Testing Sets


X y X_train y_train X_test y_test


X_train, X_test, y_train, y_test
= train_test_split(X, y)

Original data set Training set Test set


APPLIED MACHINE
LEARNING IN PYTHON
Some reasons why looking at the data initially is
important
Examples of incorrect or missing feature values

• Inspecting feature values may help identify


what cleaning or preprocessing still needs to
be done once you can see the range or
distribution of values that is typical for each
attribute.
• You might notice missing or noisy data, or
inconsistencies such as the wrong data type
being used for a column, incorrect units of
measurements for a particular column, or that
there aren’t enough examples of a particular
class.
• You may realize that your problem is
actually solvable without machine
learning.
APPLIED MACHINE
LEARNING IN PYTHON

Coding Example
• Training and test sets
• Model/Estimator
– Model fitting
produces a 'trained
model'.
– Training is the
process of estimating
model parameters.
• Evaluation method
APPLIED MACHINE
LEARNING IN PYTHON

Generalization, Overfitting, and Underfitting


• Generalization ability refers to an algorithm's ability to give accurate predictions
for new, previously unseen data.
• Assumptions:
– Future unseen data (test set) will have the same properties as the current training sets.
– Thus, models that are accurate on the training set are expected to be accurate on the test
set.
– But that may not happen if the trained model is tuned too specifically to the training set.
• Models that are too complex for the amount of training data available are said to
overfit and are not likely to generalize well to new examples.
• Models that are too simple, that don't even do well on the training data, are
said to underfit and also not likely to generalize well.
APPLIED MACHINE
LEARNING IN PYTHON

Overfitting in regression

Target variable

Input variable
APPLIED MACHINE
LEARNING IN PYTHON

Overfitting in classification

Feature 2

Feature 1
Summary of today’s lesson
• Supervised machine learning learn to predict target values from labelled data.
• Classification is the task for predicting discrete values while regression is for
predicting continuous values.
• Represent, train, evaluate, refine cycle.
• Models that are too complex (overfit) or too simple (underfit) decreases the
algorithm ability to generalize

You might also like