
Warming-up to ML, and Some Simple Supervised Learners

(Distance-based “Local” Methods)

Piyush Rai

Introduction to Machine Learning (CS771A)

August 2, 2018

Announcements

Please sign-up on Piazza if you haven’t already


I’ll be clearing all the add-drop requests by tomorrow
Maths refresher tutorial on Aug 4, 6:00-7:30pm in RM-101
Will be mostly on the basics of multivariate calculus, linear algebra, prob/stats, optimization (basically
things you are expected to know for this course)

Some Notation/Nomenclature/Convention

Supervised Learning requires training data given as a set of input-output pairs {(x_n, y_n)}_{n=1}^N

Unsupervised Learning requires training data given as a set of inputs {x_n}_{n=1}^N

Each input x_n is (usually) a vector containing the values of the features or attributes or covariates that encode properties of the data it represents, e.g.,
Representing a 7 × 7 image: x_n can be a 49 × 1 vector of pixel intensities

Note: Good features can also be learned from data (feature learning) or extracted using hand-crafted rules defined by a domain expert. Having a good set of features is half the battle won!
Each y_n is the output or response or label associated with input x_n
The output y_n can be a scalar, a vector of numbers, or a structured object (more on this later)

Some Notation/Nomenclature/Convention
Will assume each input x_n to be a D × 1 column vector (its transpose x_n^T will be a row vector)

x_nd will denote the d-th feature of the n-th input

We will use X (the N × D feature matrix) to collectively denote all the N inputs
We will use y (the N × 1 output/response/label vector) to collectively denote all the N outputs
[Figure: the N × D feature matrix X, whose n-th row is x_n^T = (x_n1, x_n2, ..., x_nD), shown alongside the N × 1 vector y of outputs, whose n-th entry is y_n (the output for input n)]

Note: If each y_n itself is a vector (we will see such cases later) then we will use a matrix Y to collectively denote all the N outputs (with row n containing y_n) and also use boldfaced y_n
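To make this convention concrete, here is a minimal NumPy sketch (the sizes N = 5, D = 3 and the random data are arbitrary, just for illustration):

```python
import numpy as np

N, D = 5, 3                   # N inputs, each a D-dimensional feature vector
X = np.random.randn(N, D)     # feature matrix X: row n holds the input x_n (transposed)
y = np.random.randn(N)        # output vector y: entry n is the output y_n

x_2 = X[2]                    # the input x_n for n = 2 (0-indexed), a D-dim vector
x_2d = X[2, 1]                # the single feature x_nd for n = 2, d = 1
print(X.shape, y.shape)       # (5, 3) (5,)
```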
Getting Features from Raw Data: A Simple Example
Consider the feature representation for some text data consisting of the following sentences:
John likes to watch movies
Mary likes movies too
John also likes football
Our feature “vocabulary” consists of the 9 unique words appearing in these sentences
Here is the bag-of-words feature vector representation of these 3 sentences

Here the features are binary (presence/absence of each word)
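A minimal sketch of how these binary bag-of-words vectors can be computed (the vocabulary ordering below, first occurrence first, is just one possible choice):

```python
# Binary bag-of-words features for the three example sentences
sentences = [
    "John likes to watch movies",
    "Mary likes movies too",
    "John also likes football",
]

# Build the vocabulary: all unique words, in order of first appearance
vocab = []
for s in sentences:
    for w in s.lower().split():
        if w not in vocab:
            vocab.append(w)

def bow_vector(sentence, vocab):
    """1 if the word occurs in the sentence, else 0 (presence/absence features)."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

for s in sentences:
    print(bow_vector(s, vocab))
# vocab: ['john', 'likes', 'to', 'watch', 'movies', 'mary', 'too', 'also', 'football']
# [1, 1, 1, 1, 1, 0, 0, 0, 0]
# [0, 1, 0, 0, 1, 1, 1, 0, 0]
# [1, 1, 0, 0, 0, 0, 0, 1, 1]
```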


Again, note that this may not necessarily be the best “feature” representation for a given task (which is
why other techniques or feature learning may be needed)
Types of Features and Types of Outputs

Features (in vector x n ) as well as outputs yn can be real-valued, binary, categorical, ordinal, etc.
Real-valued: Pixel intensity, house area, house price, rainfall amount, temperature, etc

Binary: Male/female, adult/non-adult, or any yes/no or present/absent type values


Categorical/Discrete: Pincode, bloodgroup, or any “which one from this finite set” type values
Ordinal: Grade (A/B/C etc.) in a course, or any other type where relative values matter

Often, the features can be of mixed types (some real, some categorical, some ordinal, etc.)
Appropriate handling of different types of features may be very important (even if your algorithm is designed to “learn” good features, given a set of heterogeneous features)
In Sup. Learning, different types of outputs may require different types of learning models

Supervised Learning

Supervised Learning
Supervised Learning comes in many flavors. The flavor depends on the type of each output yn
Regression: y_n ∈ R (real-valued scalar)
Multi-Output Regression: y_n ∈ R^M (real-valued vector containing M outputs)
[0.3  0.1  0.2  0.8  0.4] (illustration of a 5-dim output vector for a multi-output regression problem)

Binary Classification: y_n ∈ {−1, +1} or {0, 1} (output in classification is also called “label”)
Multi-class Classification: y_n ∈ {1, 2, ..., M} or {0, 1, ..., M − 1} (one of M classes is the correct label)
[0  0  0  1  0] (illustration of a 5-dim one-hot label vector for a multi-class classification problem)

Multi-label Classification: y_n ∈ {−1, +1}^M or {0, 1}^M (a subset of the M labels are correct)
[1  0  1  0  0] (illustration of a 5-dim binary label vector for a multi-label classification problem; unlike one-hot, there can be multiple 1s)

Note: Multi-label classification is also informally called “tagging” (especially in Computer Vision)
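For concreteness, a small sketch of how these different output types might be stored, using the 5-dimensional examples from this slide (the variable names are mine, purely illustrative):

```python
import numpy as np

y_regression   = 3.7                                    # scalar real-valued output
y_multi_output = np.array([0.3, 0.1, 0.2, 0.8, 0.4])    # M = 5 real-valued outputs
y_binary       = +1                                     # label in {-1, +1}
y_multiclass   = np.array([0, 0, 0, 1, 0])              # one-hot: the 4th of M = 5 classes
y_multilabel   = np.array([1, 0, 1, 0, 0])              # subset of labels: 1st and 3rd are "on"
```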
Supervised Learning (Contd.)

Structured-Prediction (a.k.a. Structured Output Learning): Each yn is a structured object

One-Class Classification (a.k.a. outlier/anomaly/novelty detection): y_n is “1” or “everything else”

[Figure: examples from the class being modeled (e.g., animals) vs. all other examples (“outliers”, e.g., humans, vehicles, etc.)]

Ranking: Each yn is a ranked list of relevant stuff for a given input/query x

Background: Computing Distances/Similarities
Assuming all real-valued features, an input x_n ∈ R^{D×1} is a point in a D-dimensional vector space of reals
Standard rules of vector algebra apply on such representations, e.g.,
Euclidean distance b/w two points (say two images or two documents) x_n ∈ R^D and x_m ∈ R^D

d(x_n, x_m) = \|x_n - x_m\| = \sqrt{(x_n - x_m)^\top (x_n - x_m)} = \sqrt{\sum_{d=1}^{D} (x_{nd} - x_{md})^2}

Inner-product similarity b/w x_n and x_m (equals cosine similarity when x_n, x_m are unit-length vectors)

s(x_n, x_m) = \langle x_n, x_m \rangle = x_n^\top x_m = \sum_{d=1}^{D} x_{nd} x_{md}

ℓ1 distance between two points x_n and x_m

d_1(x_n, x_m) = \|x_n - x_m\|_1 = \sum_{d=1}^{D} |x_{nd} - x_{md}|
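A minimal NumPy sketch of the three quantities above (the function names are mine, not standard library API):

```python
import numpy as np

def euclidean(xn, xm):
    # ||x_n - x_m|| = sqrt((x_n - x_m)^T (x_n - x_m))
    return np.sqrt(np.sum((xn - xm) ** 2))

def inner_product(xn, xm):
    # <x_n, x_m> = sum_d x_nd * x_md (cosine similarity if both are unit-length)
    return np.dot(xn, xm)

def l1_distance(xn, xm):
    # ||x_n - x_m||_1 = sum_d |x_nd - x_md|
    return np.sum(np.abs(xn - xm))

xn = np.array([1.0, 2.0, 3.0])
xm = np.array([2.0, 0.0, 3.0])
print(euclidean(xn, xm), inner_product(xn, xm), l1_distance(xn, xm))
# 2.236...  11.0  3.0
```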

Our First (Supervised) Learning Algorithm
(need to know nothing except how to
compute distances/similarities between points!)

Prototype based Classification

Given: N labeled training examples {(x_n, y_n)}_{n=1}^N from two classes

Assume green is the positive class and red is the negative class

N_+ examples from the positive class, N_− examples from the negative class

Our goal: Learn a model to predict label (class) y for a new test example x

A simple “distance from means” model: predict the class whose mean is closer to the test example
Note: The basic idea easily generalizes to more than 2 classes as well

Prototype based Classification: More Formally

What does the decision rule look like, mathematically?

The mean of each class is given by


\mu_- = \frac{1}{N_-} \sum_{n: y_n = -1} x_n \qquad \text{and} \qquad \mu_+ = \frac{1}{N_+} \sum_{n: y_n = +1} x_n

The (squared) Euclidean distances from each mean are given by

\|\mu_- - x\|^2 = \|\mu_-\|^2 + \|x\|^2 - 2\langle \mu_-, x \rangle
\|\mu_+ - x\|^2 = \|\mu_+\|^2 + \|x\|^2 - 2\langle \mu_+, x \rangle

Decision Rule: If f(x) := \|\mu_- - x\|^2 - \|\mu_+ - x\|^2 > 0 then predict +1, otherwise predict −1
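A minimal sketch of this “distance from means” classifier for two classes labeled −1/+1, on synthetic toy data (the function names and the data below are illustrative assumptions, not from the slides):

```python
import numpy as np

def fit_prototypes(X, y):
    """Compute the class means mu_minus and mu_plus from labels in {-1, +1}."""
    mu_minus = X[y == -1].mean(axis=0)
    mu_plus  = X[y == +1].mean(axis=0)
    return mu_minus, mu_plus

def predict(x, mu_minus, mu_plus):
    """Predict +1 if x is closer to mu_plus than to mu_minus, else -1."""
    f = np.sum((mu_minus - x) ** 2) - np.sum((mu_plus - x) ** 2)
    return +1 if f > 0 else -1

# Toy data: two blob-like classes (synthetic, for illustration only)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=1.0, size=(20, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

mu_minus, mu_plus = fit_prototypes(X, y)
print(predict(np.array([1.5, 1.0]), mu_minus, mu_plus))    # close to the positive blob: +1
print(predict(np.array([-2.5, -1.0]), mu_minus, mu_plus))  # close to the negative blob: -1
```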

Prototype based Classification: The Decision Rule
We saw that our decision rule was

f(x) := \|\mu_- - x\|^2 - \|\mu_+ - x\|^2 = 2\langle \mu_+ - \mu_-, x \rangle + \|\mu_-\|^2 - \|\mu_+\|^2

Imp.: f(x) effectively denotes a hyperplane-based classification rule f(x) = w^\top x + b with the vector w = \mu_+ - \mu_- representing the direction normal to the hyperplane

Imp.: Can show that the rule is equivalent to f(x) = \sum_{n=1}^{N} \alpha_n \langle x_n, x \rangle + b, where the α's and b can be estimated from training data (try this as an exercise)
This form of the decision rule is very important. Decision rules for many (in fact most) supervised
learning algorithms can be written like this (weighted sum of similarities with all the training inputs)
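If you want to sanity-check this equivalence numerically, here is a small sketch; note that the particular α_n and b used below essentially give away the exercise, so skip this block if you prefer to derive them yourself:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = np.array([+1] * 4 + [-1] * 6)

mu_plus  = X[y == +1].mean(axis=0)
mu_minus = X[y == -1].mean(axis=0)
x = rng.normal(size=3)  # a test point

# Direct form: f(x) = 2 <mu_+ - mu_-, x> + ||mu_-||^2 - ||mu_+||^2
f_direct = (2 * np.dot(mu_plus - mu_minus, x)
            + np.dot(mu_minus, mu_minus) - np.dot(mu_plus, mu_plus))

# Weighted-sum-of-similarities form: f(x) = sum_n alpha_n <x_n, x> + b
N_plus, N_minus = (y == +1).sum(), (y == -1).sum()
alpha = np.where(y == +1, 2.0 / N_plus, -2.0 / N_minus)
b = np.dot(mu_minus, mu_minus) - np.dot(mu_plus, mu_plus)
f_weighted = np.sum(alpha * (X @ x)) + b

print(np.isclose(f_direct, f_weighted))  # True: the two forms agree
```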
Be Careful when Computing Distances
Euclidean distance d(x_n, x_m) = \sqrt{(x_n - x_m)^\top (x_n - x_m)} may not always be appropriate
Another alternative (still Euclidean-like) can be to use the Mahalanobis distance

d_M(x_n, x_m) = \sqrt{(x_n - x_m)^\top M (x_n - x_m)}

Shown below is an illustration of what M = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} will do (note: figure not to scale)

[Figure: the original space vs. the “effective” space under the Mahalanobis transformation]
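A minimal sketch of the Mahalanobis distance with the M shown above (the helper name mahalanobis is mine, not a library function):

```python
import numpy as np

def mahalanobis(xn, xm, M):
    # d_M(x_n, x_m) = sqrt((x_n - x_m)^T M (x_n - x_m))
    diff = xn - xm
    return np.sqrt(diff @ M @ diff)

M = np.array([[1.0, 0.0],
              [0.0, 2.0]])   # the matrix from the slide: stretches the 2nd axis

xn = np.array([0.0, 0.0])
xm = np.array([1.0, 1.0])
print(mahalanobis(xn, xm, np.eye(2)))  # M = I recovers Euclidean: sqrt(2) ≈ 1.414
print(mahalanobis(xn, xm, M))          # Mahalanobis: sqrt(1 + 2) ≈ 1.732
```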

How do I know what’s the right M for my data? Some options:

Set it based on some knowledge of what your data looks like
Learn it from data (called Distance Metric Learning¹ - a whole research area in itself)

Distance Metric Learning is one of the many approaches for feature learning from data
1 Distance Metric Learning. See “A Survey on Metric Learning for Feature Vectors and Structured Data” by Bellet et al.
Prototype based Classification: Some Comments

A very simple supervised learner. Works for any number of classes. Trivial to implement. :-)
This simple approach, if using Euclidean distances, can only learn linear decision boundaries
A reason: The basic approach implicitly assumes that classes are roughly spherical and equi-sized
Several nice improvements/generalizations possible (some of which we will see in coming lectures)
Instead of a point (mean), model classes by prob. distributions (to account for class shapes/sizes)
Instead of Euclidean distances, can use non-Euclidean distances, distance metric learning, or “kernels”
Another limitation: Needs plenty of training data from each class to reliably estimate the means
But with a good feature learner, even ONE (or very few) example per class may be enough (a
state-of-the-art “Few-Shot Learning” model actually uses Prototype based classification)
Another Simple Supervised Learner:
Nearest Neighbors

Nearest Neighbor

Another classic distance-based supervised learning method

The label y for x ∈ R^D will be the label of its nearest neighbor in the training data. Also known as one-nearest-neighbor (1-NN)
Euclidean/Mahalanobis distance can be used to find the nearest neighbor (or can use a learned
distance metric)
We typically use more (K > 1) neighbors in practice
Note: The method is widely applicable - works for both classification and regression problems

K -Nearest Neighbors (K -NN)
Makes one-nearest-neighbor more robust by using more than one neighbor
Test time simply does a majority vote (or average) of the labels of K closest training inputs

For a test input x, the averaging version of the prediction rule for K-nearest neighbors is

y = \frac{1}{K} \sum_{n \in N_K(x)} y_n

where N_K(x) is the set of the K closest training inputs to x
Above assumes the K neighbors have equal (1/K) weights. Can also use distance-based weights
Note: The rule works for multi-label classification too, where each y_n ∈ {0, 1}^M is a binary vector
Averaging will give a real-valued “label score vector” y ∈ R^M using which we can find the best label(s)
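A minimal sketch of the K-NN prediction rule with equal 1/K weights and Euclidean distance (no learned metric); the averaging branch also works elementwise when each y_n is a vector, as in the multi-label case above:

```python
import numpy as np

def knn_predict(X_train, y_train, x, K=3, classification=True):
    """K-nearest-neighbors prediction: majority vote (classification) or average (regression)."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))   # distance of x to every training input
    nearest = np.argsort(dists)[:K]                       # indices of the K closest inputs
    if classification:
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]                  # majority vote over the K labels
    return y_train[nearest].mean(axis=0)                  # average (also gives label score vectors)

# Toy usage (synthetic data, for illustration only)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.9, 1.0]), K=3))  # -> 1
```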
K -NN for Multi-Label Learning: Pictorial Illustration

Suppose K = 3. The label averaging for a multi-label learning problem will look like

y = (1/3) ( [1 0 0 1 0] + [1 0 1 1 0] + [1 0 0 0 1] ) = [1  0  0.33  0.66  0.33]

(ranking the labels by these scores: label 1 is #1, label 4 is #2, labels 3 and 5 are tied at #3, label 2 is #4)

Note that we can use the final y to rank the labels based on the real-valued scores
Can use it to predict the best, best-2, best-3, and so on
Note: This is why multi-label learning is often used in some ranking problems where we wish to
predict a ranking of the possible labels an input can have

How to Select K : Cross-Validation
We can use cross-validation to select the “optimal” value of K
Cross-validation - Divide the training data into two parts: actual training set and a validation set

[Figure: the training data is divided into an (actual) training set and a validation set; the test data is kept aside and never touched while training]

Try different values of K and look at the accuracies on the validation set
Note: For each K , we typically try multiple splits of train and validation sets

Select the K that gives the best accuracy on the validation set
Never touch the test set (even if you have access to it) during training to choose the best K
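A minimal sketch of this procedure, using random train/validation splits with the same splits for every candidate K (knn_predict is a compact helper defined here so the block is self-contained; none of these names are library API):

```python
import numpy as np

def knn_predict(X_tr, y_tr, x, K):
    """Majority-vote K-NN prediction with Euclidean distance."""
    nearest = np.argsort(np.sum((X_tr - x) ** 2, axis=1))[:K]
    labels, counts = np.unique(y_tr[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def choose_K(X, y, candidate_Ks=(1, 3, 5, 7, 9), val_frac=0.2, n_splits=5, seed=0):
    """Pick K by averaging validation accuracy over several random train/validation splits."""
    rng = np.random.default_rng(seed)
    N, n_val = len(X), int(val_frac * len(X))
    acc = {K: [] for K in candidate_Ks}
    for _ in range(n_splits):                        # multiple train/validation splits
        perm = rng.permutation(N)
        val_idx, tr_idx = perm[:n_val], perm[n_val:]
        for K in candidate_Ks:                       # evaluate every K on the same split
            preds = np.array([knn_predict(X[tr_idx], y[tr_idx], x, K) for x in X[val_idx]])
            acc[K].append(np.mean(preds == y[val_idx]))
    # Return the K with the best average validation accuracy (the test set is never used)
    return max(acc, key=lambda K: np.mean(acc[K]))
```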
Some Aspects about Nearest Neighbor
A simple yet very effective method in practice (if given lots of training data)
Provably has an error rate that is (asymptotically) no worse than twice that of the “Bayes optimal” classifier, which assumes knowledge of the true data distribution for each class

Also called a memory-based or instance-based or non-parametric method


No “model” is learned here. Prediction step uses all the training data
Requires lots of storage (need to keep all the training data at test time)
Prediction can be slow at test time
For each test point, need to compute its distance from all the training points
Clever data-structures or data-summarization techniques can provide speed-ups

Need to be careful in choosing the distance function to compute distances (especially when the
data dimension D is very large)
The 1-NN can suffer if data contains outliers (we will soon see a geometric illustration), or if
amount of training data is small. Using more neighbors (K > 1) is usually more robust
Geometry of 1-NN

1-NN induces a Voronoi tessellation of the input space

The Decision Boundary of 1-NN (for binary classification)

The decision boundary is composed of hyperplanes that form perpendicular bisectors of pairs of
points from different classes

Pic credit: Victor Lavrenko


Effect of Outliers on 1-NN

How the decision boundary can drastically change when the data contains some outliers

Pic credit: Victor Lavrenko


Effect of Varying K

Larger K leads to smoother decision boundaries

Too small K (e.g., K = 1) can lead to overfitting, too large K can lead to underfitting

Pic credit: Chris Bishop (PRML)


K -NN Behavior for Regression

Pic credit: Victor Lavrenko


K -NN Behavior for Regression

Pic credit: Alex Smola and Vishy Vishwanathan


Summary

Looked at two distance-based methods for classification/regression


A “Distance from Means” Method
Nearest Neighbors Method

Both are essentially “local” methods (look at local neighborhood of the test point)
Both are simple to understand and only require knowledge of basic geometry
Have connections to other more advanced methods (as we will see)

Need to be careful when computing the distances (learned Mahalanobis distance metrics, or
“learned features” + Euclidean distance can often do wonders!)

