Warming-Up To ML, and Some Simple Supervised Learners (Distance-Based "Local" Methods)
Piyush Rai
August 2, 2018
Some Notation/Nomenclature/Convention
Supervised Learning requires training data given as a set of input-output pairs {(x_n, y_n)}_{n=1}^{N}
Each input x_n is (usually) a vector containing the values of the features or attributes or covariates
that encode properties of the data it represents, e.g.,
Representing a 7 × 7 image: x_n can be a 49 × 1 vector of pixel intensities
Note: Good features can also be learned from data (feature learning) or extracted using hand-crafted
rules defined by a domain expert. Having a good set of features is half the battle won!
Each y_n is the output or response or label associated with input x_n
The output y_n can be a scalar, a vector of numbers, or a structured object (more on this later)
Some Notation/Nomenclature/Convention
Will assume each input x_n to be a D × 1 column vector (its transpose x_n^⊤ will be a row vector)
Note: If each y_n itself is a vector (we will see such cases later) then we will use a matrix Y to
collectively denote all the N outputs (with row n containing y_n) and also use the boldfaced y_n
Getting Features from Raw Data: A Simple Example
Consider the feature representation for some text data consisting of the following sentences:
John likes to watch movies
Mary likes movies too
John also likes football
Our feature “vocabulary” consists of the 9 unique words appearing in these sentences
Here is the bag-of-words feature vector representation of these 3 sentences: each sentence becomes a
9-dimensional vector of word counts over the vocabulary (a small sketch follows below)
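As a rough illustration (not from the original slides), the following minimal Python sketch builds the vocabulary and the bag-of-words vectors for these three sentences; the function and variable names are made up for this example:

# Minimal bag-of-words sketch (illustrative): build a vocabulary from the
# three sentences and represent each sentence as a vector of word counts.
sentences = [
    "John likes to watch movies",
    "Mary likes movies too",
    "John also likes football",
]

# Vocabulary = unique words, in order of first appearance
vocab = []
for s in sentences:
    for w in s.lower().split():
        if w not in vocab:
            vocab.append(w)

def bow_vector(sentence, vocab):
    """Count-based bag-of-words feature vector for one sentence."""
    words = sentence.lower().split()
    return [words.count(v) for v in vocab]

X = [bow_vector(s, vocab) for s in sentences]
print(vocab)                      # 9 unique words
for s, x in zip(sentences, X):
    print(s, "->", x)             # each sentence -> a 9-dim count vector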
Features (in the vector x_n) as well as outputs y_n can be real-valued, binary, categorical, ordinal, etc.
Real-valued: Pixel intensity, house area, house price, rainfall amount, temperature, etc
Often, the features can be of mixed types (some real, some categorical, some ordinal, etc.)
Appropriate handling of different types of features may be very important (even if your algorithm is
designed to “learn” good features, given a set of heterogeneous features)
In supervised learning, different types of outputs may require different types of learning models
Supervised Learning
Supervised Learning comes in many flavors. The flavor depends on the type of each output y_n
Regression: y_n ∈ R (real-valued scalar)
Multi-Output Regression: y_n ∈ R^M (real-valued vector containing M outputs)
Example: [0.3 0.1 0.2 0.8 0.4] is a 5-dim output vector for a multi-output regression problem
Binary Classification: y_n ∈ {−1, +1} or {0, 1} (the output in classification is also called “label”)
Multi-class Classification: y_n ∈ {1, 2, . . . , M} or {0, 1, . . . , M − 1} (one of M classes is the correct label)
Example: [0 0 0 1 0] is a 5-dim one-hot label vector for a multi-class classification problem
Multi-label Classification: y_n ∈ {−1, +1}^M or {0, 1}^M (a subset of the M labels are correct)
Example: [1 0 1 0 0] is a 5-dim binary label vector for a multi-label classification problem
(unlike one-hot, there can be multiple 1s)
Note: Multi-label classification is also informally called “tagging” (especially in Computer Vision)
Background: Computing Distances/Similarities
Assuming all real-valued features, an input x_n ∈ R^{D×1} is a point in a D-dimensional vector space of reals
Standard rules of vector algebra apply to such representations, e.g.,
Euclidean distance between two points (say two images or two documents) x_n ∈ R^D and x_m ∈ R^D:
d(x_n, x_m) = \|x_n - x_m\| = \sqrt{(x_n - x_m)^\top (x_n - x_m)} = \sqrt{\sum_{d=1}^{D} (x_{nd} - x_{md})^2}
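As a quick illustration (not from the slides), here is a minimal NumPy sketch that computes this distance both ways shown in the formula; the names are chosen for this example only:

import numpy as np

# Euclidean distance between two feature vectors, written both as the
# inner-product form and via NumPy's built-in norm.
def euclidean(xn, xm):
    diff = xn - xm
    return np.sqrt(diff @ diff)            # sqrt((xn - xm)^T (xn - xm))

xn = np.array([1.0, 2.0, 3.0])
xm = np.array([2.0, 0.0, 3.0])
print(euclidean(xn, xm))                    # sqrt(5)
print(np.sqrt(np.sum((xn - xm) ** 2)))      # same, as a sum of squared differences
print(np.linalg.norm(xn - xm))              # NumPy's built-in equivalent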
Our First (Supervised) Learning Algorithm
(we need to know nothing except how to compute distances/similarities between points!)
Prototype based Classification
Our goal: Learn a model to predict label (class) y for a new test example x
A simple “distance from means” model: predict the class whose mean is closer to the test input (a small sketch follows below)
Note: The basic idea easily generalizes to more than 2 classes as well
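The following is a minimal sketch of this “distance from means” idea, assuming Euclidean distance and numeric class labels; it is illustrative rather than the exact formulation from the slides:

import numpy as np

# "Distance from means" (prototype) classifier sketch: compute one mean per
# class from the training data, then assign a test point to the class whose
# mean is closest under Euclidean distance. Works for any number of classes.
def fit_prototypes(X, y):
    """X: (N, D) inputs, y: (N,) class labels. Returns {class: mean vector}."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(prototypes, x):
    """Predict the class whose mean is closest to x."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Toy 2-class example (hypothetical data, for illustration only)
X = np.array([[0.0, 0.0], [1.0, 0.5], [5.0, 5.0], [6.0, 4.5]])
y = np.array([-1, -1, +1, +1])
protos = fit_prototypes(X, y)
print(predict(protos, np.array([0.5, 0.2])))   # prints -1
print(predict(protos, np.array([5.5, 5.0])))   # prints 1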
Prototype based Classification: The Decision Rule
We saw that our decision rule was
f(x) := \|\mu_- - x\|^2 - \|\mu_+ - x\|^2 = 2\langle \mu_+ - \mu_-, x\rangle + \|\mu_-\|^2 - \|\mu_+\|^2
where µ_+ and µ_− are the means of the positive-class and negative-class training inputs, and we predict the positive class if f(x) > 0
Imp.: f(x) effectively denotes a hyperplane-based classification rule f(x) = w^⊤x + b, with the
vector w = µ_+ − µ_− representing the direction normal to the hyperplane
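For completeness, the second equality follows by expanding both squared norms; the \|x\|^2 terms cancel:

\|\mu_- - x\|^2 - \|\mu_+ - x\|^2
  = \bigl(\|\mu_-\|^2 - 2\langle \mu_-, x\rangle + \|x\|^2\bigr) - \bigl(\|\mu_+\|^2 - 2\langle \mu_+, x\rangle + \|x\|^2\bigr)
  = 2\langle \mu_+ - \mu_-, x\rangle + \|\mu_-\|^2 - \|\mu_+\|^2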
Imp.: Can show that the rule is equivalent to f(x) = \sum_{n=1}^{N} \alpha_n \langle x_n, x\rangle + b, where the α's and b can be
estimated from training data (try this as an exercise)
This form of the decision rule is very important. Decision rules for many (in fact most) supervised
learning algorithms can be written like this (weighted sum of similarities with all the training inputs)
Be Careful when Computing Distances
Euclidean distance d(x_n, x_m) = \sqrt{(x_n - x_m)^\top (x_n - x_m)} may not always be appropriate
Another alternative (still Euclidean-like) is to use the Mahalanobis distance
d_M(x_n, x_m) = \sqrt{(x_n - x_m)^\top M (x_n - x_m)}
An illustration (figure not to scale) shows what M = diag(1, 2) will do: differences along the second feature dimension contribute twice as much to the squared distance (a small sketch follows below)
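A minimal NumPy sketch of this Mahalanobis-style distance, using the diagonal M = diag(1, 2) from the illustration above (the names and toy points are chosen for this example):

import numpy as np

# d_M(xn, xm) = sqrt((xn - xm)^T M (xn - xm)); with M = diag(1, 2) the
# second feature's differences are weighted twice as heavily.
def mahalanobis(xn, xm, M):
    diff = xn - xm
    return np.sqrt(diff @ M @ diff)

M = np.array([[1.0, 0.0],
              [0.0, 2.0]])
xn = np.array([0.0, 0.0])
xm = np.array([1.0, 1.0])
print(mahalanobis(xn, xm, M))             # sqrt(1*1 + 2*1) = sqrt(3)
print(mahalanobis(xn, xm, np.eye(2)))     # M = identity recovers Euclidean: sqrt(2)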
Distance Metric Learning is one of the many approaches for feature learning from data
1 Distance Metric Learning. See “A Survey on Metric Learning for Feature Vectors and Structured Data” by Bellet et al.
Prototype based Classification: Some Comments
A very simple supervised learner. Works for any number of classes. Trivial to implement. :-)
This simple approach, if using Euclidean distances, can only learn linear decision boundaries
A reason: The basic approach implicitly assumes that classes are roughly spherical and equi-sized
Several nice improvements/generalizations possible (some of which we will see in coming lectures)
Instead of a point (mean), model classes by prob. distributions (to account for class shapes/sizes)
Instead of Euclidean distances, can use non-Euclidean distances, distance metric learning, or “kernels”
Another limitation: Needs plenty of training data from each class to reliably estimate the means
But with a good feature learner, even ONE (or very few) example per class may be enough (a
state-of-the-art “Few-Shot Learning” model actually uses Prototype based classification)
Another Simple Supervised Learner: Nearest Neighbors
Nearest Neighbor
The label y for a test input x ∈ R^D will be the label of its nearest neighbor in the training data. Also known as
one-nearest-neighbor (1-NN)
Euclidean/Mahalanobis distance can be used to find the nearest neighbor (or can use a learned
distance metric)
We typically use more (K > 1) neighbors in practice
Note: The method is widely applicable - works for both classification and regression problems
K -Nearest Neighbors (K -NN)
Makes one-nearest-neighbor more robust by using more than one neighbor
At test time, simply take a majority vote (or average) of the labels of the K closest training inputs
For a test input x, the averaging version of the K-nearest-neighbors prediction rule is
y = \frac{1}{K} \sum_{n \in N_K(x)} y_n
where N_K(x) is the set of the K closest training inputs to x
The above assumes the K neighbors have equal (1/K) weights. Can also use distance-based weights
Note: The rule works for multi-label classification too, where each y_n ∈ {0, 1}^M is a binary vector
Averaging will give a real-valued “label score vector” y ∈ R^M, using which we can find the best label(s)
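A minimal NumPy sketch of this K-NN prediction rule, assuming Euclidean distance and equal 1/K weights (names and toy data are made up for this example):

import numpy as np

# K-NN prediction sketch. For regression, return the average of the K
# neighbors' outputs; for classification, y_train can hold one-hot (or
# multi-label binary) vectors and the same averaging gives per-class scores.
def knn_predict(X_train, y_train, x, K=3):
    dists = np.linalg.norm(X_train - x, axis=1)    # distance to every training input
    nearest = np.argsort(dists)[:K]                 # indices of the K closest inputs
    return y_train[nearest].mean(axis=0)            # average of their outputs

# Toy regression example
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_predict(X_train, y_train, np.array([1.2]), K=2))   # average of y at x=1 and x=2, i.e. 2.5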
K -NN for Multi-Label Learning: Pictorial Illustration
Suppose K = 3. The label averaging for a multi-label learning problem will look like
y = (1/3) ([1 0 0 1 0] + [1 0 1 1 0] + [1 0 0 0 1]) = [1 0 0.33 0.67 0.33]
Ranking the labels by their scores: label 1 is ranked #1, label 4 is #2, labels 3 and 5 are tied at #3, and label 2 is #4
Note that we can use the final y to rank the labels based on these real-valued scores
Can use it to predict the best label, the best 2, the best 3, and so on
Note: This is why multi-label learning is often used in some ranking problems where we wish to
predict a ranking of the possible labels an input can have
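A tiny NumPy sketch of this averaging-and-ranking step, using the numbers from the illustration above (purely illustrative):

import numpy as np

# Average the K = 3 neighbors' binary label vectors, then rank labels by score.
neighbor_labels = np.array([[1, 0, 0, 1, 0],
                            [1, 0, 1, 1, 0],
                            [1, 0, 0, 0, 1]])
scores = neighbor_labels.mean(axis=0)      # [1.0, 0.0, 0.33, 0.67, 0.33]
ranking = np.argsort(-scores)              # label indices from best to worst
print(scores)
print(ranking + 1)                         # [1, 4, 3, 5, 2] (ties broken by index order)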
How to Select K : Cross-Validation
We can use cross-validation to select the “optimal” value of K
Cross-validation: divide the training data into two parts, an actual training set and a validation set
Try different values of K and look at the accuracies on the validation set
Note: For each K , we typically try multiple splits of train and validation sets
Select the K that gives the best accuracy on the validation set
Never touch the test set (even if you have access to it) during training to choose the best K
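As a rough sketch of this procedure (a single hold-out split for simplicity; in practice one would repeat it over multiple splits, as noted above), assuming integer class labels 0, 1, 2, ... and Euclidean distance; all names here are made up for the example:

import numpy as np

def knn_classify(X_train, y_train, x, K):
    """Majority-vote K-NN classification (class labels must be 0, 1, 2, ...)."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:K]
    votes = y_train[nearest]
    return np.bincount(votes).argmax()

def choose_K(X, y, candidate_Ks, val_fraction=0.25, seed=0):
    """Hold out a validation set, evaluate each candidate K on it, keep the best."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, tr_idx = idx[:n_val], idx[n_val:]
    best_K, best_acc = None, -1.0
    for K in candidate_Ks:
        preds = [knn_classify(X[tr_idx], y[tr_idx], x, K) for x in X[val_idx]]
        acc = np.mean(np.array(preds) == y[val_idx])
        if acc > best_acc:
            best_K, best_acc = K, acc
    return best_K

# Example usage: best_K = choose_K(X_train, y_train, candidate_Ks=[1, 3, 5, 7])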
Some Aspects of Nearest Neighbors
A simple yet very effective method in practice (if given lots of training data)
Provably has an error rate that is (asymptotically) no worse than twice that of the “Bayes optimal” classifier, which assumes
knowledge of the true data distribution for each class
Need to be careful in choosing the distance function used to compare inputs (especially when the
data dimension D is very large)
The 1-NN can suffer if data contains outliers (we will soon see a geometric illustration), or if
amount of training data is small. Using more neighbors (K > 1) is usually more robust
Geometry of 1-NN
The Decision Boundary of 1-NN (for binary classification)
The decision boundary is composed of hyperplanes that form perpendicular bisectors of pairs of
points from different classes
The decision boundary can change drastically when the data contains some outliers
Too small K (e.g., K = 1) can lead to overfitting, too large K can lead to underfitting
Both methods (distance-from-means and nearest neighbors) are essentially “local” (they look at a local neighborhood of the test point)
Both are simple to understand and only require knowledge of basic geometry
Have connections to other more advanced methods (as we will see)
Need to be careful when computing the distances (learned Mahalanobis distance metrics, or
“learned features” + Euclidean distance can often do wonders!)