Lecture 13 - ML Linear & Log-Linear Models


Natural Language Processing
Lecture 13:
Machine Learning: Linear and Log-Linear Models

12/4/2019

COMS W4705
Yassine Benajiba
Intro
Machine Learning and NLP
• We have encountered many different situations where we had to make a prediction:

• Text classification, language modeling, POS tagging, constituency/dependency parsing, ...

• These are all classification problems of some form.

• Today: Some machine learning background. Linear/log-linear models. Basic neural networks.
Generative Algorithms
• Assume the observed data is being “generated” by a “hidden” class label.

• Build a different model for each class.

• To predict a new example, check it under each of the models and see which one matches best.

• Model P(x|y) and P(y). Then use Bayes' rule: P(y|x) = P(x|y) P(y) / P(x)
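
A minimal sketch of prediction with a generative model (the function names, and the assumption that per-class models and priors have already been estimated, are illustrative and not from the slides):

def predict_generative(x, class_models, priors):
    # class_models[y](x) returns the estimated P(x | y) for class y;
    # priors[y] is the estimated P(y).
    # By Bayes' rule, P(y | x) is proportional to P(x | y) * P(y),
    # so the shared denominator P(x) can be ignored for the argmax.
    scores = {y: class_models[y](x) * priors[y] for y in class_models}
    return max(scores, key=scores.get)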


Discriminative Algorithms

• Model the conditional distribution of the label given the data, P(y|x).

• Learn decision boundaries that separate instances of the different classes.

• To predict a new example, check on which side of the decision boundary it falls.
Machine Learning Definition
• “Creating systems that improve from experience.”

• “A computer program is said to learn from experience E


with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E.”
(Tom Mitchell, Machine Learning, 1997)
Inductive Learning (a.k.a. Science)

• Goal: given a set of input/output pairs (training data), find the function f(x) that maps inputs to outputs.
Problem: We did not see all possible inputs!

• Learn an approximate function h(x) from the training data and hope that this function generalizes well to unseen inputs.

• Ockham’s razor: Choose the simplest hypothesis that is consistent with the training data.
Classification and Regression
• Recall: In supervised learning, the training data consists of training examples (x1, y1), …, (xn, yn), where xj is an input example (a d-dimensional vector of attribute values) and yj is the label.

• Two types of supervised learning problems:

• In classification: yj comes from a finite, discrete set, typically yj ∈ {-1, +1}, i.e. predict a label from a set of labels.
Learn a classifier function h: ℝd → {-1, +1}.

• In regression: xj ∈ ℝd, yj ∈ ℝ, i.e. predict a numeric value.
Learn a regressor function h: ℝd → ℝ.
Linear Classification and Regression

[Figure: a regression example plotting h(x) against x1, and a classification example in the (x1, x2) plane with a linear decision boundary.]
Linear Classification
Training ML models
Training Data → ML algorithm → function h(x) = y

• How can we be confident about the learned function?

• Can compute the empirical error/risk on the training set: the average loss of h over the training examples.

• Typical loss functions:

• Least squares loss (L2): L(yj, h(xj)) = (yj - h(xj))²

• Classification error: L(yj, h(xj)) = 0 if h(xj) = yj, and 1 otherwise
Training ML models
Training Data → ML algorithm → function h(x) = y

• Empirical error/risk: the average loss over the training set, (1/n) Σj L(yj, h(xj))

• Training aims to minimize this empirical risk.

• We hope that this also minimizes the error on unseen data, the test error.
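
A small sketch of these quantities in code (the function names are illustrative):

def l2_loss(y, y_hat):
    # least squares loss: (y - h(x))^2
    return (y - y_hat) ** 2

def classification_error(y, y_hat):
    # 0/1 loss: 0 if the prediction matches the label, 1 otherwise
    return 0 if y == y_hat else 1

def empirical_risk(h, data, loss):
    # average loss of hypothesis h over the training examples (x_j, y_j)
    return sum(loss(y, h(x)) for x, y in data) / len(data)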


Overfitting

• Problem: Minimizing empirical risk can lead to overfitting.

• This happens when a model works well on the training data but does not generalize to the test data.

• Data sets can be noisy. An overfitted model may end up modeling the noise in the data.
Preventing Overfitting
• Solutions: Simpler models.

• Reduce the number of features (feature selection).

• Model selection.

• Regularization.

• Cross validation.

• However: Adding wrong assumptions (bias) to the training algorithm can lead to underfitting!
Goodness of Fit
Linear Model

[Figure: a linear unit. The inputs xi0 = 1 (bias), xi1, …, xin are multiplied by weights w0, w1, …, wn, summed (Σ), and passed through a threshold activation function to produce the output.]
Linear Models
• We have chosen a function class (linear separators).

• Specified by parameter w.

• Need to estimate w on the basis of the training set.

• What loss should we use? One option: minimize the classification error, i.e. the fraction of training examples for which h(xj) ≠ yj.
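
A sketch of this setup (assuming a constant feature x0 = 1 so that w0 acts as the bias; names are illustrative):

def h(w, x):
    # linear separator with a threshold activation: sign(w · x)
    s = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if s >= 0 else -1

def classification_error_rate(w, data):
    # fraction of training examples that fall on the wrong side of the hyperplane
    return sum(1 for x, y in data if h(w, x) != y) / len(data)
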
Perceptron Learning
• Problem: The threshold function is not differentiable, so we cannot find a closed-form solution or apply gradient descent.

• Instead, use the iterative perceptron learning algorithm:

• Start with an arbitrary hyperplane.

• Adjust it using the training data.

• Update rule: wj ← wj + α (yi - h(xi)) xij

• The Perceptron Convergence Theorem states that any linearly separable training set can be learned by this algorithm in a finite number of iterations.
Perceptron Learning Algorithm

Input: Training examples (x1, y1), …, (xn, yn)
Output: A perceptron defined by (w0, w1, …, wd)

Initialize wj ← 0, for j = 0…d

while not converged:                 # "converged" means the weights do not change
                                     # for one entire pass through the training data
    shuffle the training examples
    for each training example (xi, yi):
        if output - target != 0:     # output and target do not match
            for each weight wj:
                wj ← wj + α (yi - output) xij
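
A runnable sketch of the same algorithm in Python (the learning rate alpha, the epoch cap, and the folded-in bias input of 1 are illustrative choices):

import random

def train_perceptron(data, alpha=1.0, max_epochs=100):
    # data: list of (x, y) pairs, x a list of d attribute values, y in {-1, +1}
    d = len(data[0][0])
    w = [0.0] * (d + 1)                       # w[0] is the bias weight

    def output(x):
        s = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
        return 1 if s >= 0 else -1

    for _ in range(max_epochs):
        random.shuffle(data)
        changed = False
        for x, y in data:
            o = output(x)
            if o != y:                        # output and target do not match
                w[0] += alpha * (y - o)       # bias input is the constant 1
                for j in range(d):
                    w[j + 1] += alpha * (y - o) * x[j]
                changed = True
        if not changed:                       # no weight changed for a full pass
            break
    return w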


Perceptron
• Simple learning algorithm. Guaranteed to converge after a finite number of steps.

• But only if the data is linearly separable.

[Figure: data points in the (x1, x2) plane that are not linearly separable; a perceptron cannot learn this.]
Feature Functions
• In NLP we often need to make multi-class decisions. Linear models provide only binary decisions.

• Use a feature function φ(x, y), where x is an input object and y is a possible output.

• The values of φ(x, y) are d-dimensional vectors.
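
One common way to build such a feature function (a sketch of a standard construction, not necessarily the one used in the lecture) is to keep one block of base features per possible output, so that a single weight vector w can score every (x, y) pair:

def phi(base_features, y, labels):
    # base_features: a list of d0 feature values describing x
    # returns a vector of length d0 * len(labels) with the base features
    # copied into the block that corresponds to output y
    d0 = len(base_features)
    vec = [0.0] * (d0 * len(labels))
    offset = labels.index(y) * d0
    vec[offset:offset + d0] = base_features
    return vec

def predict(w, base_features, labels):
    # multi-class decision: pick the output whose block scores highest under w
    def score(y):
        return sum(wj * fj for wj, fj in zip(w, phi(base_features, y, labels)))
    return max(labels, key=score)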


Log-Linear Model
(a.k.a. "Maximum Entropy Models")

• Define the conditional probability
P(y | x; w) = exp(w · φ(x, y)) / Σy' exp(w · φ(x, y'))

• exp(z) = e^z is positive for any z, so every output gets a positive score and the scores normalize to a probability distribution.

• But how should we estimate w?
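
A direct sketch of this distribution in code (phi(x, y) and the label set are assumed given; numerical stabilization via log-sum-exp is omitted for brevity):

import math

def p_y_given_x(w, x, y, labels, phi):
    # P(y | x; w) = exp(w · phi(x, y)) / sum over y' of exp(w · phi(x, y'))
    def dot_score(label):
        return sum(wj * fj for wj, fj in zip(w, phi(x, label)))
    numerator = math.exp(dot_score(y))
    denominator = sum(math.exp(dot_score(yp)) for yp in labels)
    return numerator / denominator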


Log-Likelihood
• Define the log-likelihood of a model w on the training data (x1, y1), …, (xn, yn) as
LL(w) = Σi log P(yi | xi; w)

• We want to compute the maximum likelihood estimate
w* = argmaxw LL(w)

• Unfortunately, there is no general analytical solution. We can use gradient-based optimization instead.
Simple Gradient Ascent
Initialize w ← any setting in the parameter (weight) space
for a set number of iterations T:
    for each wi in w:
        update wi to w'i = wi + α ∂LL(w)/∂wi

• Follow the gradients (partial derivatives) to find a parameter setting that maximizes LL(w).

• α > 0 is the learning rate or step size.
Partial Derivative of the Log-Likelihood
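
For the log-linear model defined above, the partial derivative is the difference between observed and expected feature counts:

∂LL(w)/∂wj = Σi [ φj(xi, yi) - Σy P(y | xi; w) φj(xi, y) ]

A sketch of this gradient and the corresponding ascent step in code (reusing the illustrative p_y_given_x from above):

def log_likelihood_gradient(w, data, labels, phi):
    # grad[j] = sum over i of [ phi_j(xi, yi) - sum over y of P(y | xi; w) * phi_j(xi, y) ]
    grad = [0.0] * len(w)
    for x, y in data:
        observed = phi(x, y)
        expected = [0.0] * len(w)
        for yp in labels:
            p = p_y_given_x(w, x, yp, labels, phi)
            for j, fj in enumerate(phi(x, yp)):
                expected[j] += p * fj
        for j in range(len(w)):
            grad[j] += observed[j] - expected[j]
    return grad

def gradient_ascent(w, data, labels, phi, alpha=0.1, iterations=100):
    # simple gradient ascent on LL(w) with a fixed step size alpha
    for _ in range(iterations):
        grad = log_likelihood_gradient(w, data, labels, phi)
        w = [wj + alpha * gj for wj, gj in zip(w, grad)]
    return w
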
Regularization
• Problem: Parameter estimation can overfit the training data.

• Can include a regularization term, for example an L2 regularizer, and maximize
LL(w) - λ ||w||²

• λ > 0 controls the strength of the regularization.

• Since we are maximizing LL(w) - λ ||w||², there is now a trade-off between fit and model 'complexity'.
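
With this squared-norm penalty, each partial derivative in the gradient above simply gains an extra -2λwj term, so the updates pull the weights toward zero while still following the likelihood gradient.
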
POS Tagging with Log-Linear Models
• Previously we used a generative model (HMM) for POS tagging.

• Now we want to use a discriminative model for P(t1, …, tm | w1, …, wm).

• The next tag is conditioned on the previous tag sequence and all observed words.
Maximum Entropy Markov Models (MEMM)
• Make an independence assumption (similar to an HMM):
P(t1, …, tm | w1, …, wm) = Πi P(ti | t1, …, ti-1, w1, …, wm) ≈ Πi P(ti | ti-1, w1, …, wm)

• Each tag's probability depends only on the previous tag (and the observed words).
MEMMs

• Model each term using a log-linear model:
P(ti | ti-1, w1, …, wm) = exp(w · φ(w1…wm, i, ti-1, ti)) / Σt' exp(w · φ(w1…wm, i, ti-1, t'))

• φ is a feature function defined over:

• the observed words w1, …, wm
• the position i of the current word
• the previous tag ti-1
• the suggested tag ti for the current word

• t' is a variable ranging over all possible tags.
MEMMs

• Training: same as for any other log-linear model.

• Decoding: Need to find the tag sequence t1, …, tm that maximizes P(t1, …, tm | w1, …, wm).

• Can use the Viterbi algorithm!
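
A sketch of Viterbi decoding for an MEMM (the function p(prev_tag, words, i, tag), standing for the log-linear term P(ti | ti-1, w1…wm) above, the start symbol "*", and the function names are all illustrative):

import math

def viterbi_memm(words, tags, p):
    # p(prev_tag, words, i, tag) -> P(tag | prev_tag, words, position i)
    m = len(words)
    # pi[i][t]: log-probability of the best tag sequence for words[0..i] ending in tag t
    pi = [{t: float("-inf") for t in tags} for _ in range(m)]
    bp = [{t: None for t in tags} for _ in range(m)]

    for t in tags:                                     # first position: previous tag is "*"
        pi[0][t] = math.log(p("*", words, 0, t))

    for i in range(1, m):
        for t in tags:
            for t_prev in tags:
                score = pi[i - 1][t_prev] + math.log(p(t_prev, words, i, t))
                if score > pi[i][t]:
                    pi[i][t] = score
                    bp[i][t] = t_prev

    best_last = max(tags, key=lambda t: pi[m - 1][t])  # best tag for the last word
    sequence = [best_last]
    for i in range(m - 1, 0, -1):                      # follow the back-pointers
        sequence.append(bp[i][sequence[-1]])
    return list(reversed(sequence))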


Feature Function
(Ratnaparkhi, 1996)

• φ(w1…wm, i, ti-1, ti) is a feature vector of length d. Its components include indicator features for:

• (wi, ti), (wi-1, ti), (wi-2, ti), (wi+1, ti), (wi+2, ti)

• (ti-1, ti)

• (wi contains numbers, ti)
(wi contains uppercase characters, ti)
(wi contains a hyphen, ti)

• (prefix1 of wi, ti), (prefix2 of wi, ti), (prefix3 of wi, ti), (prefix4 of wi, ti)
(suffix1 of wi, ti), (suffix2 of wi, ti), (suffix3 of wi, ti), (suffix4 of wi, ti)
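
A sketch of this feature function as a set of string-valued indicator features (the feature name strings and the function name are illustrative):

def pos_features(words, i, t_prev, t):
    # returns the names of the binary features that fire when tagging words[i] with t
    w = words[i]
    feats = ["word=%s+tag=%s" % (w, t),
             "prevtag=%s+tag=%s" % (t_prev, t)]
    for offset in (-2, -1, 1, 2):                       # neighboring words, if present
        if 0 <= i + offset < len(words):
            feats.append("word[%+d]=%s+tag=%s" % (offset, words[i + offset], t))
    if any(c.isdigit() for c in w):                     # spelling features
        feats.append("has-number+tag=%s" % t)
    if any(c.isupper() for c in w):
        feats.append("has-uppercase+tag=%s" % t)
    if "-" in w:
        feats.append("has-hyphen+tag=%s" % t)
    for k in range(1, 5):                               # prefixes and suffixes of length 1..4
        if len(w) >= k:
            feats.append("prefix%d=%s+tag=%s" % (k, w[:k], t))
            feats.append("suffix%d=%s+tag=%s" % (k, w[-k:], t))
    return feats
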
Feature Example
The stories about well-heeled communities and developers ...
DT NNS IN ??

• (well-heeled, JJ), (about, JJ), (stories, JJ), (communities, JJ), (and, JJ)

• (IN, JJ)

• (wi contains a hyphen, JJ)

• (w, JJ), (we, JJ), (wel, JJ), (well, JJ)
(d, JJ), (ed, JJ), (led, JJ), (eled, JJ)
