Lecture 13 - ML Linear & Log-Linear Models


Natural Language Processing
Lecture 13:
Machine Learning: Linear and Log-Linear Models

12/4/2019

COMS W4705
Yassine Benajiba
Intro
Machine Learning and NLP
• We have encountered many different situations where we had to make a prediction:

• Text classification, language modeling, POS tagging, constituency/dependency parsing, ...

• These are all classification problems of some form.

• Today: Some machine learning background. Linear/log-linear models. Basic neural networks.
Generative Algorithms
• Assume the observed data is being “generated” by a “hidden” class label.

• Build a different model for each class.

• To predict a new example, check it under each of the models and see which one matches best.

• Model P(x|y) and P(y). Then use Bayes' rule: P(y|x) = P(x|y) P(y) / P(x)
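
A minimal sketch of prediction with a generative model (the function names, and the assumption that per-class models and priors have already been estimated, are illustrative and not from the slides):

def predict_generative(x, class_models, priors):
    # class_models[y](x) returns the estimated P(x | y) for class y;
    # priors[y] is the estimated P(y).
    # By Bayes' rule, P(y | x) is proportional to P(x | y) * P(y),
    # so the shared denominator P(x) can be ignored for the argmax.
    scores = {y: class_models[y](x) * priors[y] for y in class_models}
    return max(scores, key=scores.get)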


Discriminative Algorithms

• Model the conditional distribution of the label given the data, P(y|x).

• Learn decision boundaries that separate instances of the different classes.

• To predict a new example, check on which side of the decision boundary it falls.
Machine Learning Definition
• “Creating systems that improve from experience.”

• “A computer program is said to learn from experience E


with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E.”
(Tom Mitchell, Machine Learning, 1997)
Inductive Learning (a.k.a. Science)

• Goal: given a set of input/output pairs (training data), find the function f(x) that maps inputs to outputs.
Problem: We did not see all possible inputs!

• Learn an approximate function h(x) from the training data and hope that this function generalizes well to unseen inputs.

• Ockham’s razor: Choose the simplest hypothesis that is consistent with the training data.
Classification and Regression
• Recall: In supervised learning, the training data consists of training examples (x1, y1), …, (xn, yn), where xj is an input example (a d-dimensional vector of attribute values) and yj is the label.

• Two types of supervised learning problems:

• In classification: yj comes from a finite, discrete set, typically yj ∈ {-1, +1}, i.e. predict a label from a set of labels.
Learn a classifier function h: ℝd → {-1, +1}.

• In regression: xj ∈ ℝd, yj ∈ ℝ, i.e. predict a numeric value.
Learn a regressor function h: ℝd → ℝ.
Linear Classification and Regression

[Figure: a regression example plotting h(x) against x1, and a classification example in the (x1, x2) plane with a linear decision boundary.]
Linear Classification
Training ML models
Training Data → ML algorithm → function h(x) = y

• How can we be confident about the learned function?

• Can compute the empirical error/risk on the training set: the average loss of h over the training examples.

• Typical loss functions:

• Least squares loss (L2): L(yj, h(xj)) = (yj - h(xj))²

• Classification error: L(yj, h(xj)) = 0 if h(xj) = yj, and 1 otherwise
Training ML models
Training Data → ML algorithm → function h(x) = y

• Empirical error/risk: the average loss over the training set, (1/n) Σj L(yj, h(xj))

• Training aims to minimize this empirical risk.

• We hope that this also minimizes the error on unseen data, the test error.
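
A small sketch of these quantities in code (the function names are illustrative):

def l2_loss(y, y_hat):
    # least squares loss: (y - h(x))^2
    return (y - y_hat) ** 2

def classification_error(y, y_hat):
    # 0/1 loss: 0 if the prediction matches the label, 1 otherwise
    return 0 if y == y_hat else 1

def empirical_risk(h, data, loss):
    # average loss of hypothesis h over the training examples (x_j, y_j)
    return sum(loss(y, h(x)) for x, y in data) / len(data)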


Overfitting

• Problem: Minimizing empirical risk can lead to overfitting.

• This happens when a model works well on the training data but does not generalize to the test data.

• Data sets can be noisy. An overfitted model may end up modeling the noise in the data.
Preventing Overfitting
• Solutions: Simpler models.

• Reduce the number of features (feature selection).

• Model selection.

• Regularization.

• Cross validation.

• However: Adding wrong assumptions (bias) to the training algorithm can lead to underfitting!
Goodness of Fit
Linear Model

[Figure: a linear unit. The inputs xi0 = 1 (bias), xi1, …, xin are multiplied by weights w0, w1, …, wn, summed (Σ), and passed through a threshold activation function to produce the output.]
Linear Models
• We have chosen a function class (linear separators).

• Specified by parameter w.

• Need to estimate w on the basis of the training set.

• What loss should we use? One option: minimize the classification error, i.e. the fraction of training examples for which h(xj) ≠ yj.
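
A sketch of this setup (assuming a constant feature x0 = 1 so that w0 acts as the bias; names are illustrative):

def h(w, x):
    # linear separator with a threshold activation: sign(w · x)
    s = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if s >= 0 else -1

def classification_error_rate(w, data):
    # fraction of training examples that fall on the wrong side of the hyperplane
    return sum(1 for x, y in data if h(w, x) != y) / len(data)
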
Perceptron Learning
• Problem: The threshold function is not differentiable, so we cannot find a closed-form solution or apply gradient descent.

• Instead, use the iterative perceptron learning algorithm:

• Start with an arbitrary hyperplane.

• Adjust it using the training data.

• Update rule: wj ← wj + α (yi - h(xi)) xij

• The Perceptron Convergence Theorem states that any linearly separable training set can be learned by this algorithm in a finite number of iterations.
Perceptron Learning Algorithm

Input: Training examples (x1, y1), …, (xn, yn)
Output: A perceptron defined by (w0, w1, …, wd)

Initialize wj ← 0, for j = 0…d

while not converged:                 # "converged" means the weights do not change
                                     # for one entire pass through the training data
    shuffle the training examples
    for each training example (xi, yi):
        if output - target != 0:     # output and target do not match
            for each weight wj:
                wj ← wj + α (yi - output) xij
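
A runnable sketch of the same algorithm in Python (the learning rate alpha, the epoch cap, and the folded-in bias input of 1 are illustrative choices):

import random

def train_perceptron(data, alpha=1.0, max_epochs=100):
    # data: list of (x, y) pairs, x a list of d attribute values, y in {-1, +1}
    d = len(data[0][0])
    w = [0.0] * (d + 1)                       # w[0] is the bias weight

    def output(x):
        s = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
        return 1 if s >= 0 else -1

    for _ in range(max_epochs):
        random.shuffle(data)
        changed = False
        for x, y in data:
            o = output(x)
            if o != y:                        # output and target do not match
                w[0] += alpha * (y - o)       # bias input is the constant 1
                for j in range(d):
                    w[j + 1] += alpha * (y - o) * x[j]
                changed = True
        if not changed:                       # no weight changed for a full pass
            break
    return w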


Perceptron
• Simple learning algorithm. Guaranteed to converge after a finite number of steps.

• But only if the data is linearly separable.

[Figure: data points in the (x1, x2) plane that are not linearly separable; a perceptron cannot learn this.]
Feature Functions
• In NLP we often need to make multi-class decisions. Linear models provide only binary decisions.

• Use a feature function φ(x, y), where x is an input object and y is a possible output.

• The values of φ(x, y) are d-dimensional vectors.
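
One common way to build such a feature function (a sketch of a standard construction, not necessarily the one used in the lecture) is to keep one block of base features per possible output, so that a single weight vector w can score every (x, y) pair:

def phi(base_features, y, labels):
    # base_features: a list of d0 feature values describing x
    # returns a vector of length d0 * len(labels) with the base features
    # copied into the block that corresponds to output y
    d0 = len(base_features)
    vec = [0.0] * (d0 * len(labels))
    offset = labels.index(y) * d0
    vec[offset:offset + d0] = base_features
    return vec

def predict(w, base_features, labels):
    # multi-class decision: pick the output whose block scores highest under w
    def score(y):
        return sum(wj * fj for wj, fj in zip(w, phi(base_features, y, labels)))
    return max(labels, key=score)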


Log-Linear Model
(a.k.a. "Maximum Entropy Models")

• Define the conditional probability
P(y | x; w) = exp(w · φ(x, y)) / Σy' exp(w · φ(x, y'))

• exp(z) = e^z is positive for any z, so every output gets a positive score and the scores normalize to a probability distribution.

• But how should we estimate w?
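
A direct sketch of this distribution in code (phi(x, y) and the label set are assumed given; numerical stabilization via log-sum-exp is omitted for brevity):

import math

def p_y_given_x(w, x, y, labels, phi):
    # P(y | x; w) = exp(w · phi(x, y)) / sum over y' of exp(w · phi(x, y'))
    def dot_score(label):
        return sum(wj * fj for wj, fj in zip(w, phi(x, label)))
    numerator = math.exp(dot_score(y))
    denominator = sum(math.exp(dot_score(yp)) for yp in labels)
    return numerator / denominator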


Log-Likelihood
• Define the log-likelihood of a model w on the training data (x1, y1), …, (xn, yn) as
LL(w) = Σi log P(yi | xi; w)

• We want to compute the maximum likelihood estimate
w* = argmaxw LL(w)

• Unfortunately, there is no general analytical solution. We can use gradient-based optimization instead.
Simple Gradient Ascent
Initialize w ← any setting in the parameter (weight) space
for a set number of iterations T:
    for each wi in w:
        update wi to w'i = wi + α ∂LL(w)/∂wi

• Follow the gradients (partial derivatives) to find a parameter setting that maximizes LL(w).

• α > 0 is the learning rate or step size.
Partial Derivative of the Log-Likelihood
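
For the log-linear model defined above, the partial derivative is the difference between observed and expected feature counts:

∂LL(w)/∂wj = Σi [ φj(xi, yi) - Σy P(y | xi; w) φj(xi, y) ]

A sketch of this gradient and the corresponding ascent step in code (reusing the illustrative p_y_given_x from above):

def log_likelihood_gradient(w, data, labels, phi):
    # grad[j] = sum over i of [ phi_j(xi, yi) - sum over y of P(y | xi; w) * phi_j(xi, y) ]
    grad = [0.0] * len(w)
    for x, y in data:
        observed = phi(x, y)
        expected = [0.0] * len(w)
        for yp in labels:
            p = p_y_given_x(w, x, yp, labels, phi)
            for j, fj in enumerate(phi(x, yp)):
                expected[j] += p * fj
        for j in range(len(w)):
            grad[j] += observed[j] - expected[j]
    return grad

def gradient_ascent(w, data, labels, phi, alpha=0.1, iterations=100):
    # simple gradient ascent on LL(w) with a fixed step size alpha
    for _ in range(iterations):
        grad = log_likelihood_gradient(w, data, labels, phi)
        w = [wj + alpha * gj for wj, gj in zip(w, grad)]
    return w
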
Regularization
• Problem: Parameter estimation can overfit the training data.

• Can include a regularization term, for example an L2 regularizer, and maximize
LL(w) - λ ||w||²

• λ > 0 controls the strength of the regularization.

• Since we are maximizing LL(w) - λ ||w||², there is now a trade-off between fit and model 'complexity'.
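
With this squared-norm penalty, each partial derivative in the gradient above simply gains an extra -2λwj term, so the updates pull the weights toward zero while still following the likelihood gradient.
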
POS Tagging with Log-Linear Models
• Previously we used a generative model (HMM) for POS tagging.

• Now we want to use a discriminative model for P(t1, …, tm | w1, …, wm).

• The next tag is conditioned on the previous tag sequence and all observed words.
Maximum Entropy Markov Models (MEMM)
• Make an independence assumption (similar to an HMM):
P(t1, …, tm | w1, …, wm) = Πi P(ti | t1, …, ti-1, w1, …, wm) ≈ Πi P(ti | ti-1, w1, …, wm)

• Each tag's probability depends only on the previous tag (and the observed words).
MEMMs

• Model each term using a log-linear model:
P(ti | ti-1, w1, …, wm) = exp(w · φ(w1…wm, i, ti-1, ti)) / Σt' exp(w · φ(w1…wm, i, ti-1, t'))

• φ is a feature function defined over:

• the observed words w1, …, wm
• the position i of the current word
• the previous tag ti-1
• the suggested tag ti for the current word

• t' is a variable ranging over all possible tags.
MEMMs

• Training: same as for any other log-linear model.

• Decoding: Need to find the tag sequence t1, …, tm that maximizes P(t1, …, tm | w1, …, wm).

• Can use the Viterbi algorithm!
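
A sketch of Viterbi decoding for an MEMM (the function p(prev_tag, words, i, tag), standing for the log-linear term P(ti | ti-1, w1…wm) above, the start symbol "*", and the function names are all illustrative):

import math

def viterbi_memm(words, tags, p):
    # p(prev_tag, words, i, tag) -> P(tag | prev_tag, words, position i)
    m = len(words)
    # pi[i][t]: log-probability of the best tag sequence for words[0..i] ending in tag t
    pi = [{t: float("-inf") for t in tags} for _ in range(m)]
    bp = [{t: None for t in tags} for _ in range(m)]

    for t in tags:                                     # first position: previous tag is "*"
        pi[0][t] = math.log(p("*", words, 0, t))

    for i in range(1, m):
        for t in tags:
            for t_prev in tags:
                score = pi[i - 1][t_prev] + math.log(p(t_prev, words, i, t))
                if score > pi[i][t]:
                    pi[i][t] = score
                    bp[i][t] = t_prev

    best_last = max(tags, key=lambda t: pi[m - 1][t])  # best tag for the last word
    sequence = [best_last]
    for i in range(m - 1, 0, -1):                      # follow the back-pointers
        sequence.append(bp[i][sequence[-1]])
    return list(reversed(sequence))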


Feature Function
(Ratnaparkhi, 1996)

• φ(w1…wm, i, ti-1, ti) is a feature vector of length d. Its components include indicator features for:

• (wi, ti), (wi-1, ti), (wi-2, ti), (wi+1, ti), (wi+2, ti)

• (ti-1, ti)

• (wi contains numbers, ti)
(wi contains uppercase characters, ti)
(wi contains a hyphen, ti)

• (prefix1 of wi, ti), (prefix2 of wi, ti), (prefix3 of wi, ti), (prefix4 of wi, ti)
(suffix1 of wi, ti), (suffix2 of wi, ti), (suffix3 of wi, ti), (suffix4 of wi, ti)
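
A sketch of this feature function as a set of string-valued indicator features (the feature name strings and the function name are illustrative):

def pos_features(words, i, t_prev, t):
    # returns the names of the binary features that fire when tagging words[i] with t
    w = words[i]
    feats = ["word=%s+tag=%s" % (w, t),
             "prevtag=%s+tag=%s" % (t_prev, t)]
    for offset in (-2, -1, 1, 2):                       # neighboring words, if present
        if 0 <= i + offset < len(words):
            feats.append("word[%+d]=%s+tag=%s" % (offset, words[i + offset], t))
    if any(c.isdigit() for c in w):                     # spelling features
        feats.append("has-number+tag=%s" % t)
    if any(c.isupper() for c in w):
        feats.append("has-uppercase+tag=%s" % t)
    if "-" in w:
        feats.append("has-hyphen+tag=%s" % t)
    for k in range(1, 5):                               # prefixes and suffixes of length 1..4
        if len(w) >= k:
            feats.append("prefix%d=%s+tag=%s" % (k, w[:k], t))
            feats.append("suffix%d=%s+tag=%s" % (k, w[-k:], t))
    return feats
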
Feature Example
The stories about well-heeled communities and developers ...
DT NNS IN ??

• (well-heeled, JJ), (about, JJ), (stories, JJ), (communities, JJ), (and, JJ)

• (IN, JJ)

• (wi contains a hyphen, JJ)

• (w, JJ), (we, JJ), (wel, JJ), (well, JJ)
(d, JJ), (ed, JJ), (led, JJ), (eled, JJ)
