
Natural Language Processing
Lecture 14: Machine Learning: Feed-forward Neural Networks,
Autoencoders/Embeddings, Dense Networks

12/7/2019

COMS W4705
Yassine Benajiba
Perceptron Expressiveness
• The simple perceptron learning algorithm starts with an arbitrary hyperplane and adjusts it using the training data.

• The step function is not differentiable, so there is no closed-form solution.

• The perceptron produces a linear separator.

• It can only learn linearly separable patterns.

• It can represent boolean functions like and, or, and not, but not the xor function.
The problem with xor
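As a quick numerical illustration of this point (a minimal sketch assuming scikit-learn is available; not part of the original slides), a perceptron cannot fit the xor function:

```python
import numpy as np
from sklearn.linear_model import Perceptron

# XOR: no hyperplane separates the positive and negative points.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

clf = Perceptron(max_iter=1000).fit(X, y)
print(clf.score(X, y))  # stays below 1.0 because XOR is not linearly separable
```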
Multi-Layer Neural Networks

[Figure: a network with an input layer, a hidden layer, and an output layer]

• Basic idea: represent any (non-linear) function as a composition of soft-threshold functions. This is a form of non-linear regression.

• Lippmann 1987: two hidden layers suffice to represent any arbitrary region (provided enough neurons), even discontinuous functions!
Activation Functions
• One problem with perceptrons is that the threshold (step) function is not differentiable.

• It is therefore unsuitable for gradient descent.

• One alternative is the sigmoid (logistic) function:

g(z) = 1 / (1 + e^(-z))

g(z) → 0 as z → -∞
g(z) → 1 as z → ∞
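A minimal NumPy sketch of the sigmoid and its limits (not part of the original slides); the derivative is included because gradient descent needs it:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative g'(z) = g(z) * (1 - g(z)), which gradient descent relies on."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Saturates at 0 for very negative z and at 1 for very positive z.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.9999546]
```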
Activation Functions
• Two other popular activation functions:
Output Representation
• Many NLP problems are multi-class classification problems.

• Each output neuron represents one class. Predict the class with the highest activation.

[Figure: example output activations — y0: 0.9, y1: 0.1, y2: 0.7, y3: 0.4; the predicted class is y0]
Softmax
• We often want the activations at the output layer to represent probabilities.

• Exponentiate the activation of each output unit and normalize by the sum of all exponentiated activations (as in log-linear models).

[Figure: the unnormalized activations z0: 0.9, z1: 0.1, z2: 0.7, z3: 0.4 become, after softmax, z0: 0.35, z1: 0.16, z2: 0.28, z3: 0.21 — the network now computes a probability distribution.]
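A minimal NumPy sketch (not from the slides) that reproduces the numbers in the example above:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([0.9, 0.1, 0.7, 0.4])
print(softmax(z).round(2))  # [0.35 0.16 0.28 0.21]
```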
Learning in Multi-Layer Neural Networks
• The network structure is fixed, but we want to train the weights. Assume feed-forward neural networks: no connections that form loops.

• Backpropagation algorithm:

• Given the current weights, compute the network output and the loss function (assume multiple outputs / a vector of outputs).

• Use gradient descent to update the weights and minimize the loss.

• Problem: We only know how to do this for the last layer!

• Idea: Propagate the error backwards through the network.


Backpropagation

[Figure: feed-forward computation of the network outputs — the input vector x = (x1, ..., x4) passes through the input layer, hidden layer, and output layer to produce the output vector h_w(x) = (a1, a2), which the error function E_train(w) compares against the target vector y; the error gradients are then propagated backwards through the network.]

Negative Log-Likelihood (also known as cross-entropy)

• Assume the target output is a one-hot vector and c(y) is the target class for target y.

• Compute the negative log-likelihood for a single example.

• Empirical error for the entire training data:
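The loss formulas on this slide did not survive extraction; the following is a standard reconstruction under the notation above (where ŷ denotes the network's softmax output, and the 1/N averaging is an assumption):

```latex
% Negative log-likelihood (cross-entropy) for a single example (x, y):
L(\mathbf{w}) = -\log \hat{y}_{c(y)}

% Empirical error over the N training examples (shown here as an average):
E_{\text{train}}(\mathbf{w}) = \frac{1}{N} \sum_{j=1}^{N} -\log \hat{y}^{(j)}_{c(y^{(j)})}
```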


Stochastic Gradient Descent (for a single unit)
• Goal: Learn parameters that minimize the empirical error.

  Randomly initialize w
  for a set number of iterations T:
      shuffle the training data
      for j = 1...N:
          for each weight wi in the network:
              take a gradient step on the loss of example j with respect to wi

• The learning rate controls the size of each update step.

• It often makes sense to compute the gradient over batches of examples instead of just one ("mini-batch").
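A minimal sketch of this loop (an illustration assuming a single softmax output layer trained with cross-entropy; the variable names and the update wi ← wi − eta · ∂L/∂wi are spelled out here, not taken verbatim from the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sgd(X, y, num_classes, T=10, eta=0.1, rng=np.random.default_rng(0)):
    """SGD for a single softmax layer with cross-entropy loss."""
    N, d = X.shape
    W = rng.normal(scale=0.01, size=(num_classes, d))  # randomly initialize w
    for _ in range(T):                                 # fixed number of iterations T
        for j in rng.permutation(N):                   # shuffle the training data
            probs = softmax(W @ X[j])                  # forward pass
            grad = np.outer(probs, X[j])               # gradient of -log p(y_j | x_j)
            grad[y[j]] -= X[j]
            W -= eta * grad                            # wi <- wi - eta * dL/dwi
    return W
```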
Backpropagation
• Simplified multi-layer case (a single unit per layer):

[Figure: x → g (weight w1) → g(x) → f (weight w2) → f(g(x)) → Loss]

• Stochastic gradient descent should update each weight by taking a step along the negative gradient of the loss.

• Problem: How do we compute the gradients for parameters w1 and w2?
Chain Rule of Calculus

• To compute gradients for hidden units, we need to apply the chain rule of calculus:

The derivative of f(g(x)) with respect to x is f'(g(x)) · g'(x).
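As a concrete instance (an illustration with assumed functions, not from the slides), take f(z) = z² and g(w) = wx and differentiate with respect to the weight w:

```latex
% Example: f(z) = z^2 (outer function), g(w) = w x (a weighted input).
\frac{\partial}{\partial w} f(g(w)) = f'(g(w)) \cdot g'(w) = 2(wx) \cdot x
```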
Backpropagation

[Figure: x → f (weight w1) → f(x) → g (weight w2) → g(f(x)) → Loss]
Backpropagation

[Figure: forward pass — ... → x → f (with weight w) → f(x) → ... → Loss; backward pass — the error gradient flows back through f.]

• Assume we know the gradient of the loss with respect to f(x), arriving from the layers after f.

• We want to compute the gradient with respect to x, to propagate it back,

• and the gradient with respect to w (for the weight update).


Backpropagation

[Figure: the same forward/backward diagram — ... → x → f (with weight w) → f(x) → ... → Loss]

• To compute these gradients, we have to know the derivative of the function f.
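A minimal numerical sketch of this backward pass for the single-unit-per-layer chain shown earlier (x → g → f → Loss); the choices g = tanh, f = identity, and a squared-error loss are assumptions made only for illustration:

```python
import numpy as np

# Chain from the earlier slide: x --(w1)--> g --(w2)--> f --> Loss.
def forward_backward(x, y, w1, w2):
    # Forward pass: keep the intermediate values, the backward pass reuses them.
    a = w1 * x                 # input to the hidden unit
    h = np.tanh(a)             # g(x)
    out = w2 * h               # f(g(x)) with f = identity
    loss = 0.5 * (out - y) ** 2

    # Backward pass: apply the chain rule, propagating the gradient back.
    d_out = out - y                       # dLoss/d out
    d_w2 = d_out * h                      # dLoss/d w2 (update for the output weight)
    d_h = d_out * w2                      # gradient propagated back to the hidden unit
    d_a = d_h * (1.0 - np.tanh(a) ** 2)   # multiply by the derivative of g = tanh
    d_w1 = d_a * x                        # dLoss/d w1 (update for the hidden weight)
    return loss, d_w1, d_w2

loss, d_w1, d_w2 = forward_backward(x=1.0, y=0.5, w1=0.3, w2=-0.2)
```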
Autoencoders
Embeddings
(Word level semantics)
Skip-Gram Model
• Input: a single word in one-hot representation.

• Output: the probability of seeing any single word as a context word.

[Figure: the one-hot input for "eat" (|V| neurons) feeds d hidden neurons, which feed |V| output neurons with softmax activation; example output probabilities — a: 0.02, thought: 0.0, cheese: 0.04, place: 0.03, run: 0.0]

• The softmax function normalizes the activations of the output neurons to sum up to 1.0.
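A minimal sketch of this forward pass in NumPy (the matrix names, sizes, and the word index are assumptions, not from the slides):

```python
import numpy as np

V, d = 10000, 300                            # vocabulary size and hidden layer size
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, d))   # input -> hidden weights (the embeddings)
W_out = rng.normal(scale=0.01, size=(d, V))  # hidden -> output weights

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def skipgram_forward(word_index):
    """One-hot input -> hidden layer -> softmax over all |V| context words."""
    h = W_in[word_index]          # multiplying a one-hot vector just selects a row
    return softmax(h @ W_out)     # probability of each word appearing as context

probs = skipgram_forward(word_index=42)       # hypothetical index for "eat"
```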
Skip-Gram Model
• Compute the error with respect to each context word.

[Figure: for the sentence "...a place to eat delicious cheese.", the target word w_t = "eat" and its context words w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c} yield the training pairs (eat, place), (eat, to), (eat, delicious), (eat, cheese)]

• Combine the errors for each context word, then use the combined error to update the weights using back-propagation.
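A minimal sketch of how such training pairs can be generated from a sentence (the window size and whitespace tokenization are assumptions):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word and its surrounding window."""
    for t, target in enumerate(tokens):
        for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if c != t:
                yield (target, tokens[c])

sentence = "a place to eat delicious cheese".split()
# For the target "eat" this yields (eat, place), (eat, to), (eat, delicious), (eat, cheese).
pairs = [p for p in skipgram_pairs(sentence) if p[0] == "eat"]
```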
Continuous Bag-of-Words Model (CBOW)

[Figure: the context words w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c} are summed and averaged in the hidden layer to predict the target word w_t]

• Input: context words, averaged in the hidden layer.

• Output: probability that each word is the target word.
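A minimal, self-contained sketch of the CBOW forward pass (the matrix names, sizes, and word indices are assumptions for illustration):

```python
import numpy as np

V, d = 10000, 300
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, d))   # word embeddings (input -> hidden)
W_out = rng.normal(scale=0.01, size=(d, V))  # hidden -> output weights

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cbow_forward(context_indices):
    """Average the context word embeddings, then predict the target word."""
    h = W_in[context_indices].mean(axis=0)   # context words averaged in the hidden layer
    return softmax(h @ W_out)                # probability that each word is the target

probs = cbow_forward([7, 12, 99, 101])       # hypothetical indices of the context words
```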


Embeddings are Magic
(Mikolov 2016)

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)


Application: Word Pair Relationships

Using Word Embeddings
• Word2Vec:
  • https://code.google.com/archive/p/word2vec/
• GloVe: Global Vectors for Word Representation
  • https://nlp.stanford.edu/projects/glove/
• Can either use pre-trained word embeddings or train them on a large corpus.
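A minimal sketch of using pre-trained embeddings (gensim and the specific downloader model name are assumptions, not mentioned on the slides):

```python
import gensim.downloader as api

# Load pre-trained GloVe vectors (model name assumed to exist in gensim's downloader).
vectors = api.load("glove-wiki-gigaword-100")

# Cosine similarity between two words.
print(vectors.similarity("king", "queen"))

# The word-pair relationship from the earlier slide:
# vector('king') - vector('man') + vector('woman') ~ vector('queen')
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```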
Word embeddings

[Figure: the same skip-gram network as above — one-hot input for "eat" (|V| neurons), d hidden neurons, |V| output neurons with softmax activation]
Word embeddings

Pros
- Groups semantically similar words together
- A simple way to measure similarity
- A good way to deal with words unseen in the training data

Cons
- Doesn't distinguish between function words and content words
- Only one representation for polysemous words
- Semantic dimensions are not interpretable

How can we build a sentence representation using word-level distributional representations?
Acknowledgments
• Some slides by Chris Kedzie
