Lec 01


CSC413 Lecture 1: Introduction

Jimmy Ba and Bo Wang

Jimmy Ba and Bo Wang CSC413 Lecture 1: Introduction 1 / 68


Course information

Second course in machine learning, with a focus on neural networks


CSC413 is an advanced machine learning course following CSC411 with
an in-depth focus on cutting-edge topics
Assumes knowledge of basic ML algorithms: linear regression, logistic
regression, maximum likelihood, PCA, EM, etc.
First 2/3: supervised learning
Last 1/3: unsupervised learning and reinforcement learning
Four sections
Equivalent content, same assignments and midterms



Course information

Formal prerequisites or equivalent in the tri-campus system:


Multivariable Calculus: MAT235/MAT237/MAT257/equivalent
Linear Algebra: MAT221H1/MAT223H1/MAT240H1/equivalent
Machine Learning: CSC311/STA314/ECE421/ROB313/equivalent
Prerequisites will be enforced, including for grad students. See details
on the FAS calendar.



Course information

Expectations and marking (undergrads)


4 Assignments (60% of total mark)
Due Friday nights at 11:59pm
First homework will be out 1/15, due 2/03
Written part: 2-3 conceptual questions
Programming part: 10-15 lines of Python code using PyTorch
Exams
Midterm quiz (open-book) (10%)
Final project (30%)
See Course Information handout for detailed policies
Important policy: A minimum of 3 out of 4 assignments must be
submitted on time (with grace days) to pass the course.
Every student has a total of 7 grace days to extend the coursework
deadlines through the semester.



Course information

Expectations and marking (grad students)


Same as undergrads:
Assignments: 60%
Final project: 40%
See Course Information handout for detailed policies
Waitlists expire ≈1 week after the course starts.
After that, students are responsible for trying to enroll in the course on
ACORN in a first-come, first-serve fashion.



The Good, the Bad and the Ugly

This course will be “easy” but there are rules.


Grace days: every student has a total of 7 grace days to extend the
coursework deadlines through the semester.
Late penalty: 10% per day up to 3 days. i.e. maximum of 100%,
90%, 80%, 70%, 0% within 72 hours after the deadline.
Completion requirement: a minimum of 3 out of 4 assignments must
be submitted on time (with grace days) to pass the course.
Collaboration: individual work except for the final project.
Our guarantees to you: this will be one of the most unique learning
experiences at UofT. But, you will have to put in a lot of work.



Resources and communication

A lot of learning in this course happens outside of the classroom.


We are fully committed to helping you succeed:
Piazza: main platform for fast, asynchronous, collaborative learning.
Support ticketing system: default mailing list.
Office hours: in-person interaction.
Video lectures: catch up on the missing bits.



Compute

Colab (Mandatory) Programming assignments are to be completed
in Google Colab, which is a web-based IPython Notebook service that
has access to a free Nvidia K80 GPU per Google account.
GCE (Recommended for course projects) Google Compute Engine
delivers virtual machines running in Google’s data center.
OpenAI API A variety of language models, e.g., GPT-3 and ChatGPT,
that you will encounter and try out in the course. We recommend
registering early and applying for Codex beta access.
See Course Information handout for the details



Course information

Course web page: http://uoft-csc413.github.io/2023/

Includes detailed course information handout:


https://uoft-csc413.github.io/2023/assets/misc/syllabus.pdf


Hello world

Final project: you must form a group of two or three.


Now, everyone, please stand up.


What is machine learning?

For many problems, it’s difficult to program the correct behavior by hand
recognizing people and objects
understanding human speech from audio files
Machine learning approach: program an algorithm to automatically
learn from data, or from experience
Some reasons you might want to use a learning algorithm:
hard to code up a solution by hand (e.g. vision, natural language
processing)
system needs to adapt to a changing environment (e.g. spam detection)
want the system to perform better than the human programmers
privacy/fairness (e.g. ranking search results)



What is machine learning?

Types of machine learning


Supervised learning: have labeled examples of the correct behavior,
i.e. ground truth input/output response
Reinforcement learning: learning system receives a reward signal,
tries to learn to maximize the reward signal
Unsupervised learning: no labeled examples – instead, looking for
interesting patterns in the data



What are neural networks?

Most of the biological details aren’t essential, so we use vastly
simplified models of neurons.
While neural nets originally drew inspiration from the brain, nowadays
we mostly think about math, statistics, etc.

Neural networks are collections of thousands (or millions) of these
simple processing units that together perform useful computations.



What are neural networks?

Why neural nets?


inspiration from the brain
proof of concept that a neural architecture can see and hear!
very effective across a range of applications (vision, text, speech,
medicine, robotics, etc.)
widely used in both academia and the tech industry
powerful software frameworks (PyTorch, TensorFlow, etc.) let us
quickly implement sophisticated algorithms



What are neural networks?

Some near-synonyms for neural networks


“Deep learning”
Emphasizes that the algorithms often involve hierarchies with many
stages of processing



“Deep learning”

Deep learning: many layers (stages) of processing


E.g. this network which recognizes objects in images:

(Krizhevsky et al., 2012)

[Figure 2 of Krizhevsky et al.: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities
between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts
at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and
the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–
4096–4096–1000.]

Each of the boxes consists of many neuron-like units similar to the one on
the previous slide!



“Deep learning”
You can visualize what a learned feature is responding to by finding
an image that excites it. (We’ll see how to do this.)
Higher layers in the network often learn higher-level, more
interpretable representations

https://distill.pub/2017/feature-visualization/



What is a representation?

How you represent your data determines what questions are easy to
answer.
E.g. a dict of word counts is good for questions like “What is the most
common word in Hamlet?”
It’s not so good for semantic questions like “if Alice liked Harry Potter,
will she like The Hunger Games?”



What is a representation?
Idea: represent words as vectors

[Figure: t-SNE visualization of word vectors]


What is a representation?

Mathematical relationships between vectors encode semantic
relationships between words
Measure semantic similarity using the dot product (or dissimilarity
using Euclidean distance)
Represent a web page with the average of its word vectors
Complete analogies by doing arithmetic on word vectors
e.g. “Paris is to France as London is to ___”
France – Paris + London = ___
It’s very hard to construct representations like these by hand, so we
need to learn them from data
This is a big part of what neural nets do, whether it’s supervised,
unsupervised, or reinforcement learning!
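
As a toy illustration of arithmetic on word vectors, the sketch below uses hand-made 4-dimensional vectors. These vectors are an assumption for illustration only: real word vectors are learned from data and typically have hundreds of dimensions. The hypothetical helper `complete_analogy` picks the remaining word with the highest cosine similarity to the query vector.

```python
import numpy as np

# Hypothetical hand-made word vectors, for illustration only.
# Dimensions: (is_capital, france, uk, germany)
vectors = {
    "Paris":   np.array([1.0, 1.0, 0.0, 0.0]),
    "France":  np.array([0.0, 1.0, 0.0, 0.0]),
    "London":  np.array([1.0, 0.0, 1.0, 0.0]),
    "England": np.array([0.0, 0.0, 1.0, 0.0]),
    "Berlin":  np.array([1.0, 0.0, 0.0, 1.0]),
    "Germany": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine_similarity(u, v):
    """Dot product of u and v after normalizing both to unit length."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(a, b, c, vectors):
    """Return the word d such that 'a is to b as c is to d'."""
    query = vectors[b] - vectors[a] + vectors[c]
    # Exclude the three words that appear in the question itself.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

answer = complete_analogy("Paris", "France", "London", vectors)
```

With learned embeddings the query rarely lands exactly on a word vector, which is why the nearest neighbour is taken rather than an exact match.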


Supervised learning examples

Supervised learning: have labeled examples of the correct behavior

e.g. Handwritten digit classification with the MNIST dataset


Task: given an image of a handwritten digit, predict the digit class
Input: the image
Target: the digit class
Data: 70,000 images of handwritten digits labeled by humans
Training set: first 60,000 images, used to train the network
Test set: last 10,000 images, not available during training, used to
evaluate performance
This dataset is the “fruit fly” of neural net research
Neural nets already achieved > 99% accuracy in the 1990s, but we
still continue to learn a lot from it



Supervised learning examples

What makes a “2”?


It is very hard to say what makes a 2



Supervised learning examples
Object recognition
Some examples from an earlier version of the net

(Krizhevsky and Hinton, 2012)

ImageNet dataset: thousands of categories, millions of labeled images


Lots of variability in viewpoint, lighting, etc.
Error rate dropped from 26% to under 4% over the course of a few years!

Supervised learning examples
Caption generation

(Xu et al., 2015)

Given: dataset of Flickr images with captions


More examples at http://deeplearning.cs.toronto.edu/i2t

Supervised learning examples
Neural Machine Translation

(Wu et al., 2016)

Now the production model on Google Translate


Unsupervised learning examples
In generative modeling, we want to learn a distribution over some dataset,
such as natural images.
We can evaluate a generative model by sampling from the model and seeing
if it looks like the data.
These results were considered impressive in 2014:

Denton et al., 2014, Deep generative image models using a Laplacian pyramid of adversarial networks


Unsupervised learning examples

The progress of generative models:


Stable Diffusion, Rombach et al., 2022:



Unsupervised learning examples
The progress of generative models:
DreamStudio, Stability AI:



Unsupervised learning examples

Generative models of text. Models like BERT and GPT-2
perform unsupervised learning by reconstructing words in a
sentence: BERT fills in masked words, while GPT-2 predicts the next
word. The GPT-2 model learns from 40GB of Internet text.

https://talktotransformer.com/



Unsupervised learning examples
The GPT-3 models are now learning from 2TB of Internet text.
Similar models are helping with code generation.



Reinforcement learning

An agent interacts with an environment (e.g. game of Breakout)


In each time step,
the agent receives observations (e.g. pixels) which give it information
about the state (e.g. positions of the ball and paddle)
the agent picks an action (e.g. keystrokes) which affects the state
The agent periodically receives a reward (e.g. points)
The agent wants to learn a policy, or mapping from observations to
actions, which maximizes its average reward over time
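
The interaction loop described above can be sketched in a few lines. The environment below (`CorridorEnv`) and both policies are hypothetical toys invented for illustration, and nothing is learned here: the code only shows how observations, actions, and rewards flow between agent and environment at each time step.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: start at position 0, goal at position `length`."""

    def __init__(self, length=3):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                  # initial observation

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (observation, reward, done)."""
        self.position = max(0, self.position + action)
        done = self.position == self.length
        reward = 1.0 if done else 0.0         # reward is given only at the goal
        return self.position, reward, done

def run_episode(env, policy, max_steps=20):
    """Run one episode and return the total reward collected."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)          # policy: observation -> action
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def always_right(observation):
    return +1

rng = random.Random(0)

def random_policy(observation):
    return rng.choice([-1, +1])

reward_right = run_episode(CorridorEnv(), always_right)
reward_random = run_episode(CorridorEnv(), random_policy)
```

A learning algorithm would adjust the policy between episodes so that the total reward increases.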

Reinforcement learning

DeepMind trained neural networks to play many different Atari games


given the raw screen as input, plus the score as a reward
single network architecture shared between all the games
in many cases, the networks learned to play better than humans (in
terms of points in the first minute)
https://www.youtube.com/watch?v=V1eYniJ0Rnk



Reinforcement learning for control

Learning locomotion control from scratch


The reward is to run as far as possible over all the obstacles
single control policy that learns to adapt to different terrains
https://www.youtube.com/watch?v=hx_bgoTF7bs



Software frameworks

Scientific computing (NumPy)


vectorize computations (express them in terms of matrix/vector
operations) to exploit hardware efficiency
Neural net frameworks: PyTorch, TensorFlow, etc.
automatic differentiation
compiling computation graphs
libraries of algorithms and network primitives
support for graphics processing units (GPUs)
For this course:
Python, NumPy
PyTorch, a widely used neural net framework with a built-in automatic
differentiation feature



Software frameworks

Why take this class, if PyTorch does so much for you?

So you know what to do if something goes wrong!


Debugging learning algorithms requires sophisticated detective work,
which requires understanding what goes on beneath the hood.
That’s why we derive things by hand in this class!



After break

Linear models



Overview

One of the fundamental building blocks in deep learning is the linear
model, which makes its decision based on a linear function of the input
vector.
Here, we will review linear models, some other fundamental concepts
(e.g. gradient descent, generalization), and some of the common
supervised learning problems:
Regression: predict a scalar-valued target (e.g. stock price)
Binary classification: predict a binary label (e.g. spam vs. non-spam
email)
Multiway classification: predict a discrete label (e.g. object category,
from a list)



Problem Setup

Want to predict a scalar $t$ as a function of a vector $\mathbf{x}$
Given a dataset of pairs $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^N$
The $\mathbf{x}^{(i)}$ are called input vectors, and the $t^{(i)}$ are called targets.



Problem Setup

Model: $y$ is a linear function of $\mathbf{x}$:
$$y = \mathbf{w}^\top \mathbf{x} + b$$
y is the prediction
w is the weight vector
b is the bias
w and b together are the parameters
Settings of the parameters are called hypotheses

Problem Setup

Loss function: squared error
$$\mathcal{L}(y, t) = \frac{1}{2}(y - t)^2$$
$y - t$ is the residual, and we want to make this small in magnitude
The $\frac{1}{2}$ factor is just to make the calculations convenient.
Cost function: loss function averaged over all training examples
$$\mathcal{J}(\mathbf{w}, b) = \frac{1}{2N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right)^2 = \frac{1}{2N} \sum_{i=1}^N \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - t^{(i)}\right)^2$$



Problem Setup

Visualizing the contours of the cost function:



Vectorization

We can organize all the training examples into a matrix X with one
row per training example, and all the targets into a vector t.

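Computing the predictions for the whole dataset:
$$X\mathbf{w} + b\mathbf{1} = \begin{pmatrix} \mathbf{w}^\top \mathbf{x}^{(1)} + b \\ \vdots \\ \mathbf{w}^\top \mathbf{x}^{(N)} + b \end{pmatrix} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix} = \mathbf{y}$$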



Vectorization

Computing the squared error cost across the whole dataset:

$$\mathbf{y} = X\mathbf{w} + b\mathbf{1}$$
$$\mathcal{J} = \frac{1}{2N} \|\mathbf{y} - \mathbf{t}\|^2$$
In Python:
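
A minimal NumPy sketch of the two formulas above. The function names `predict` and `cost` and the tiny dataset are illustrative assumptions, not the course's own code.

```python
import numpy as np

def predict(X, w, b):
    """Predictions y = Xw + b1 for an (N, D) matrix X with one row per example."""
    return X @ w + b          # the scalar b is broadcast over all N examples

def cost(X, t, w, b):
    """Squared error cost J = ||y - t||^2 / (2N)."""
    y = predict(X, w, b)
    return np.sum((y - t) ** 2) / (2 * len(t))

# Tiny made-up dataset (N = 3, D = 2), for illustration only.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
t = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
b = 0.0

J = cost(X, t, w, b)

# The same cost computed with an explicit loop over training examples.
J_loop = sum(0.5 * (w @ x + b - target) ** 2 for x, target in zip(X, t)) / len(t)
```

The vectorized version avoids the Python-level loop, which matters once N and D are large.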



Solving the optimization problem

We defined a cost function. This is what we’d like to minimize.


Recall from calculus class: the minimum of a smooth function (if it
exists) occurs at a critical point, i.e. point where the partial
derivatives are all 0.
Two strategies for optimization:
Direct solution: derive a formula that sets the partial derivatives to 0.
This works only in a handful of cases (e.g. linear regression).
Iterative methods (e.g. gradient descent): repeatedly apply an update
rule which slightly improves the current solution. This is what we’ll do
throughout the course.
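
For linear regression, the direct solution can be sketched as follows. The formula used here is the standard normal equations, $(X^\top X)\boldsymbol{\theta} = X^\top \mathbf{t}$, which result from setting the partial derivatives of the squared error cost to zero. The slides do not state this formula, so treat it, the synthetic data, and the variable names as assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
t = X @ true_w + true_b            # noiseless made-up targets, so the fit is exact

# Absorb the bias into the weights by appending a column of ones to X.
X_aug = np.hstack([X, np.ones((N, 1))])

# Normal equations: (X^T X) theta = X^T t, with theta = (w, b).
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ t)
w_hat, b_hat = theta[:-1], theta[-1]
```

Solving the linear system with `np.linalg.solve` is generally preferable to forming the matrix inverse explicitly.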



Direct solution
Partial derivatives: derivatives of a multivariate function with respect
to one of its arguments.
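$$\frac{\partial}{\partial x_1} f(x_1, x_2) = \lim_{h \to 0} \frac{f(x_1 + h, x_2) - f(x_1, x_2)}{h}$$
To compute, take the single variable derivatives, pretending the other
arguments are constant.
Example: partial derivatives of the prediction $y$
$$\frac{\partial y}{\partial w_j} = \frac{\partial}{\partial w_j} \left[ \sum_{j'} w_{j'} x_{j'} + b \right] = x_j$$
$$\frac{\partial y}{\partial b} = \frac{\partial}{\partial b} \left[ \sum_{j'} w_{j'} x_{j'} + b \right] = 1$$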



Direct solution
Chain rule for derivatives:
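$$\frac{\partial \mathcal{L}}{\partial w_j} = \frac{d\mathcal{L}}{dy} \frac{\partial y}{\partial w_j} = \frac{d}{dy} \left[ \frac{1}{2}(y - t)^2 \right] \cdot x_j = (y - t)\, x_j$$
$$\frac{\partial \mathcal{L}}{\partial b} = y - t$$
We will give a more precise statement of the Chain Rule next week.
It’s actually pretty complicated.
Cost derivatives (average over data points):
$$\frac{\partial \mathcal{J}}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right) x_j^{(i)}$$
$$\frac{\partial \mathcal{J}}{\partial b} = \frac{1}{N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right)$$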
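
One way to sanity-check derivative formulas like these is to compare them against finite differences, which follow directly from the limit definition of the partial derivative. The sketch below does this on random made-up data; the function names and the step size `h` are assumptions of this example.

```python
import numpy as np

def cost(X, t, w, b):
    """Squared error cost J = (1 / 2N) * sum_i (y_i - t_i)^2."""
    y = X @ w + b
    return np.sum((y - t) ** 2) / (2 * len(t))

def cost_gradients(X, t, w, b):
    """Analytic derivatives: dJ/dw_j = mean((y - t) * x_j), dJ/db = mean(y - t)."""
    residual = X @ w + b - t
    return X.T @ residual / len(t), np.mean(residual)

rng = np.random.default_rng(0)
X, t = rng.normal(size=(20, 3)), rng.normal(size=20)
w, b = rng.normal(size=3), 0.5

grad_w, grad_b = cost_gradients(X, t, w, b)

# Central finite differences: perturb one argument at a time, hold the rest fixed.
h = 1e-6
numeric_w = np.array([
    (cost(X, t, w + h * e, b) - cost(X, t, w - h * e, b)) / (2 * h)
    for e in np.eye(3)
])
numeric_b = (cost(X, t, w, b + h) - cost(X, t, w, b - h)) / (2 * h)
```

If the analytic and numeric values disagree by much more than rounding error, the derivation or the code has a bug.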



Gradient descent
Gradient descent is an iterative algorithm, which means we apply an
update repeatedly until some criterion is met.
We initialize the weights to something reasonable (e.g. all zeros) and
repeatedly adjust them in the direction of steepest descent.
The gradient descent update decreases the cost function for small
enough α:
$$w_j \leftarrow w_j - \alpha \frac{\partial \mathcal{J}}{\partial w_j} = w_j - \frac{\alpha}{N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right) x_j^{(i)}$$

α is a learning rate. The larger it is, the faster w changes.


We’ll see later how to tune the learning rate, but values are typically
small, e.g. 0.01 or 0.0001
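
A minimal sketch of gradient descent for linear regression, applying the update to all weights at once and updating the bias in the same way. The learning rate, the number of steps, and the synthetic data are assumptions chosen so that this small example converges; they are not recommended defaults.

```python
import numpy as np

def gradient_descent(X, t, alpha=0.1, num_steps=2000):
    """Fit y = Xw + b by repeatedly stepping against the gradient of the cost."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0                     # initialize to all zeros
    for _ in range(num_steps):
        residual = X @ w + b - t                # y - t for every training example
        w = w - alpha * (X.T @ residual) / N    # w_j <- w_j - alpha * dJ/dw_j
        b = b - alpha * np.mean(residual)       # b   <- b   - alpha * dJ/db
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = X @ np.array([1.5, -2.0]) + 0.7             # made-up noiseless targets
w, b = gradient_descent(X, t)
```

Here the stopping criterion is simply a fixed number of steps; in practice one might also monitor the cost and stop when it no longer decreases.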

Gradient descent
This gets its name from the gradient:
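$$\nabla \mathcal{J}(\mathbf{w}) = \frac{\partial \mathcal{J}}{\partial \mathbf{w}} = \begin{pmatrix} \frac{\partial \mathcal{J}}{\partial w_1} \\ \vdots \\ \frac{\partial \mathcal{J}}{\partial w_D} \end{pmatrix}$$

This is the direction of fastest increase in $\mathcal{J}$.
Update rule in vector form:
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla \mathcal{J}(\mathbf{w}) = \mathbf{w} - \frac{\alpha}{N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right) \mathbf{x}^{(i)}$$
Hence, gradient descent updates the weights in the direction of
fastest decrease.

Gradient descent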

Visualization:
http://www.cs.toronto.edu/~guerzhoy/321/lec/W01/linear_regression.pdf#page=21



Gradient descent

Why gradient descent, if we can find the optimum directly?


GD can be applied to a much broader set of models
GD can be easier to implement than direct solutions, especially with
automatic differentiation software
For regression in high-dimensional spaces, GD is more efficient than
direct solution (matrix inversion is an $O(D^3)$ algorithm).



Feature maps
We can convert linear models into nonlinear models using feature
maps.
$$y = \mathbf{w}^\top \psi(x)$$
E.g., if $\psi(x) = (1, x, \cdots, x^D)^\top$, then $y$ is a polynomial in $x$. This
model is known as polynomial regression:
$$y = w_0 + w_1 x + \cdots + w_D x^D$$
This doesn’t require changing the algorithm; just pretend $\psi(x)$ is
the input vector.
We don’t need an explicit bias term, since it can be absorbed into $\psi$.
Feature maps let us fit nonlinear models, but it can be hard to choose
good features.
Before deep learning, most of the effort in building a practical machine
learning system was feature engineering.
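
As a sketch of polynomial regression with a feature map, the code below builds the features $(1, x, \ldots, x^D)$ and then runs ordinary linear regression on them. The helper name `poly_features` and the cubic test data are assumptions made for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Map each scalar in x to (1, x, ..., x^degree); returns shape (N, degree + 1)."""
    return np.vander(x, degree + 1, increasing=True)

# Made-up data generated from a cubic, for illustration only.
x = np.linspace(-1.0, 1.0, 20)
t = 1.0 - 2.0 * x + 0.5 * x ** 3

Psi = poly_features(x, degree=3)

# Ordinary linear regression with Psi as the input matrix.
# The constant feature plays the role of the bias, so no separate b is needed.
w, *_ = np.linalg.lstsq(Psi, t, rcond=None)
```

The model is nonlinear in `x` but still linear in `w`, so the fitting procedure is unchanged.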

Feature maps
[Figure: four panels plotting t against x, showing polynomial fits of degree M = 0, 1, 3, 9 to the same data: y = w0; y = w0 + w1 x; y = w0 + w1 x + w2 x^2 + w3 x^3; y = w0 + w1 x + · · · + w9 x^9]

-Pattern Recognition and Machine Learning, Christopher Bishop.



Generalization

Underfitting: The model is too simple - does not fit the data.
[Figure: polynomial fit with M = 0, t plotted against x]

Overfitting: The model is too complex - fits perfectly, does not generalize.
[Figure: polynomial fit with M = 9, t plotted against x]



Generalization

We would like our models to generalize to data they haven’t seen
before
The degree of the polynomial is an example of a hyperparameter,
something we can’t include in the training procedure itself
We can tune hyperparameters using a validation set:
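
A minimal sketch of tuning the polynomial degree with a validation set: the weights are fit on the training set only, and the degree is chosen by the error on held-out validation data. The synthetic sine data, the noise level, the split sizes, and the function names are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Made-up data: noisy samples of sin(2 * pi * x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

def fit(x, t, degree):
    """Least-squares polynomial fit; returns the weight vector (w_0, ..., w_degree)."""
    Psi = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Psi, t, rcond=None)
    return w

def mean_squared_error(x, t, w):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.mean((y - t) ** 2)

x_train, t_train = make_data(15)
x_val, t_val = make_data(100)

val_errors = {}
for degree in range(10):
    w = fit(x_train, t_train, degree)          # parameters: fit on the training set
    val_errors[degree] = mean_squared_error(x_val, t_val, w)

# Hyperparameter: chosen by validation error, not training error.
best_degree = min(val_errors, key=val_errors.get)
```

Training error alone would always favour the highest degree, which is why the degree cannot be chosen by the training procedure itself.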



Classification

Binary linear classification


classification: predict a discrete-valued target
binary: predict a binary target t ∈ {0, 1}
Training examples with t = 1 are called positive examples, and training
examples with t = 0 are called negative examples. Sorry.
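linear: model is a linear function of $\mathbf{x}$, thresholded at zero:
$$z = \mathbf{w}^\top \mathbf{x} + b$$
$$\text{output} = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$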



Logistic Regression

We can’t optimize classification accuracy directly with gradient
descent because it’s discontinuous.
Instead, we typically define a continuous surrogate loss function which
is easier to optimize. Logistic regression is a canonical example of
this, in the context of classification.
The model outputs a continuous value y ∈ [0, 1], which you can think
of as the probability of the example being positive.



Logistic Regression

There’s obviously no reason to predict values outside [0, 1]. Let’s
squash $y$ into this interval.
The logistic function is a kind of sigmoidal, or S-shaped, function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
A linear model with a logistic nonlinearity is known as log-linear:
$$z = \mathbf{w}^\top \mathbf{x} + b$$
$$y = \sigma(z)$$
Used in this way, $\sigma$ is called an activation function, and $z$ is called the
logit.

Logistic Regression

Because $y \in [0, 1]$, we can interpret it as the estimated probability
that $t = 1$.
Being 99% confident of the wrong answer is much worse than being
90% confident of the wrong answer. Cross-entropy loss captures this
intuition:
$$\mathcal{L}_{\text{CE}}(y, t) = \begin{cases} -\log y & \text{if } t = 1 \\ -\log(1 - y) & \text{if } t = 0 \end{cases} = -t \log y - (1 - t) \log(1 - y)$$



Logistic Regression

Logistic regression combines the logistic activation function with
cross-entropy loss.
$$z = \mathbf{w}^\top \mathbf{x} + b$$
$$y = \sigma(z) = \frac{1}{1 + e^{-z}}$$
$$\mathcal{L}_{\text{CE}} = -t \log y - (1 - t) \log(1 - y)$$

Interestingly, the loss asymptotes to a linear function of the logit z.


Full derivation in the readings.
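
The sketch below computes the loss in two ways: from the probability $y$, and directly from the logit $z$. The second form uses the identity $\mathcal{L}_{\text{CE}} = \log(1 + e^{z}) - tz$, which follows from substituting $y = \sigma(z)$ into the cross-entropy formula; this rewritten form and the function names are additions of this example, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, t):
    """L_CE = -t log y - (1 - t) log(1 - y), computed from the probability y."""
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

def logistic_cross_entropy(z, t):
    """The same loss computed from the logit z: log(1 + e^z) - t * z."""
    return np.logaddexp(0.0, z) - t * z     # logaddexp(0, z) = log(1 + e^z)

z = np.array([-3.0, 0.0, 2.0])
t = np.array([1.0, 0.0, 1.0])
direct = cross_entropy(sigmoid(z), t)
from_logit = logistic_cross_entropy(z, t)

# For a confidently wrong prediction (t = 1, z very negative),
# the loss is approximately -z, i.e. linear in the logit.
far_loss = logistic_cross_entropy(-50.0, 1.0)
```

Computing the loss from the logit avoids evaluating `log(0)` when `sigmoid(z)` rounds to exactly 0 or 1 in floating point.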



Multiclass Classification

What about classification tasks with more than two categories?


[Figures: handwritten digits (“It is very hard to say what makes a 2”) and object recognition examples (“Some examples from an earlier version of the net”)]



Multiclass Classification

Targets form a discrete set $\{1, \ldots, K\}$.
It’s often more convenient to represent them as one-hot vectors, or a
one-of-K encoding:
$$\mathbf{t} = \underbrace{(0, \ldots, 0, 1, 0, \ldots, 0)}_{\text{entry } k \text{ is } 1}$$



Multiclass Classification
Now there are D input dimensions and K output dimensions, so we
need K × D weights, which we arrange as a weight matrix W.
Also, we have a K -dimensional vector b of biases.
Linear predictions:
$$z_k = \sum_j w_{kj} x_j + b_k$$
Vectorized:
$$\mathbf{z} = W\mathbf{x} + \mathbf{b}$$



Multiclass Classification

A natural activation function to use is the softmax function, a
multivariable generalization of the logistic function:
$$y_k = \mathrm{softmax}(z_1, \ldots, z_K)_k = \frac{e^{z_k}}{\sum_{k'} e^{z_{k'}}}$$
The inputs $z_k$ are called the logits.
Properties:
Outputs are positive and sum to 1 (so they can be interpreted as
probabilities)
If one of the $z_k$’s is much larger than the others, $\mathrm{softmax}(\mathbf{z})$ is
approximately the argmax. (So really it’s more like “soft-argmax”.)
Exercise: how does the case of K = 2 relate to the logistic function?
Note: sometimes σ(z) is used to denote the softmax function; in this
class, it will denote the logistic function applied elementwise.
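
A short NumPy sketch of the softmax function and a numerical look at the properties listed above. Subtracting the maximum logit before exponentiating is a common numerical-stability trick added here as an assumption of this example; it does not change the output. The last lines check one numerical instance related to the K = 2 exercise.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis. Subtracting the max leaves the result
    unchanged but avoids overflow in exp."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = softmax(np.array([1.0, 2.0, 3.0]))           # positive, sums to 1

# One logit much larger than the others: nearly all mass goes to that class.
peaked = softmax(np.array([1.0, 2.0, 30.0]))

# K = 2: compare the first output with the logistic function
# applied to the difference of the two logits.
z1, z2 = 0.3, -1.2
two_class = softmax(np.array([z1, z2]))
matches_logistic = np.isclose(two_class[0], sigmoid(z1 - z2))
```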



Multiclass Classification

If a model outputs a vector of class probabilities, we can use
cross-entropy as the loss function:
$$\mathcal{L}_{\text{CE}}(\mathbf{y}, \mathbf{t}) = -\sum_{k=1}^K t_k \log y_k = -\mathbf{t}^\top (\log \mathbf{y}),$$
where the log is applied elementwise.


Just like with logistic regression, we typically combine the softmax
and cross-entropy into a softmax-cross-entropy function.



Multiclass Classification

Softmax regression, also called multiclass logistic regression:

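$$\mathbf{z} = W\mathbf{x} + \mathbf{b}$$
$$\mathbf{y} = \mathrm{softmax}(\mathbf{z})$$
$$\mathcal{L}_{\text{CE}} = -\mathbf{t}^\top (\log \mathbf{y})$$
It’s possible to show the gradient descent updates have a convenient
form:
$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial \mathbf{z}} = \mathbf{y} - \mathbf{t}$$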
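
The gradient formula can be checked numerically for a single example by comparing it against central finite differences of the loss with respect to each logit. The logits, the one-hot target, and the step size `h` below are made-up values for illustration.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))       # shift by the max for numerical stability
    return exp_z / np.sum(exp_z)

def softmax_cross_entropy(z, t):
    """L_CE = -t^T log(softmax(z)) for one example with one-hot target t."""
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.5, -1.0, 2.0, 0.1])
t = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot target: class 3 of K = 4

analytic = softmax(z) - t               # the claimed gradient, y - t

# Central finite differences, perturbing one logit at a time.
h = 1e-6
numeric = np.array([
    (softmax_cross_entropy(z + h * e, t) - softmax_cross_entropy(z - h * e, t)) / (2 * h)
    for e in np.eye(len(z))
])
```

The entries of `y - t` sum to zero, reflecting that adding the same constant to every logit leaves the softmax output unchanged.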
