Lecture 6 – Computational Graphs; PyTorch and TensorFlow
DD2424
April 11, 2019
Outline
• First Part
• Computation Graphs
• TensorFlow
• PyTorch
• Notes
• Second Part
Frameworks
O’Reilly Poll: Most popular framework for machine learning
[Source: https://www.techrepublic.com/google-amp/article/most-popular-programming-language-frameworks-and-tools-for-machine-learning/]
What are computation graphs?
Computation Graph
• A DAG (directed acyclic graph)
• Nodes
• Variables
• Mathematical operations (ops)
• Edges
• Feed a variable, or the output of an op, as input to another op
Computation Graph
• $c = a + b$
Computation Graph
• $c = a + b * 2$, built as $z = b * 2$ and $c = a + z$
Computation Graph
• Tensors: multi-dimensional arrays
• $a = Wx + b$, built as $z = Wx$ and $a = z + b$
Computation Graph
• A feed-forward neural network
• One layer: $z_1 = W_1 x$, $a_1 = z_1 + b_1$, $s_1 = \sigma(a_1)$, with inputs $x$, $W_1$, $b_1$
Computation Graph
• A multi-layer feed-forward neural network
• Two layers: $z_1 = W_1 x$, $a_1 = z_1 + b_1$, $s_1 = \sigma(a_1)$; $z_2 = W_2 s_1$, $a_2 = z_2 + b_2$, $s_2 = \sigma(a_2)$
Python (NumPy)
• The same graph, $z = Wx$, $a = z + b$, implemented directly in NumPy
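A minimal NumPy sketch of this forward pass (the slide's own code is not reproduced here; shapes are chosen arbitrarily):

```python
import numpy as np

# Forward pass a = W x + b in plain NumPy (arbitrary small shapes)
np.random.seed(0)
W = np.random.randn(2, 3)
x = np.random.randn(3)
b = np.random.randn(2)

z = W.dot(x)   # z = W x
a = z + b      # a = z + b
print(a)
```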
PyTorch
• The same forward pass, $z = Wx$, $a = z + b$: the PyTorch code looks almost the same as the NumPy code
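A corresponding PyTorch sketch, to show how closely the Tensor API mirrors NumPy (again a reconstruction, not the slide's code):

```python
import torch

# The same forward pass with torch Tensors
torch.manual_seed(0)
W = torch.randn(2, 3)
x = torch.randn(3)
b = torch.randn(2)

z = W.matmul(x)   # z = W x  (matmul handles the matrix-vector case)
a = z + b         # a = z + b
print(a)
```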
PyTorch
• But the PyTorch and NumPy APIs do not always match!
PyTorch-NumPy
• Converting a Torch Tensor to a NumPy array and vice versa is a breeze.
• The Torch Tensor and the NumPy array share their underlying memory (for CPU tensors): changing one changes the other
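A small sketch of the conversion in both directions and of the shared memory (CPU tensors only):

```python
import numpy as np
import torch

t = torch.ones(5)
n = t.numpy()            # Tensor -> NumPy array, sharing the same memory
t.add_(1)                # in-place add on the tensor ...
print(n)                 # ... shows up in the array: [2. 2. 2. 2. 2.]

m = np.ones(3)
u = torch.from_numpy(m)  # NumPy array -> Tensor, also sharing memory
np.add(m, 1, out=m)
print(u)                 # tensor([2., 2., 2.], dtype=torch.float64)
```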
“Define by Run” Computation Graphs
This kind of computation graph is called “define by run”. It is also referred to as “dynamic”.
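A sketch of what “define by run” means in PyTorch: the graph is recorded while ordinary Python executes, so its structure can depend on the data:

```python
import torch

x = torch.randn(3, requires_grad=True)

# Ordinary Python control flow decides which graph gets built this time
if x.sum() > 0:
    y = (x * 2).sum()
else:
    y = (x ** 2).sum()

y.backward()      # backprop through whichever graph was actually run
print(x.grad)
```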
“Define and Run” Computation Graphs
• First define the graph structure
• Then run it by feeding in the (input) variables.
Define graph G: $z_1 = W_1 x$, $a_1 = z_1 + b_1$, $s_1 = \sigma(a_1)$
Run the graph G:
• Run G with $x_1, W_1, b_1$
• Run G with $x_2, W_2, b_2$
• …
Also known as “static graphs”
Define the graph once, then run the graph many times.
TensorFlow
• Static graph: the graph is defined once, and the data loop around the run calls sits outside the definition
• Dynamic graph: the graph is rebuilt inside the data loop, as the code runs
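A TensorFlow 1.x style sketch of “define and run” (placeholder names and shapes are chosen for illustration):

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x API

# 1) Define the graph G once; nothing is computed yet
x = tf.placeholder(tf.float32, shape=[3, 1])
W = tf.placeholder(tf.float32, shape=[2, 3])
b = tf.placeholder(tf.float32, shape=[2, 1])
s1 = tf.sigmoid(tf.matmul(W, x) + b)

# 2) Run G many times, feeding in different values
with tf.Session() as sess:
    for _ in range(3):
        out = sess.run(s1, feed_dict={x: np.random.randn(3, 1),
                                      W: np.random.randn(2, 3),
                                      b: np.random.randn(2, 1)})
        print(out.ravel())
```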
Why computation graphs at all?!
Why computation graphs?
• In Lecture 3, you learnt how to do backprop by hand using the chain rule
Why computation graphs?
• Is it feasible to derive and implement all these gradients by hand for large networks?
Why computation graphs?
• Automatic chain rule
• Automatic backprop using the framework's implemented operations
• Each operation has its gradient already implemented
• If you want to use a novel operation, you have to provide its gradient w.r.t. its inputs and its learnable parameters (if any), as sketched below
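For example, in PyTorch a novel operation can be added by subclassing torch.autograd.Function and supplying both passes yourself (a standard sketch, here a hand-rolled ReLU):

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # remember the input for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):   # you provide the gradient w.r.t. the input
        (x,) = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

x = torch.randn(4, requires_grad=True)
MyReLU.apply(x).sum().backward()
print(x.grad)                         # 1 where x > 0, 0 where x < 0
```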
Let’s look at examples in PyTorch and TensorFlow
Computation Graph
• A feed-forward neural network
• $z_1 = W_1 x$, $a_1 = z_1 + b_1$, $s_1 = \sigma(a_1)$
Computation Graph
• A feed-forward neural network with squared $L_2$ loss
• $z_1 = W_1 x$, $a_1 = z_1 + b_1$, $s_1 = \sigma(a_1)$, $l = \|s_1 - y\|^2$
Backprop in Computation Graph
• The learnable parameters are $W_1$ and $b_1$
• Graph: $z_1 = W_1 x$, $a_1 = z_1 + b_1$, $s_1 = \sigma(a_1)$, $l = \|s_1 - y\|^2$
Backprop in Computation Graph
• We want the gradients of the loss w.r.t. the learnable parameters: $\frac{\partial l}{\partial W_1}$ and $\frac{\partial l}{\partial b_1}$
• Graph: $z_1 = W_1 x$, $a_1 = z_1 + b_1$, $s_1 = \sigma(a_1)$, $l = \|s_1 - y\|^2$
Backprop in Computation Graph
• Local gradients along the edges: $\frac{\partial z_1}{\partial W_1}$, $\frac{\partial a_1}{\partial z_1}$, $\frac{\partial s_1}{\partial a_1}$, $\frac{\partial l}{\partial s_1}$, and $\frac{\partial a_1}{\partial b_1}$
• The chain rule multiplies them backwards along each path: $\frac{\partial l}{\partial W_1} = \frac{\partial l}{\partial s_1}\frac{\partial s_1}{\partial a_1}\frac{\partial a_1}{\partial z_1}\frac{\partial z_1}{\partial W_1}$ and $\frac{\partial l}{\partial b_1} = \frac{\partial l}{\partial s_1}\frac{\partial s_1}{\partial a_1}\frac{\partial a_1}{\partial b_1}$
Backprop in Computation Graph
A deep learning framework automatically calculates the gradients of a graph's output variables w.r.t. its input variables.
Backprop in Computation Graph
• Addition Node
• Forward pass: $a = b + c$
• Backward pass: $\frac{\partial a}{\partial b} = 1$ and $\frac{\partial a}{\partial c} = 1$
Backprop in Computation Graph
• Max Node
• Forward pass: $a = \max(b, c)$
• Backward pass:
• If $b < c$: $\frac{\partial a}{\partial b} = 0$ and $\frac{\partial a}{\partial c} = 1$
• If $b > c$: $\frac{\partial a}{\partial b} = 1$ and $\frac{\partial a}{\partial c} = 0$
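These local gradients can be checked directly with autograd (a small PyTorch sketch):

```python
import torch

b = torch.tensor(2.0, requires_grad=True)
c = torch.tensor(5.0, requires_grad=True)

(b + c).backward()
print(b.grad, c.grad)        # tensor(1.) tensor(1.)

b.grad = None; c.grad = None
torch.max(b, c).backward()   # b < c, so the gradient flows only through c
print(b.grad, c.grad)        # tensor(0.) tensor(1.)
```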
Variables and Ops
• Ops
• Intermediate or final nodes
• Variables
• Intrinsic parameters of the model
• Inputs to the model
• In the example graph ($z_1 = W_1 x$, $a_1 = z_1 + b_1$, $s_1 = \sigma(a_1)$, $l = \|s_1 - y\|^2$), $W_1$, $b_1$, $x$, $y$ are variables and $z_1$, $a_1$, $s_1$, $l$ are ops
Variables and Ops
• Variables
• Intrinsic parameters of the model
• Inputs to the model
• TensorFlow distinguishes between the two
• Variables (tf.Variable) for the parameters
• Placeholders (tf.placeholder) for the inputs
• PyTorch
• Variables for both
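A TF 1.x sketch of the distinction, using the small network from the slides (initializers and shapes are illustrative):

```python
import tensorflow as tf   # TensorFlow 1.x API

x = tf.placeholder(tf.float32, shape=[3, 1])     # inputs: fed in at run time
y = tf.placeholder(tf.float32, shape=[2, 1])
W1 = tf.Variable(tf.random_normal([2, 3]))       # parameters: stored in the graph
b1 = tf.Variable(tf.zeros([2, 1]))

s1 = tf.sigmoid(tf.matmul(W1, x) + b1)
loss = tf.reduce_sum(tf.square(s1 - y))
```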
PyTorch Autograd
• Package: torch.autograd
• A Variable wraps:
• .data: the underlying Tensor
• .grad: the gradient w.r.t. this variable
• .grad_fn: the Function that created this variable
PyTorch Autograd
• Calculate gradient using backward() method of a Variable
• var.backward()
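A sketch of the whole autograd workflow on the one-layer network from the earlier slides (shapes are illustrative):

```python
import torch

x = torch.randn(3)
y = torch.randn(2)
W1 = torch.randn(2, 3, requires_grad=True)   # learnable parameters track gradients
b1 = torch.randn(2, requires_grad=True)

s1 = torch.sigmoid(W1 @ x + b1)              # each result remembers its creator in .grad_fn
l = ((s1 - y) ** 2).sum()
print(l.grad_fn)                             # e.g. <SumBackward0 ...>

l.backward()                                 # backprop through the recorded graph
print(W1.grad.shape, b1.grad.shape)          # dl/dW1: (2, 3), dl/db1: (2,)
```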
TensorFlow gradients
• Add gradient nodes to the graph where necessary using tf.gradients(ys, xs, grad_ys)
• Then evaluate them (e.g. with sess.run), like any other node
TensorFlow gradients
• Then update the parameters
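A TF 1.x sketch of the two previous steps, with the parameters themselves held as placeholders so the update happens on the Python side (names and shapes are illustrative):

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x API

x = tf.placeholder(tf.float32, shape=[3, 1])
y = tf.placeholder(tf.float32, shape=[2, 1])
W = tf.placeholder(tf.float32, shape=[2, 3])
b = tf.placeholder(tf.float32, shape=[2, 1])
loss = tf.reduce_sum(tf.square(tf.sigmoid(tf.matmul(W, x) + b) - y))

grad_W, grad_b = tf.gradients(loss, [W, b])   # add gradient nodes to the graph

W_val = np.random.randn(2, 3).astype(np.float32)
b_val = np.zeros((2, 1), dtype=np.float32)
lr = 0.1
with tf.Session() as sess:
    for _ in range(100):
        feed = {x: np.random.randn(3, 1), y: np.random.rand(2, 1),
                W: W_val, b: b_val}
        gW, gb = sess.run([grad_W, grad_b], feed_dict=feed)   # evaluate the gradients
        W_val -= lr * gW                                      # then update the parameters
        b_val -= lr * gb
```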
TensorFlow gradient
• Use tf.Variable for the parameters instead, so the gradients and updates can be handled inside the graph (sketched below)
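With tf.Variable the gradient computation and the update both live inside the graph, e.g. via an optimizer op (a TF 1.x sketch):

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x API

x = tf.placeholder(tf.float32, shape=[3, 1])
y = tf.placeholder(tf.float32, shape=[2, 1])
W = tf.Variable(tf.random_normal([2, 3]))
b = tf.Variable(tf.zeros([2, 1]))
loss = tf.reduce_sum(tf.square(tf.sigmoid(tf.matmul(W, x) + b) - y))

# minimize() adds both the gradient nodes and the update ops to the graph
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op, feed_dict={x: np.random.randn(3, 1),
                                      y: np.random.rand(2, 1)})
```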
How to use GPU?
PyTorch GPU
Turn a variable into a “GPU” variable with:
• var = var.cuda(#)   (# = the GPU index; omitting it uses the current device)
PyTorch GPU
Turn a variable back into a “CPU” variable with:
• var = var.cpu()
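A small sketch of moving tensors between devices (guarded so it also runs without a GPU):

```python
import torch

x = torch.randn(2, 3)
if torch.cuda.is_available():
    x = x.cuda(0)      # to GPU 0; x.cuda() without an index uses the current device

y = (x * 2).cpu()      # ops run wherever their inputs live; .cpu() brings the result back
print(x.is_cuda, y.is_cuda)
```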
TensorFlow GPU
• In TF, variables and operations can sit on a specific device:
• tf.device('/gpu:0')
• tf.device('/gpu:1')
• …
• tf.device('/cpu:0')
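A TF 1.x sketch of pinning ops to a device (note that the device string must be quoted):

```python
import tensorflow as tf   # TensorFlow 1.x API

with tf.device('/gpu:0'):          # everything created in this block sits on GPU 0
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    c = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)                    # the log shows where each op was placed
```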
TensorFlow GPU
• In TF, variables and operations can sit on a specific device
tf.Session(config=tf.ConfigProto(log_device_placement=True))
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-10 12:59:09.508497: I tensorflow/core/common_runtime/placer.cc:874] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-10 12:59:09.508513: I tensorflow/core/common_runtime/placer.cc:874] add: (Add)/job:localhost/replica:0/task:0/device:GPU:0
Maximum: (Maximum): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-10 12:59:09.508525: I tensorflow/core/common_runtime/placer.cc:874] Maximum: (Maximum)/job:localhost/replica:0/task:0/device:GPU:0
Maximum/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-10 12:59:09.508537: I tensorflow/core/common_runtime/placer.cc:874] Maximum/y: (Const)/job:localhost/replica:0/task:0/device:GPU:0
Placeholder_2: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-10 12:59:09.508548: I tensorflow/core/common_runtime/placer.cc:874] Placeholder_2: (Placeholder)/job:localhost/replica:0/task:0/device:GPU:0
Placeholder_1: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-10 12:59:09.508558: I tensorflow/core/common_runtime/placer.cc:874] Placeholder_1: (Placeholder)/job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-10 12:59:09.508567: I tensorflow/core/common_runtime/placer.cc:874] Placeholder: (Placeholder)/job:localhost/replica:0/task:0/device:GPU:0
TensorFlow GPU
• Some TF operations do not have a CUDA (GPU) implementation; allow_soft_placement=True lets TF fall back to a supported device:
tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=True))
How to implement complicated models in practice?
PT High-Level Library
• PyTorch: the nn package and its Module base class (see the sketch below)
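A sketch of the two-layer network from the earlier slides written with nn.Module (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super(TwoLayerNet, self).__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # owns W1 and b1
        self.fc2 = nn.Linear(d_hidden, d_out)  # owns W2 and b2

    def forward(self, x):
        s1 = torch.sigmoid(self.fc1(x))
        return torch.sigmoid(self.fc2(s1))

net = TwoLayerNet(3, 5, 2)
out = net(torch.randn(4, 3))   # a batch of 4 inputs
print(out.shape)               # torch.Size([4, 2])
```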
TF High-Level Libraries
• Keras: highest abstraction
• SLIM: best pre-trained models
• TFLearn
• Sonnet
• Pretty Tensor
• …
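For comparison, a minimal Keras sketch of a similar network (tf.keras ships with TensorFlow; layer sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(5, activation='sigmoid', input_shape=(3,)),
    tf.keras.layers.Dense(2, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='mse')
model.summary()
```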
Data, storage, and loading
!!!Important!!!
• Always monitor CPU/GPU usage (Linux: nvidia-smi, top)
• Make storage more efficient (TFRecords, etc.)
• Make the reading pipeline more efficient (parallel readers, prefetching, etc.), e.g. as sketched below
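As one example of a more efficient reading pipeline, a PyTorch DataLoader sketch with parallel workers (the dataset and sizes are made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3), torch.randn(1000, 2))
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)   # workers prepare batches in parallel

for x_batch, y_batch in loader:
    pass   # forward/backward pass would go here
```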
Use Visualization
• Always monitor the loss on the training and validation sets visually
• Monitor all other important scalars: learning rate, regularization loss, summaries of layer activations, how full your data queues are, …
• If you have an imbalanced classification problem, visualize the cross-entropy (CE) loss separately for each class
• If you work with images, visualize samples from the batch from time to time; if you do data augmentation, visualize the original sample as well as the augmented one
• TensorBoard for TF
• TensorBoardX, matplotlib, seaborn, … for PyTorch (a logging sketch follows below)
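A tensorboardX sketch of scalar logging for PyTorch (the logged values here are placeholders); view the curves with `tensorboard --logdir runs`:

```python
from tensorboardX import SummaryWriter   # pip install tensorboardX

writer = SummaryWriter('runs/experiment_1')
for step in range(100):
    train_loss, val_loss, lr = 1.0 / (step + 1), 1.2 / (step + 1), 0.01   # placeholder values
    writer.add_scalar('loss/train', train_loss, step)
    writer.add_scalar('loss/val', val_loss, step)
    writer.add_scalar('learning_rate', lr, step)
writer.close()
```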
Use Visualization
You can even have your run's configuration shown as text in TensorBoard!
Which one is better? PyTorch or TensorFlow?
Pros and cons
• PyTorch: easier for prototyping
• PyTorch: much easier to implement flexible graphs
• PyTorch: different structures in each iteration (dependent on data). This is possible with TF too, but is a pain.
• PyTorch: manipulating weight and gradients
• PyTorch: code-level debugging (breakpoints, imperative, tracing your own code instead of TF kernels)
• PyTorch: probably better abstractions for dataset, variable, parallelism, etc. but TF has many high-level wrappers with better abstractions
• Tie?!: faster run-time (NHWC vs. NCHW)
• TF: TensorBoard
• TF: research-level debugging (TensorBoard)
• TF: Windows support
• TF: distributed training (PyTorch has it now too, but seems not as developed as the TF version)
• TF: easier to distribute the code over multiple devices (GPUs/CPUs) (maybe not anymore)
• TF: online community is noticeably larger
• TF: data readers
• TF: supposedly more optimizations of the graph (done by the engine)
• TF: documentation and tutorials
• TF: more models available
• TF: Serialization, code and portability (saving and loading models for across platforms, or checkpoints)
• TF: Deployment: Server, Mobile, etc. (TensorFlow Serving, TensorFlow Lite)
• TF: Richer API (e.g. FFT)
• TF: Automatic shape inference
• TF has a MOOC: https://eu.udacity.com/course/deep-learning--ud730
TensorFlow Eager execution
• Eager Execution
• Dynamic!
• tf.enable_eager_execution()
• Considerably slower (being worked on)
• https://www.tensorflow.org/guide/eager
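A minimal sketch of eager execution in TF 1.x (the call must happen at program start-up):

```python
import tensorflow as tf   # TensorFlow 1.x API

tf.enable_eager_execution()        # from here on, ops run immediately, PyTorch-style

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.matmul(a, a))             # prints the result right away, no Session needed
```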
Caffe(2)
• Portability is seamless (e.g. mobile apps)
• Simplest framework for fine-tuning or feature extraction
• Used to be fastest (Caffe)
Summary
• Don't take the following statements too seriously! It depends on many factors.
• If you want to use pretrained classic deep networks (AlexNet, VGG, ResNet, …) for feature extraction and/or fine-tuning → use Caffe and/or Caffe2
• If you have a mobile application in mind → use Caffe/Caffe2 or TensorFlow
• If you want something more Pythonic → use PyTorch
• If you are familiar with Matlab and don't need much flexibility or advanced layers → use MatConvNet
• If you don't need much flexibility and still want Python → use Keras
• If you are working on NLP applications or complicated RNNs → use PyTorch
• If you want large community support and sustainable learning of a framework → use TensorFlow
• If you want to work on bleeding-edge papers → see which framework has the original and/or cleanest implementation (most likely TensorFlow)
• If you want to prototype many different novel setups → use PyTorch or TF Eager