Course Introduction +
Review of Throughput Hardware Concepts
Computational photography and image processing

[Slide background: screenshot of results tables (NYUDv2 and SIFT Flow semantic segmentation benchmarks) from a fully convolutional network paper]
“We take an average of three hours to draw a single frame on the fastest computer money can buy.”
- Steve Jobs
UNC Pixel Planes (1981), computation-enhanced frame buffer
Jim Clark’s Geometry Engine (1982)
Vertex Generation
  3D vertex stream
Vertex Processing
  Projected vertex stream
Primitive Generation
  Primitive stream
Fragment Generation (“Rasterization”)
  Fragment stream
Fragment Processing
  Fragment stream
Pixel Operations
  Output: image buffer (pixels)
Domain-specific languages for heterogeneous computing
OpenGL Graphics Pipeline (circa 2007)
Fragment stream
Fragment Processing: “fragment shader” (a.k.a. kernel function mapped onto input fragment stream)
Pixel Operations
Output: image buffer (pixels)
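To make the “kernel mapped onto a stream” idea concrete, here is a minimal C-style sketch (not actual GLSL; all names and fields are illustrative assumptions):

// A fragment as the pipeline might hand it to the shader (fields are illustrative).
typedef struct { float r, g, b, a; } color4;
typedef struct { float u, v; } fragment;   // e.g., interpolated texture coordinates

// Hypothetical per-fragment kernel: computes this fragment's output color.
color4 my_fragment_shader(fragment f)
{
  color4 c = { f.u, f.v, 0.5f, 1.0f };     // color derived from the fragment's attributes
  return c;
}

// Conceptually, the pipeline maps the kernel over every element of the fragment stream.
void shade_all_fragments(int n, const fragment* in, color4* out)
{
  for (int i = 0; i < n; i++)              // in hardware: executed in parallel
    out[i] = my_fragment_shader(in[i]);
}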
Emerging state-of-the-art visual computing
systems today…
▪ Intelligent cameras in smartphones
▪ Cloud servers (“infinite” computing and storage at your disposal as a service)
▪ Proliferation of specialized compute accelerators
  - For image processing, machine learning
▪ Proliferation of high-resolution image sensors…
Capturing pixels to communicate
Ingesting/serving the world’s photos
Ingesting/streaming the world’s video
14 cameras
8K x 8K stereo panorama output
VR: high resolution requirements
180°
~5°
Future “retina” VR display:
57 ppd covering 180o
= 10K x 10K display per eye
= 200 MPixel
[Image Credit: Kundu et al. 2016]
NVIDIA Drive PX
What is this?
Mobile
Continuous (always on)
Exceptionally high resolution
Capture for computers to analyze, not humans to watch
What is this course about?
1. Understanding the characteristics of important visual computing workloads
2. Understanding techniques used to achieve efficient system implementations
[Slide background: screenshot of results tables (NYUDv2 and SIFT Flow semantic segmentation benchmarks) from a fully convolutional network paper]

VISUAL COMPUTING WORKLOADS
- Algorithms for 3D graphics, image processing, compression, etc.

MACHINE ORGANIZATION
- Parallelism
- Exploiting locality
- Minimizing communication
- High-throughput hardware designs: parallel, heterogeneous, specialized

DESIGN OF PROGRAMMING ABSTRACTIONS FOR VISUAL COMPUTING
- Choice of programming primitives
- Level of abstraction
In other words
It is about understanding the fundamental
structure of problems in the visual computing
domain, and then leveraging that
understanding to…
Part 2: Accelerating Deep Learning for Computer Vision (from a systems perspective)
Major course themes/topics
Part 3: The GPU Accelerated 3D Graphics Pipeline
[Diagram: system-on-chip (SoC) components]
- Multi-core GPU (3D graphics, OpenCL data-parallel compute)
- Multi-core ARM CPU
- Display engine (compresses pixels for transfer to 4K screen)
- Video encode/decode ASIC (H.265 @ 4K)
And so on…
100000f71: popq %rbp
100000f72: retq
x[i]

Fetch/Decode
ALU (Execute)
Execution Context

ld  r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st  addr[r2], r0

result[i]

(The original slide animates a program counter (PC) stepping through this instruction stream, one instruction per clock.)
This program has five instructions, so it will take five clocks to execute, correct?
Can we do better?
Superscalar execution

a = x*x + y*y + z*z
// assume r0=x, r1=y, r2=z

[Dependency diagram: the three multiplies (x*x, y*y, z*z) are mutually independent (ILP = 3); the add that combines the first two products and the final add that produces a must each wait on prior results (ILP = 1).]
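For illustration, a pseudo-assembly expansion of this expression in the same style as the earlier instruction listings (registers r3-r7 are assumed names):

mul r3, r0, r0   // x*x  --\
mul r4, r1, r1   // y*y     > independent of each other: ILP = 3
mul r5, r2, r2   // z*z  --/
add r6, r3, r4   // must wait for the first two multiplies: ILP = 1
add r7, r6, r5   // must wait for the previous add: ILP = 1   (r7 = a)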
Multi-core: process multiple instruction streams in parallel

[Diagram: four cores (Core 1-4) sharing an L3 cache]

Examples:
- Intel “Skylake” Core i7 quad-core CPU (2015)
- NVIDIA GP104 (GTX 1080) GPU (2016): 20 replicated (“SM”) cores
- Intel Xeon Phi “Knights Landing” 76-core CPU (2015)
- Apple A9 dual-core CPU (2015)
Idea #2:
Amortize cost/complexity of managing an instruction stream across many ALUs

Fetch/Decode
ALU 0  ALU 1  ALU 2  ALU 3
ALU 4  ALU 5  ALU 6  ALU 7

SIMD processing
Single instruction, multiple data
void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[i] = value;
  }
}

Compiler understands loop iterations are independent, and that the same loop body will be executed on a large number of data elements.

Abstraction facilitates automatic generation of both multi-core parallel code, and vector instructions to make use of SIMD processing capabilities within a core.
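As an illustration of that second point, here is a minimal sketch (not from the slides) of what an 8-wide AVX version of this loop might look like; it assumes N is a multiple of 8 and uses the intrinsics from <immintrin.h>:

#include <immintrin.h>

void sinx_avx(int N, int terms, float* x, float* result)
{
  for (int i = 0; i < N; i += 8)            // one 8-wide AVX vector per iteration
  {
    __m256 origx = _mm256_loadu_ps(&x[i]);  // load 8 contiguous input elements
    __m256 x2    = _mm256_mul_ps(origx, origx);
    __m256 numer = _mm256_mul_ps(x2, origx);
    __m256 value = origx;
    float denom = 6.f;                      // uniform across lanes: keep scalar
    float sign  = -1.f;

    for (int j = 1; j <= terms; j++)
    {
      // value += sign * numer / denom, with the scalar coefficient broadcast
      __m256 coeff = _mm256_set1_ps(sign / denom);
      value = _mm256_add_ps(value, _mm256_mul_ps(coeff, numer));
      numer = _mm256_mul_ps(numer, x2);
      denom *= (2*j+2) * (2*j+3);
      sign  *= -1.f;
    }
    _mm256_storeu_ps(&result[i], value);    // store 8 results
  }
}

Note that denom and sign are the same for every lane, so only the per-element quantities need to live in vector registers.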
<unconditional code>
float x = A[i];
if (x > 0) {
float tmp = exp(x,5.f);
tmp *= kMyConst1;
x = tmp + kMyConst2;
} else {
float tmp = kMyConst1;
x = 2.f * tmp;
}
result[i] = x;
<unconditional code>
float x = A[i];
if (x > 0) {              // per-lane mask across the 8 SIMD lanes: T T F T F F F F
    float tmp = exp(x,5.f);
    tmp *= kMyConst1;
    x = tmp + kMyConst2;
} else {
    float tmp = kMyConst1;
    x = 2.f * tmp;
}
<resume unconditional code>
result[i] = x;

Not all ALUs do useful work!
Worst case: 1/8 peak performance
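To make the masking idea concrete, here is a hedged sketch (assumed names, not the course’s code) of how an 8-wide implementation can evaluate both sides of the branch for all lanes and blend the results; it assumes N is a multiple of 8, and a simple placeholder stands in for the exp(x, 5.f) branch body:

#include <immintrin.h>

void branch_masked(int N, const float* A, float* result,
                   float kMyConst1, float kMyConst2)
{
  for (int i = 0; i < N; i += 8)
  {
    __m256 x    = _mm256_loadu_ps(&A[i]);
    // per-lane mask: true where x > 0 (the "T T F T F F F F" pattern above)
    __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ);

    // "then" side, evaluated for ALL lanes (placeholder for the exp-based branch)
    __m256 thenv = _mm256_add_ps(
        _mm256_mul_ps(x, _mm256_set1_ps(kMyConst1)),
        _mm256_set1_ps(kMyConst2));

    // "else" side, also evaluated for ALL lanes
    __m256 elsev = _mm256_mul_ps(_mm256_set1_ps(2.f), _mm256_set1_ps(kMyConst1));

    // keep the "then" result where the mask is true, the "else" result elsewhere
    __m256 out = _mm256_blendv_ps(elsev, thenv, mask);
    _mm256_storeu_ps(&result[i], out);
  }
}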
8 cores
8 SIMD ALUs per core
(AVX2 instructions)
* Showing only AVX math units, and fetch/decode unit for AVX (additional capability for integer math)
Example: NVIDIA GTX 1080 GPU
20 cores (“SMs”)
128 SIMD ALUs per core (@ 1.6 GHz) = 8.1 TFLOPs (180 Watts)
Part 2: Accessing memory
Memory
▪ Memory bandwidth
- The rate at which the memory system can provide data to a processor
- Example: 20 GB/s
[Diagram: CPU memory hierarchy]
Core 1: L1 cache (32 KB), L2 cache (256 KB)
...
Core N: L1 cache (32 KB), L2 cache (256 KB)
Shared L3 cache (8 MB)
25 GB/sec to memory (DDR3 DRAM, Gigabytes)
* Caches also provide high bandwidth data transfer to CPU
Prefetching reduces stalls (hides latency)
▪ All modern CPUs have logic for prefetching data into caches
- Dynamically analyze program’s access patterns, predict what it will access soon
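Hardware prefetchers do this automatically; as a hedged illustration of the same idea expressed in software, here is a sketch using GCC/Clang’s __builtin_prefetch (the loop, its names, and the prefetch distance are assumptions for illustration):

void scale(int n, const float* in, float* out, float s)
{
  const int DIST = 16;                      // prefetch distance (tuning assumption)
  for (int i = 0; i < n; i++)
  {
    if (i + DIST < n)
      __builtin_prefetch(&in[i + DIST]);    // hint: bring this element into cache soon
    out[i] = s * in[i];
  }
}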
[Diagram: one core with four execution contexts running four threads; time runs downward. When the running thread stalls on memory, the core switches to another runnable thread (1 → 2 → 3 → 4), hiding the stall latency; each thread eventually reaches “Done!”.]
Throughput computing trade-off
Thread 1: elements 0 … 7
Thread 2: elements 8 … 15
Thread 3: elements 16 … 23
Thread 4: elements 24 … 31
[Timeline diagram: time runs downward until each thread is done]
16 simultaneous instruction streams
[Diagram labels: ALU (Execute), Execution Context]

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[i] = value;
  }
}
Review: superscalar execution

Unmodified program:

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[i] = value;
  }
}

My single core, superscalar processor:
executes up to two instructions per clock from a single instruction stream.
[Diagram: core with two Fetch/Decode units]

Multi-core (pthreads) version:

void parallel_sinx(int N, int terms, float* x, float* result) {
  pthread_t thread_id;
  my_args args;

  args.N = N/2;
  args.terms = terms;
  args.x = x;
  args.result = result;

  // launch thread
  pthread_create(&thread_id, NULL, my_thread_start, &args);
  sinx(N - args.N, terms, x + args.N, result + args.N); // do work
  pthread_join(thread_id, NULL);
}

[Diagram: two cores, each with two Fetch/Decode units, Exec 1, Exec 2, and an execution context]
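The code above references a my_args struct and a my_thread_start entry point that are not shown in this extract; a minimal sketch of what they presumably look like (hypothetical reconstruction, matching the call sites and assuming the sinx() function from the earlier slides):

// Hypothetical definitions (not shown in this extract) for the helpers used above.
typedef struct {
  int N;
  int terms;
  float* x;
  float* result;
} my_args;

// Thread entry point: run sinx() on the first half of the input array.
void* my_thread_start(void* thread_arg)
{
  my_args* args = (my_args*)thread_arg;
  sinx(args->N, args->terms, args->x, args->result);
  return NULL;
}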
Review: four SIMD, multi-threaded cores

Observation: memory operations have very long latency
Solution: hide latency of loading data for one iteration by executing arithmetic instructions from other iterations

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];           // <-- memory load
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;
    ...
    result[i] = value;
  }
}

My multi-threaded, SIMD quad-core processor:
executes one SIMD instruction per clock from one instruction stream on each core,
but can switch to processing the other instruction stream when faced with a stall.

[Diagram: four cores, each with its own Fetch/Decode unit and two execution contexts]
Summary: four superscalar, SIMD, multi-threaded cores
My multi-threaded, superscalar, SIMD quad-core processor:
executes up to two instructions per clock from one instruction stream on each core
(in this example: one SIMD instruction + one scalar instruction).
Processor can switch to execute the other instruction stream when faced with stall.
[Diagram: four cores; each core has two Fetch/Decode units and its own execution units and contexts]
On-chip interconnect
L3 Cache
Memory Controller
Memory Bus (to DRAM)
▪ Question: If you were the OS, how would you assign the two threads (pthreads) to the four available execution contexts?
[Diagram: the quad-core processor’s four Fetch/Decode units / execution contexts]
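One concrete way a programmer (rather than the OS) can pin threads to particular hardware contexts is CPU affinity; a hedged, Linux-specific sketch using glibc’s pthread_setaffinity_np, purely to make the question concrete (the helper name is an assumption):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the given thread to one logical CPU (one hardware execution context).
// Whether to place the two threads on the same core (sharing its contexts) or on
// different cores is exactly the trade-off the question asks about.
static int pin_to_context(pthread_t t, int cpu_id)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu_id, &set);
  return pthread_setaffinity_np(t, sizeof(set), &set);
}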
Another thought experiment

Task: element-wise multiplication of two vectors A and B (A × B = C)
Assume vectors contain millions of elements
- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Bandwidth limited!
If processors request data at too high a rate, the memory system cannot keep up.
No amount of latency hiding helps this.
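A rough back-of-the-envelope check, assuming 4-byte floats and the ~25 GB/sec memory bandwidth figure from the earlier memory-hierarchy slide (the function itself is just the workload written out, not from the slides):

void multiply(int n, const float* A, const float* B, float* C)
{
  for (int i = 0; i < n; i++)
    C[i] = A[i] * B[i];       // 2 loads + 1 store (12 bytes moved) per multiply
}
// At ~25 GB/sec, the memory system can feed only about 2 billion multiplies per
// second -- orders of magnitude below what the ALUs of a modern multi-core/SIMD
// chip could perform, so the loop is bandwidth limited regardless of latency hiding.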
Die temp (junction temp, Tj): chip becomes unreliable above this temp
(chip can run at high power for a short period of time until chip heats to Tj)

Case temp: mobile device gets too hot for user to comfortably hold
(chip is at suitable operating temp, but heat is dissipating into case)

Battery life: chip and case are cool, but want to reduce power consumption to sustain long battery life for given task

[Figure: power vs. time]

Slide credit: adapted from original slide from M. Shebanow, HPG 2013 keynote
Efficiency benefits of compute specialization
▪ Rules of thumb: compared to high-quality C code on CPU...
[Source: Chung et al. 2010, Dally 08] [Figure credit: Eric Chung]
Hardware specialization increases efficiency

[Chart residue: labels include FPGA, GPUs, multi-core CPU, multi-core GPU (Adreno), Hexagon DSP]

Example: Qualcomm Hexagon DSP
• VLIW: area & power efficient multi-issue
  - Variable-sized instruction packets (1 to 4 instructions per packet)
• Dual 64-bit execution units
  - Standard 8/16/32/64-bit data types
  - SIMD vectorized MPY / ALU / SHIFT, Permute, BitOps
  - Up to 8 16-bit MAC/cycle
  - 2 SP FMA/cycle
• Dual 64-bit load/store units (also 32-bit ALU)
• Unified 32x32-bit General Register File is best for compiler
  - No separate address register file/thread or accumulator registers
  - Per-thread register file

[Block diagram: Instruction Unit, L2 Cache / TCM, Data Units (Load/Store/ALU), Execution Units (64-bit Vector), Data Cache, Register File, device DDR memory]
[Source: Qualcomm Technologies, Inc.]
Summary: choosing the right tool for the job

Energy-optimized CPU  |  Throughput-oriented processor (GPU)  |  Programmable DSP  |  FPGA / future reconfigurable logic  |  ASIC

Efficiency (vs. CPU): ~10X more efficient (GPU/DSP) … ~100X??? (jury still out) (FPGA) … ~100-1000X more efficient (ASIC)

Programmability: easiest to program (CPU) … difficult to program, making it easier is an active area of research (FPGA) … not programmable + costs 10-100’s of millions of dollars to design / verify / create (ASIC)

ASIC examples: video encode/decode, audio playback, camera RAW processing, neural nets (future?)

Credit Pat Hanrahan for this taxonomy
Data movement has high energy cost
▪ Rule of thumb in mobile system design: always seek to reduce amount of
data transferred from memory
- Earlier in class we discussed minimizing communication to reduce stalls (poor performance).
Now, we wish to reduce communication to reduce energy consumption
- Integer op: ~1 pJ *
- Floating point op: ~20 pJ *
- Reading 64 bits from small local SRAM (1mm away on chip): ~26 pJ
- Reading 64 bits from low power mobile DRAM (LPDDR): ~1200 pJ

Suggests that recomputing values, rather than storing and reloading them, is a better answer when optimizing code for energy efficiency!

▪ Implications
- Reading 10 GB/sec from memory: ~1.6 watts (follows from the ~1200 pJ per 64-bit LPDDR read above: 10 GB/s ÷ 8 bytes per read × ~1200 pJ ≈ 1.5-1.6 W)
- Entire power budget for mobile GPU: ~1 watt
(remember phone is also running CPU, display, radios, etc.)
- iPhone 6 battery: ~7 watt-hours (note: my Macbook Pro laptop: 99 watt-hour battery)
- Exploiting locality matters!!!
* Cost to just perform the logical operation, not counting overhead of instruction decode, load data from registers, etc.
Welcome to cs348v!
▪ Make sure you are signed up on Piazza so you get
announcements
▪ Tonight’s reading:
- “The Compute Architecture of Intel Processor Graphics Gen9” - Intel Technical
Report, 2015
- “The Rise of Mobile Visual Computing Systems”, Fatahalian, IEEE Mobile
Computing 2016
Program 1
// compute E = D + ((A + B) * C)
add(n, A, B, tmp1);
mul(n, tmp1, C, tmp2);
add(n, tmp2, D, E);
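A hedged sketch of the library routines Program 1 presumably calls (these definitions are assumed, not shown in this extract). Each call makes a full pass over n-element arrays, so the unfused version writes and re-reads the temporaries tmp1 and tmp2 through the memory system, while the fused version below touches each element only once:

void add(int n, const float* A, const float* B, float* out) {
  for (int i = 0; i < n; i++)
    out[i] = A[i] + B[i];
}
void mul(int n, const float* A, const float* B, float* out) {
  for (int i = 0; i < n; i++)
    out[i] = A[i] * B[i];
}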
Program 2
void fused(int n, float* A, float* B, float* C, float* D, float* E) {
for (int i=0; i<n; i++)
E[i] = D[i] + (A[i] + B[i]) * C[i];
}
// compute E = D + (A + B) * C
fused(n, A, B, C, D, E);
[Timeline figure: time (clocks) on the horizontal axis; cells mark which thread (T0 or T1) an ALU is executing at each moment]

Same as previous slide, but now just a different scheduling order of the threads (fine-grained interleaving)

[Timeline figure: same visualization with four threads (T0-T3); cells mark which thread some ALU is executing at each moment]
Another way to visualize execution (ALU-centric view)
Consider a processor with:
▪ Four execution contexts
▪ Two fetch and decode units (two instructions per clock, choose two of four threads)
▪ Two ALUs (to execute the two instructions)
Now the graph is visualizing what each ALU is doing each clock:
[Timeline figure: time (clocks) on the horizontal axis; one row per ALU (ALU 0, ALU 1) showing what each ALU is doing each clock]