
Lecture 1:

Course Introduction +
Review of Throughput Hardware Concepts

Visual Computing Systems


Stanford CS348V : Winter 2018
Hi!

Me: Prof. Kayvon Fatahalian
Your TA: Raj Setaluri

Stanford CS348V, Winter 2018


Visual computing applications

2D/3D graphics
Computational photography and image processing
Understanding the contents of images and videos

Stanford CS348V, Winter 2018
Visual Computing Systems
— Some History

(why I get so excited about this topic)

Stanford CS348V, Winter 2018


Ivan Sutherland’s Sketchpad on MIT TX-2 (1962)
Shoup’s SuperPaint (PARC 1972-73)
The frame buffer: 16 2K shift registers (640 x 486 x 8 bits)
Xerox Alto (1973)

Bravo (WYSIWYG) TI 74181 ALU


Goal: render everything you’ve ever seen

“Road to Pt. Reyes”


LucasFilm (1983)
Pixar’s Toy Story (1995)

“We take an average of three hours to draw a single frame on the fastest computer money can buy.”
- Steve Jobs
UNC Pixel Planes (1981), computation-enhanced frame buffer
Jim Clark’s Geometry Engine (1982)

ASIC for geometric transforms used in real-time graphics.
Real-time (30 fps) on an NVIDIA Titan X

Unreal Engine Kite Demo (Epic Games 2015)


NVIDIA Titan X Pascal GPU (2017)
(~ 12 TFLOPs fp32)
~ ASCI Q (top US supercomputer circa 2002)
Modern GPU: heterogeneous multi-core
(Block diagram: many SIMD execution cores, each with its own fetch/exec units and cache; fixed-function units for Texture, Tessellate, Clip/Cull, Rasterize, and Z-buffer/Blend; a GDDR5 memory interface; and a chip-wide Scheduler / Work Distributor.)

Multi-threaded, SIMD cores


Custom circuits for key graphics arithmetic
Custom circuits for HW-assisted graphics-specific DRAM compression
HW logic for scheduling work onto these resources
Domain-specific languages for heterogeneous computing
OpenGL Graphics Pipeline (circa 2007)

Input vertex buffer
  ↓
Vertex Generation
  ↓  3D vertex stream
Vertex Processing
  ↓  Projected vertex stream
Primitive Generation
  ↓  Primitive stream
Fragment Generation (“Rasterization”)
  ↓  Fragment stream
Fragment Processing
  ↓  Fragment stream
Pixel Operations
  ↓
Output image buffer (pixels)
Domain-specific languages for heterogeneous computing
OpenGL Graphics Pipeline (circa 2007)

(Same pipeline as above; the programmer supplies a “fragment shader”, i.e., a kernel function mapped onto the input fragment stream by the Fragment Processing stage.)

uniform sampler2D myTexture;   // read-only global variables
uniform vec3 lightDir;

varying vec3 norm;             // “per-element” inputs
varying vec2 uv;

void myFragmentShader()
{
    vec3 kd = texture2D(myTexture, uv).rgb;
    kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
    gl_FragColor = vec4(kd, 1.0);   // per-element output: RGBA surface color at pixel
}
Emerging state-of-the-art visual computing
systems today…
▪ Intelligent cameras in smartphones
▪ Cloud servers (“infinite” computing and storage at your disposal as a service)
▪ Proliferation of specialized compute accelerators
  - For image processing, machine learning
▪ Proliferation of high-resolution image sensors…
Capturing pixels to communicate
Ingesting/serving the world’s photos
2B photo uploads and shares per day across Facebook sites (incl. Instagram + WhatsApp) [FB 2015]

Ingesting/streaming the world’s video
YouTube 2015: 300 hours uploaded per minute [YouTube]
Cisco VNI projection: 80-90% of 2019 internet traffic will be video (64% in 2014).
Richer content: beyond a single image
■ Example: Apple’s “Live Photos”
■ Each photo is not only a single frame, but a few seconds of video before and after the
shutter is clicked
Facebook Live
VR output
Example: Google’s Jump VR video
Input stream: 16 4K GoPro cameras
Register + 3D align video stream (on edge device)
Broadcast encoded video stream across
the country to millions of viewers
High resolution, multi-camera
Facebook Surround 360 VR video

2048 x 2048 PointGrey Camera @ 30 FPS

14 cameras
8K x 8K stereo panorama output
VR: high resolution requirements

180°
~5°

Future “retina” VR display:
57 ppd covering 180°
= 10K x 10K display per eye
= 200 MPixel

RAW data rate @ 120Hz ≈ 72 GB/sec

iPhone 6: 4.7 in “retina” display:
1.3 MPixel
326 ppi → 57 ppd
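A quick check of these numbers (assuming the 200 MPixel figure counts both eyes and ~3 bytes of RAW RGB per pixel):
57 ppd × 180° ≈ 10,260 ≈ 10K pixels per dimension
10K × 10K per eye × 2 eyes ≈ 200 MPixel
200 MPixel × 3 bytes × 120 Hz ≈ 72 GB/sec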
Enhancing communication: understanding
images to improve acquired content
AutoEnhance: photo “fix up” [Hays 2007]
My bad vacation photo → part to fix → similar photos others have taken → fixed!

Portrait Mode
On every vehicle: analyzing images for robot navigation
High-resolution video (moving camera)

[Image Credit: Kundu et al. 2016]
NVIDIA Drive PX

Tegra X1 (1 TFlop fp16 at 1GHz)


On every corner: analyzing images for urban efficiency

“Managing urban areas has become one of the most


important development challenges of the 21st
century. Our success or failure in building sustainable
cities will be a major factor in the success of the
post-2015 UN development agenda.”

- UN Dept. of Economic and Social Affairs


High resolution (static camera)
Sensing human social interactions [Joo 2015]

CMU Panoptic Studio


480 video cameras (640 x 480 @ 25fps)
116 GPixel video sensor
(2.9 TPixel /sec)
Capturing social interactions

[Courtesy Yaser Sheikh, Tomas Simon, Hanbyul Joo]


Capturing social interactions

[Courtesy Yaser Sheikh, Tomas Simon, Hanbyul Joo]


On every human: analyzing egocentric images
to augment humans

What does this say?

What is this?

Gwangjang Market (Seoul)


AR requires low-latency localization and
scene object recognition
[Tamburo 2016]
Smart headlight system

~1000 Hz (1 - 1.5 ms latency)


[Tamburo 2016]
Seeing clearly through precipitation
Future challenge: recording and analyzing the
world’s visual information, so computers can
understand and reason about it
Capturing everything about the visual world
To understand people
To understand the world around vehicles/drones
To understand cities

Mobile
Continuous (always on)
Exceptionally high resolution
Capture for computers to analyze, not humans to watch
What is this course about?
1. Understanding the characteristics of important visual computing workloads
2. Understanding techniques used to achieve efficient system implementations
(Course structure diagram)

VISUAL COMPUTING WORKLOADS
Algorithms for 3D graphics, image processing, compression, etc.

MACHINE ORGANIZATION
High-throughput hardware designs: parallel, heterogeneous, specialized
Parallelism
Exploiting locality
Minimizing communication

DESIGN OF PROGRAMMING ABSTRACTIONS FOR VISUAL COMPUTING
Choice of programming primitives
Level of abstraction
In other words
It is about understanding the fundamental
structure of problems in the visual computing
domain, and then leveraging that
understanding to…

To design more efficient algorithms
To build the most efficient hardware to run these applications
To design the right programming systems to make developing new applications simpler, more productive, and highly performant
Course Logistics

Stanford CS348V, Winter 2018


Logistics
▪ Course web site:
- http://graphics.stanford.edu/courses/cs348v-18-winter

▪ All announcements will go out via Piazza


- https://piazza.com/class/jc1f626cfne6r6

▪ Kayvon’s office hours: Tuesday after class, or by appt.

Stanford CS348V, Winter 2018


Expectations of you
▪ 20% participation
- There will be ~1 assigned paper reading per class
- Everyone is expected to come to class and participate in discussions based on readings
- You are encouraged to discuss papers and/or my lectures on the course discussion board.
- If you form a weekly course reading/study group, I will buy pizza for said group.

▪ 30% mini-assignments (3 short programming assignments)


- Assignment 1: analyze parallel program performance on a multi-core CPU
- Assignment 2: implement and optimize a basic RAW image processing pipeline
- Assignment 3: optimize performance of a modern DNN module

▪ 20% 1 take-home “exam”

▪ 30% self-selected final project


- I suggest you start thinking about projects now (can be teams of up to two)

Stanford CS348V, Winter 2018


Major course themes/topics
Part 1: High Efficiency Image and Video Processing

Overview of a Modern Digital Camera Processing Pipeline


Image Processing Algorithms You Should Know
Efficiently Scheduling Image Processing Algorithms on Parallel Hardware
Specialized Hardware for Image Processing
Lossy Image (JPG) and Video (H.264) Compression
Video Processing/Synthesis for Virtual Reality Display

Part 2: Accelerating Deep Learning for Computer Vision (from a systems perspective)

Workload Characteristics of DNN Inference for Image Analysis


Scheduling and Algorithms for Parallel DNN Training at Scale
A Case Study of Algorithmic Optimizations for Object Detection
Leveraging Task-Specific DNN Structure for Improving Performance and Accuracy
Hardware Accelerators for DNN Inference
Design Space of Dataflow Programming Abstractions for Deep Learning
Enhancing Efficiency Through Model Specialization (in particular for video)
Efficient Inference at Datacenter Scale

Stanford CS348V, Winter 2018


Algorithmic innovation in image classification
Improving accuracy-per-unit cost using better DNN designs?
2014 → 2017: ~25x improvement in cost at similar accuracy

                       ImageNet Top-1 Accuracy   Num Params   Cost/image (MADDs)
VGG-16 [2014]          71.5%                     138M         15B
GoogleNet [2015]       70%                       6.8M         1.5B
ResNet-18 [2016]       73% *                     11.7M        1.8B
MobileNet-224 [2017]   70.5%                     4.2M         0.6B

* 10-crop results (ResNet 1-crop results are similar to other DNNs in this table)

Stanford CS348V, Winter 2018
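(The ~25x figure follows from the cost column: VGG-16 needs roughly 15B multiply-adds per image versus MobileNet-224’s 0.6B, about a 25x reduction at roughly similar top-1 accuracy.)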
Major course themes/topics
Part 3: The GPU Accelerated 3D Graphics Pipeline

Real-Time 3D Graphics Pipeline Architecture


Hardware Acceleration of Z-Buffering and Texturing
Scheduling the Graphics Pipeline onto a GPU
Domain Specific Languages for Shading

Stanford CS348V, Winter 2018


Review:
key principles of modern
throughput computing hardware

Stanford CS348V, Winter 2018


Review concepts
▪ What are these design concepts, and what problem/goals do they
address?
- Multi-core processing
- SIMD processing
- Hardware multi-threading

▪ What is the motivation for specialization via:


- Multiple types of processors (e.g., CPUs, GPUs)
- Custom hardware units (ASIC)

▪ Why is memory bandwidth a major constraint when mapping applications to modern computer systems?
Stanford CS348V, Winter 2018
Let’s crack open a modern smartphone
Samsung Galaxy S7 phone with Qualcomm Snapdragon 820 processor

Multi-core ARM CPU
Multi-core GPU (3D graphics, OpenCL data-parallel compute)
Display engine (compresses pixels for transfer to 4K screen)
Video encode/decode ASIC (H.265 @ 4K)
Image Signal Processor (ISP): ASIC for processing pixels off camera (25MP at 30Hz)
“Hexagon” programmable DSP: data-parallel multi-media processing
Stanford CS348V, Winter 2018
Multi-core processing

Stanford CS348V, Winter 2018


Review: what does a processor do?

It runs programs!

The processor executes the instruction referenced by the program counter (PC).
(Executing the instruction will modify machine state: contents of registers, memory, CPU state, etc.)
Then it moves to the next instruction … and executes it … and so on …

_main:
100000f10:  pushq  %rbp
100000f11:  movq   %rsp, %rbp
100000f14:  subq   $32, %rsp
100000f18:  movl   $0, -4(%rbp)
100000f1f:  movl   %edi, -8(%rbp)
100000f22:  movq   %rsi, -16(%rbp)
100000f26:  movl   $1, -20(%rbp)
100000f2d:  movl   $0, -24(%rbp)
100000f34:  cmpl   $10, -24(%rbp)
100000f38:  jge    23 <_main+0x45>
100000f3e:  movl   -20(%rbp), %eax
100000f41:  addl   -20(%rbp), %eax
100000f44:  movl   %eax, -20(%rbp)
100000f47:  movl   -24(%rbp), %eax
100000f4a:  addl   $1, %eax
100000f4d:  movl   %eax, -24(%rbp)
100000f50:  jmp    -33 <_main+0x24>
100000f55:  leaq   58(%rip), %rdi
100000f5c:  movl   -20(%rbp), %esi
100000f5f:  movb   $0, %al
100000f61:  callq  14
100000f66:  xorl   %esi, %esi
100000f68:  movl   %eax, -28(%rbp)
100000f6b:  movl   %esi, %eax
100000f6d:  addq   $32, %rsp
100000f71:  popq   %rbp
100000f72:  retq

Stanford CS348V, Winter 2018


Executing an instruction stream

x[i]

Fetch/
Decode
ld r0, addr[r1]
mul r1, r0, r0
ALU mul r1, r1, r0
(Execute) ...
...
...
Execution ...
Context ...
...
st addr[r2], r0

result[i]

Stanford CS348V, Winter 2018


Executing an instruction stream
My very simple processor: executes one instruction per clock
x[i]

Fetch/
Decode
PC ld r0, addr[r1]
mul r1, r0, r0
ALU mul r1, r1, r0
(Execute) ...
...
...
Execution ...
Context ...
...
st addr[r2], r0

result[i]

Stanford CS348V, Winter 2018


Executing an instruction stream
My very simple processor: executes one instruction per clock
x[i]

Fetch/
Decode
ld r0, addr[r1]
PC mul r1, r0, r0
ALU mul r1, r1, r0
(Execute) ...
...
...
Execution ...
Context ...
...
st addr[r2], r0

result[i]

Stanford CS348V, Winter 2018


Executing an instruction stream
My very simple processor: executes one instruction per clock
x[i]

Fetch/
Decode
ld r0, addr[r1]
mul r1, r0, r0
ALU PC mul r1, r1, r0
(Execute) ...
...
...
Execution ...
Context ...
...
st addr[r2], r0

result[i]

Stanford CS348V, Winter 2018


Quick aside:
Instruction-level parallelism and
superscalar execution

Stanford CS348V, Winter 2018


Instruction level parallelism (ILP) example
a = x*x + y*y + z*z

Consider the following program:


// assume r0=x, r1=y, r2=z

mul r0, r0, r0


mul r1, r1, r1
mul r2, r2, r2
add r0, r0, r1
add r3, r0, r2

// now r3 stores value of program variable ‘a’

This program has five instructions, so it will take five clocks to execute, correct?
Can we do better?

Stanford CS348V, Winter 2018


ILP example
a = x*x + y*y + z*z

Dependency graph: the three multiplies (x*x, y*y, z*z) are independent and can execute in parallel (ILP = 3); the first add depends on two of the multiplies (ILP = 1), and the final add depends on the first add (ILP = 1).

Stanford CS348V, Winter 2018
Superscalar execution
a = x*x + y*y + z*z
// assume r0=x, r1=y, r2=z

1. mul r0, r0, r0


2. mul r1, r1, r1
3. mul r2, r2, r2
4. add r0, r0, r1
5. add r3, r0, r2

// r3 stores value of variable ‘a’

Superscalar execution: processor automatically finds independent instructions in an


instruction sequence and executes them in parallel on multiple execution units!

In this example: instructions 1, 2, and 3 can be executed in parallel
(on a superscalar processor that detects the lack of dependencies at run-time)
But instruction 4 must come after instructions 1 and 2
And instruction 5 must come after instruction 4
Stanford CS348V, Winter 2018
Superscalar execution
Program: computes sin of input x via Taylor expansion

float sinx(int N, int terms, float x)
{
   float value = x;
   float numer = x * x * x;
   int denom = 6; // 3!
   int sign = -1;

   for (int j=1; j<=terms; j++)
   {
      value += sign * numer / denom;
      numer *= x * x;
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
   }

   return value;
}

My single-core, superscalar processor: executes up to two instructions per clock from a single instruction stream.
(Fetch/Decode ×2; Exec 1, Exec 2; Execution Context)

Independent operations in the instruction stream are detected by the processor at run-time and may be executed in parallel on execution units 1 and 2.

Stanford CS348V, Winter 2018


Now consider a program that computes
the sine of many numbers…

Stanford CS348V, Winter 2018


Example program
Compute sin(x) using Taylor expansion: sin(x) = x - x3/3! + x5/5! - x7/7! + ...
for each element of an array of N floating-point numbers
void sinx(int N, int terms, float* x, float* result)
{
for (int i=0; i<N; i++)
{
float value = x[i];
float numer = x[i] * x[i] * x[i];
int denom = 6; // 3!
int sign = -1;

for (int j=1; j<=terms; j++)


{
value += sign * numer / denom;
numer *= x[i] * x[i];
denom *= (2*j+2) * (2*j+3);
sign *= -1;
}

result[i] = value;
}
}
Stanford CS348V, Winter 2018
Multi-core: process multiple instruction streams in parallel

Sixteen cores, sixteen simultaneous instruction streams


Stanford CS348V, Winter 2018
Multi-core examples

Core 1 Core 2

Shared L3 cache

Core 3 Core 4

Intel “Skylake” Core i7 quad-core CPU (2015)
NVIDIA GP104 (GTX 1080) GPU (2016): 20 replicated (“SM”) cores

Stanford CS348V, Winter 2018


More multi-core examples

Core 1

Core 2

Intel Xeon Phi “Knights Landing” 76-core CPU (2015)
Apple A9 dual-core CPU (2015)

A9 image credit: Chipworks (obtained via Anandtech)


http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/3 Stanford CS348V, Winter 2018
SIMD processing

Stanford CS348V, Winter 2018


Add ALUs to increase compute capability

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs

SIMD processing: single instruction, multiple data
The same instruction is broadcast to all ALUs and executed in parallel on all ALUs.

(Core diagram: Fetch/Decode; ALU 0 … ALU 7; Execution Context)

Stanford CS348V, Winter 2018


Scalar program

Original compiled program: processes one array element using scalar instructions on scalar registers (e.g., 32-bit floats)

void sinx(int N, int terms, float* x, float* result)
{
   for (int i=0; i<N; i++)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}

Compiled instruction stream (per element):
ld r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st addr[r2], r0
Stanford CS348V, Winter 2018


Vector program (using AVX intrinsics)

#include <immintrin.h>

void sinx(int N, int terms, float* x, float* sinx)
{
   float three_fact = 6; // 3!
   for (int i=0; i<N; i+=8)
   {
      __m256 origx = _mm256_load_ps(&x[i]);
      __m256 value = origx;
      __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
      __m256 denom = _mm256_broadcast_ss(&three_fact);
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         // value += sign * numer / denom
         __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
         value = _mm256_add_ps(value, tmp);

         numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
         denom = _mm256_mul_ps(denom, _mm256_set1_ps((float)((2*j+2) * (2*j+3))));
         sign *= -1;
      }

      _mm256_store_ps(&sinx[i], value);
   }
}

Compiled program: processes eight array elements simultaneously using vector instructions on 256-bit vector registers.

vloadps xmm0, addr[r1]
vmulps xmm1, xmm0, xmm0
vmulps xmm1, xmm1, xmm0
...
vstoreps addr[xmm2], xmm0

Stanford CS348V, Winter 2018
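One practical note on this version: _mm256_load_ps / _mm256_store_ps assume 32-byte-aligned addresses, and the loop assumes N is a multiple of 8; general-purpose code would use the unaligned variants (_mm256_loadu_ps / _mm256_storeu_ps) or add a scalar loop for the remainder.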


16 SIMD cores: 128 elements in parallel


16 cores, 128 ALUs, 16 simultaneous instruction streams


Stanford CS348V, Winter 2018
Data-parallel expression
(in Kayvon’s fictitious data-parallel language)

void sinx(int N, int terms, float* x, float* result)
{
   // declare independent loop iterations
   forall (int i from 0 to N-1)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}

The compiler understands that loop iterations are independent and that the same loop body will be executed on a large number of data elements.

This abstraction facilitates automatic generation of both multi-core parallel code and vector instructions that make use of SIMD processing capabilities within a core.

Stanford CS348V, Winter 2018


What about conditional execution?

(Assume the logic below is to be executed for each element in input array ‘A’, producing output into the array ‘result’. Time in clocks runs down the page; one column per ALU lane, 1 … 8.)

<unconditional code>
float x = A[i];

if (x > 0) {
   float tmp = exp(x, 5.f);
   tmp *= kMyConst1;
   x = tmp + kMyConst2;
} else {
   float tmp = kMyConst1;
   x = 2.f * tmp;
}

<resume unconditional code>
result[i] = x;

Stanford CS348V, Winter 2018


What about conditional execution? (continued)
Each ALU lane evaluates the branch condition for its own element; in this example the eight lanes produce the mask T T F T F F F F for (x > 0).

Mask (discard) output of ALU
While the “if” clause executes, lanes whose mask is F have their results discarded; the roles flip for the “else” clause.
Not all ALUs do useful work!
Worst case: 1/8 peak performance

After branch: continue at full performance
Once both sides of the branch have executed, all lanes rejoin and run the unconditional code at full rate.

Stanford CS348V, Winter 2018
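Below is a minimal sketch (not from the lecture) of how this if/else could be expressed with explicit SIMD masking using AVX intrinsics: both sides of the branch are computed for all eight lanes, and a per-lane mask selects which result to keep. The call exp(x, 5.f) from the slide is replaced by a simple x*x stand-in so the sketch stays self-contained; kMyConst1 and kMyConst2 are passed as parameters.

#include <immintrin.h>

// Illustrative sketch of masked SIMD execution (assumes N is a multiple of 8).
void branch_example(int N, const float* A, float* result,
                    float kMyConst1, float kMyConst2)
{
    for (int i = 0; i < N; i += 8) {
        __m256 x    = _mm256_loadu_ps(&A[i]);
        __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ); // per-lane (x > 0)

        // "if" side (stand-in for exp(x, 5.f): x*x), then scale and offset
        __m256 tmp_if = _mm256_mul_ps(x, x);
        tmp_if        = _mm256_mul_ps(tmp_if, _mm256_set1_ps(kMyConst1));
        __m256 x_if   = _mm256_add_ps(tmp_if, _mm256_set1_ps(kMyConst2));

        // "else" side: x = 2 * kMyConst1
        __m256 x_else = _mm256_set1_ps(2.f * kMyConst1);

        // keep the "if" result where the mask is true, the "else" result elsewhere
        x = _mm256_blendv_ps(x_else, x_if, mask);
        _mm256_storeu_ps(&result[i], x);
    }
}

Note that, just as on the slide, both clauses cost execution time for every lane; the mask only controls which results survive.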


Example: eight-core Intel Xeon E5-1660 v4

8 cores
8 SIMD ALUs per core
(AVX2 instructions)

490 GFLOPs (@3.2 GHz)


(140 Watts)

* Showing only AVX math units, and fetch/decode unit for AVX (additional capability for integer math)
Stanford CS348V, Winter 2018
Example: NVIDIA GTX 1080 GPU

20 cores (“SMs”)
128 SIMD ALUs per core (@1.6 GHz) = 8.1 TFLOPs (180 Watts) Stanford CS348V, Winter 2018
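(Where that number comes from, assuming each ALU retires one fused multiply-add, i.e., 2 flops, per clock: 20 SMs × 128 ALUs × 2 flops × 1.6 GHz ≈ 8.2 TFLOPs, in line with the 8.1 TFLOPs figure.)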
Part 2:
accessing memory

Memory

Stanford CS348V, Winter 2018


Hardware multi-threading

Stanford CS348V, Winter 2018


Terminology
▪ Memory latency
- The amount of time for a memory request (e.g., load, store) from a
processor to be serviced by the memory system
- Example: 100 cycles, 100 nsec

▪ Memory bandwidth
- The rate at which the memory system can provide data to a processor
- Example: 20 GB/s

Stanford CS348V, Winter 2018


Stalls
▪ A processor “stalls” when it cannot run the next instruction in
an instruction stream because of a dependency on a previous
instruction.

▪ Accessing memory is a major source of stalls


ld  r0 mem[r2]
ld  r1 mem[r3]
add r0, r0, r1

Dependency: the ‘add’ instruction cannot execute until the data at mem[r2] and mem[r3] has been loaded from memory.

▪ Memory access times ~ 100’s of cycles


- Memory “access time” is a measure of latency

Stanford CS348V, Winter 2018


Review: why do modern processors have caches?

Core 1: L1 cache (32 KB) → L2 cache (256 KB)
   ...
Core N: L1 cache (32 KB) → L2 cache (256 KB)
All cores share an L3 cache (8 MB), connected over a 25 GB/sec bus to memory (DDR3 DRAM, gigabytes).
Stanford CS348V, Winter 2018


Caches reduce length of stalls (reduce latency)
Processors run efficiently when data is resident in caches.
Caches reduce memory access latency. *

(Same memory hierarchy as the previous slide: per-core L1 (32 KB) and L2 (256 KB) caches, a shared L3 (8 MB), and a 25 GB/sec bus to DDR3 DRAM.)

* Caches also provide high bandwidth data transfer to the CPU.

Stanford CS348V, Winter 2018
Prefetching reduces stalls (hides latency)
▪ All modern CPUs have logic for prefetching data into caches
  - Dynamically analyze the program’s access patterns, predict what it will access soon
▪ Reduces stalls since data is resident in cache when accessed

predict value of r2, initiate load
predict value of r3, initiate load
...
data arrives in cache
...
data arrives in cache
...
ld  r0 mem[r2]      ← these loads are cache hits
ld  r1 mem[r3]
add r0, r0, r1

Note: prefetching can also reduce performance if the guess is wrong
(hogs bandwidth, pollutes caches)
(more detail later in the course)

Stanford CS348V, Winter 2018
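As an aside, the same idea can be expressed explicitly in software. Here is a minimal sketch (not from the lecture) using the x86 _mm_prefetch intrinsic, with a prefetch distance of 16 elements chosen arbitrarily for illustration:

#include <immintrin.h>

// While processing element i, request the cache line holding element i+16 so it
// (hopefully) arrives before it is needed. Hardware prefetchers do this
// automatically for simple streaming patterns like this one.
float sum_with_prefetch(int N, const float* a)
{
    float sum = 0.f;
    for (int i = 0; i < N; i++) {
        if (i + 16 < N)
            _mm_prefetch((const char*)&a[i + 16], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}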


Multi-threading reduces stalls
▪ Idea: interleave processing of multiple threads on the same
core to hide stalls

▪ Like prefetching, multi-threading is a latency hiding, not a


latency reducing technique

Stanford CS348V, Winter 2018


Hiding stalls with multi-threading
Thread 1
Elements 0 … 7
Time

1 Core (1 thread)

Fetch/
Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

Exec Ctx

Stanford CS348V, Winter 2018


Hiding stalls with multi-threading
Thread 1 Thread 2 Thread 3 Thread 4
Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31
Time
1 2 3 4

1 Core (4 hardware threads)


Fetch/
Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

1 2

3 4

Stanford CS348V, Winter 2018


Hiding stalls with multi-threading
Thread 1 Thread 2 Thread 3 Thread 4
Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31
Time
1 2 3 4

1 Core (4 hardware threads)


Stall
Fetch/
Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

Runnable
1 2

3 4

Stanford CS348V, Winter 2018


Hiding stalls with multi-threading
Thread 1 Thread 2 Thread 3 Thread 4
Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31
Time
1 2 3 4

1 Core (4 hardware threads)


(Timeline: each thread alternates between executing, stalling, and waiting while runnable; when one thread stalls, the core switches to another runnable thread, until each thread completes: Done!)

(Core diagram: Fetch/Decode; ALU 0 … ALU 7; four thread contexts 1-4.)
Stanford CS348V, Winter 2018
Throughput computing trade-off
Thread 1 Thread 2 Thread 3 Thread 4
Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31
Time

Key idea of throughput-oriented systems:
Potentially increase the time to complete work by any one thread, in order to increase overall system throughput when running multiple threads.

(During the period marked “Runnable” in the diagram, the thread is runnable, but it is not being executed by the processor. The core is running some other thread. Eventually the thread completes: Done!)

Stanford CS348V, Winter 2018


Kayvon’s fictitious multi-core chip
16 cores
8 SIMD ALUs per core (128 total)
4 threads per core
16 simultaneous instruction streams
64 total concurrent instruction streams
512 independent pieces of work are needed to run the chip with maximal latency-hiding ability

Stanford CS348V, Winter 2018
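(Arithmetic check: 16 cores × 4 threads per core = 64 concurrent instruction streams, and each stream issues 8-wide SIMD instructions, so 64 × 8 = 512 independent data elements are needed to fill the machine.)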


GPUs: extreme throughput-oriented processors

NVIDIA GTX 1080 core (“SM”)
(Diagram: eight Fetch/Decode units; SIMD function units with control shared across 32 units, 1 MUL-ADD per clock; execution contexts (registers): 256 KB; “shared” memory: 96 KB)

▪ Instructions operate on 32 pieces of data at a time (instruction streams called “warps”).
▪ Think: warp = thread issuing 32-wide vector instructions
▪ Different instructions from up to four warps can be executed simultaneously (simultaneous multi-threading)
▪ Up to 64 warps are interleaved on the SM (interleaved multi-threading)
▪ Over 2,048 elements can be processed concurrently by a core

Source: NVIDIA Pascal Tuning Guide    Stanford CS348V, Winter 2018
NVIDIA GTX 1080

There are 20 SM cores on the GTX 1080:


That’s 40,960 pieces of data being processed concurrently to get maximal latency hiding!
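(That is 20 SMs × 2,048 concurrently-processed elements per SM = 40,960.)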
Stanford CS348V, Winter 2018
Another example:
for review and to check your understanding
(if you understand the following sequence you understand this lecture)

Stanford CS348V, Winter 2018


Running code on a simple processor

My very simple program: compute sin(x) using Taylor expansion
My very simple processor: completes one instruction per clock
(Fetch/Decode; ALU (Execute); Execution Context)

void sinx(int N, int terms, float* x, float* result)
{
   for (int i=0; i<N; i++)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}
Stanford CS348V, Winter 2018
Review: superscalar execution
Unmodified program (the same sinx code as on the previous slide)

My single-core, superscalar processor: executes up to two instructions per clock from a single instruction stream.
(Fetch/Decode ×2; Exec 1, Exec 2; Execution Context)

Independent operations in the instruction stream are detected by the processor at run-time and may be executed in parallel on execution units 1 and 2.
Stanford CS348V, Winter 2018
Review: multi-core execution (two cores)
Modify the program to create two threads of control (two instruction streams)

typedef struct {
   int N;
   int terms;
   float* x;
   float* result;
} my_args;

void* my_thread_start(void* thread_arg) {
   my_args* thread_args = (my_args*)thread_arg;
   sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result); // do work
   return NULL;
}

void parallel_sinx(int N, int terms, float* x, float* result) {
   pthread_t thread_id;
   my_args args;

   args.N = N/2;
   args.terms = terms;
   args.x = x;
   args.result = result;

   // launch thread
   pthread_create(&thread_id, NULL, my_thread_start, &args);
   sinx(N - args.N, terms, x + args.N, result + args.N); // do work
   pthread_join(thread_id, NULL);
}

My dual-core processor: executes one instruction per clock from an instruction stream on each core.
(Per core: Fetch/Decode; ALU (Execute); Execution Context)

Stanford CS348V, Winter 2018


Review: multi-core + superscalar execution
Same two-thread program as on the previous slide.

My superscalar dual-core processor: executes up to two instructions per clock from an instruction stream on each core.
(Per core: Fetch/Decode ×2; Exec 1, Exec 2; Execution Context)

Stanford CS348V, Winter 2018


Review: multi-core (four cores)
Modify the program to create many threads of control
(the forall-based sinx code written in Kayvon’s fictitious data-parallel language, shown earlier)

My quad-core processor: executes one instruction per clock from an instruction stream on each core.
(Per core: Fetch/Decode; ALU (Execute); Execution Context)

Stanford CS348V, Winter 2018


Review: four, 8-wide SIMD cores
Observation: the program must execute many iterations of the same loop body.
Optimization: share an instruction stream across the execution of multiple iterations (single instruction, multiple data = SIMD).
(Same forall-based sinx code as before.)

My SIMD quad-core processor: executes one 8-wide SIMD instruction per clock from an instruction stream on each core.
(Per core: Fetch/Decode; 8-wide SIMD execution; Execution Context)
Stanford CS348V, Winter 2018
Review: four SIMD, multi-threaded cores
Observation: memory operations (the load of x[i] and the store of result[i]) have very long latency.
Solution: hide the latency of loading data for one iteration by executing arithmetic instructions from other iterations.
(Same forall-based sinx code as before.)

My multi-threaded, SIMD quad-core processor: executes one SIMD instruction per clock from one instruction stream on each core, but can switch to processing the other instruction stream when faced with a stall.
(Per core: Fetch/Decode; 8-wide SIMD execution; multiple Execution Contexts)
Stanford CS348V, Winter 2018
Summary: four superscalar, SIMD, multi-threaded cores
My multi-threaded, superscalar, SIMD quad-core processor:
executes up to two instructions per clock from one instruction stream on each core
(in this example: one SIMD instruction + one scalar instruction).
Processor can switch to execute the other instruction stream when faced with stall.
(Per core: two Fetch/Decode units; Exec 1 and SIMD Exec 2; two Execution Contexts. Four such cores.)
Stanford CS348V, Winter 2018


Connecting it all together
Kayvon’s simple quad-core processor:
Four cores, two-way multi-threading per core (max eight threads active on chip at once), up to two
instructions per clock per core (one of those instructions is 8-wide SIMD)

(Diagram: four cores; per core: two Fetch/Decode units, Exec 1 and SIMD Exec 2, two Execution Contexts, an L1 Cache, and an L2 Cache. The cores connect via an on-chip interconnect to a shared L3 Cache and a Memory Controller, which drives the memory bus to DRAM.)

Stanford CS348V, Winter 2018


Thought experiment
▪ You write a C application that spawns two pthreads
▪ The application runs on the processor shown below
- Two cores, two execution contexts per core, up to two instructions per clock; one instruction is an 8-wide SIMD instruction.

▪ Question: “who” is responsible for mapping your pthreads to the


processor’s thread execution contexts?
Answer: the operating system

▪ Question: If you were the OS, how would you assign the two threads to the four available execution contexts?

▪ Another question: How would you assign threads to execution contexts if your C program spawned five pthreads?

(Processor diagram: two cores; per core: two Fetch/Decode units, Exec 1 and SIMD Exec 2, two Execution Contexts.)
Stanford CS348V, Winter 2018
Another thought experiment
Task: element-wise multiplication of two vectors A and B (A × B = C)
Assume the vectors contain millions of elements
- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Three memory operations (12 bytes) for every MUL
NVIDIA GTX 1080 GPU can do 2560 MULs per clock (@ 1.6 GHz)
Need ~50 TB/sec of bandwidth to keep the functional units busy (only have 320 GB/sec)

<1% GPU efficiency… but 4.2x faster than an eight-core CPU!
(A 3.2 GHz Xeon E5v4 eight-core CPU connected to a 76 GB/sec memory bus will exhibit ~3% efficiency on this computation.)
Stanford CS348V, Winter 2018
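(Where the ~50 TB/sec comes from: 2560 MULs/clock × 1.6 GHz ≈ 4.1 × 10^12 MULs/sec, and each MUL needs 12 bytes of memory traffic, giving ≈ 49 TB/sec of demand against the 320 GB/sec the GPU provides, roughly a 150x shortfall, hence the <1% efficiency.)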
Bandwidth limited!

If processors request data at too high a rate, the memory system cannot keep up.
No amount of latency hiding helps this.

Bandwidth is a critical resource.

Overcoming bandwidth limits is a common challenge for application developers on throughput-optimized systems.

Stanford CS348V, Winter 2018


Hardware specialization

Stanford CS348V, Winter 2018


Why does energy efficiency matter?
▪ General mobile processing rule: the longer a task runs the less power it can use
- Processor’s power consumption is limited by heat generated (efficiency is
required for more than just maximizing battery life)

Electrical limit: max power that can be supplied to chip

Die temp: (junction temp -- Tj): chip becomes unreliable above this temp
(chip can run at high power for short period of time until chip heats to Tj)

Case temp: mobile device gets too hot for user to comfortably hold
(chip is at suitable operating temp, but heat is dissipating into case)
Power

Battery life: chip and case are cool, but want to reduce power
consumption to sustain long battery life for given task

iPhone 6 battery: 7 watt-hours


9.7in iPad Pro battery: 28 watt-hours
15in Macbook Pro: 99 watt-hours

Time

Slide credit: adapted from an original slide by M. Shebanow, HPG 2013 keynote. Stanford CS348V, Winter 2018
Efficiency benefits of compute specialization
▪ Rules of thumb: compared to high-quality C code on CPU...

▪ Throughput-maximized processor architectures: e.g., GPU cores


- Approximately 10x improvement in perf / watt
- Assuming code maps well to wide data-parallel execution and is compute bound

▪ Fixed-function ASIC (“application-specific integrated circuit”)


- Can approach 100-1000x or greater improvement in perf/watt
- Assuming code is compute bound and is not floating-point math

[Source: Chung et al. 2010 , Dally 08] [Figure credit Eric Chung]
Stanford CS348V, Winter 2018
Hardware specialization increases efficiency

(Plots from Chung et al.: efficiency vs. lg2(N), the data set size, for FPGA, GPUs, and ASIC relative to a CPU.)

ASIC delivers the same performance as one CPU core with ~1/1000th the chip area.
GPU cores: ~5-7 times more area efficient than CPU cores.
ASIC delivers the same performance as one CPU core with only ~1/100th the power.
[Chung et al. MICRO 2010] Stanford CS348V, Winter 2018
Modern systems use ASICs for…
▪ Image/video encode/decode (e.g., H.264, JPG)
▪ Audio recording/playback
▪ Voice “wake up” (e.g., Ok Google)
▪ Camera “RAW” processing: processing data acquired by image
sensor into images that are pleasing to humans
▪ Many 3D graphics tasks (rasterization, texture mapping,
occlusion using the Z-buffer)
▪ Significant modern interest in ASICs for deep network
evaluation (e.g., Google’s Tensor Processing Unit)

Stanford CS348V, Winter 2018


Qualcomm Hexagon DSP
▪ Originally used for audio/LTE support on Qualcomm SoCs
▪ Multi-threaded, VLIW DSP
▪ Third major programmable unit on modern Qualcomm SoCs
  - Multi-core CPU
  - Multi-core GPU (Adreno)
  - Hexagon DSP

VLIW: area & power efficient multi-issue
• Variable sized instruction packets (1 to 4 instructions per packet)
• Dual 64-bit execution units
• Standard 8/16/32/64-bit data types
• SIMD vectorized MPY / ALU / SHIFT, Permute, BitOps
• Up to 8 16-bit MAC/cycle
• 2 SP FMA/cycle
• Dual 64-bit load/store units (also 32-bit ALU)
• Unified 32x32-bit general register file (best for the compiler); no separate address register file or accumulator registers; per-thread register file

(Block diagram: Instruction Cache → Instruction Unit → two data units (load/store/ALU) and two execution units (64-bit vector) → Data Cache → L2 Cache / TCM → DDR memory)

[Diagram source: Qualcomm Technologies, Inc.]

Stanford CS348V, Winter 2018
Summary: choosing the right tool for the job

Energy-optimized CPU → Throughput-oriented processor (GPU) → Programmable DSP → FPGA / future reconfigurable logic → ASIC

Efficiency (rules of thumb): GPU ~10x more efficient than the CPU; FPGA ~100x??? (jury still out); ASIC ~100-1000x more efficient (video encode/decode, audio playback, camera RAW processing, neural nets (future?)).

Programmability: the energy-optimized CPU is easiest to program; the middle of the spectrum is difficult to program (making it easier is an active area of research); an ASIC is not programmable and costs 10-100’s of millions of dollars to design / verify / create.

Credit: Pat Hanrahan for this taxonomy

Stanford CS348V, Winter 2018
Data movement has high energy cost
▪ Rule of thumb in mobile system design: always seek to reduce amount of
data transferred from memory
- Earlier in class we discussed minimizing communication to reduce stalls (poor performance).
Now, we wish to reduce communication to reduce energy consumption

▪ “Ballpark” numbers [Sources: Bill Dally (NVIDIA), Tom Olson (ARM)]
  - Integer op: ~1 pJ *
  - Floating point op: ~20 pJ *
  - Reading 64 bits from small local SRAM (1mm away on chip): ~26 pJ
  - Reading 64 bits from low power mobile DRAM (LPDDR): ~1200 pJ
  Suggests that recomputing values, rather than storing and reloading them, is a better answer when optimizing code for energy efficiency!

▪ Implications
  - Reading 10 GB/sec from memory: ~1.6 watts
  - Entire power budget for mobile GPU: ~1 watt (remember the phone is also running the CPU, display, radios, etc.)
  - iPhone 6 battery: ~7 watt-hours (note: my Macbook Pro laptop: 99 watt-hour battery)
  - Exploiting locality matters!!!

* Cost to just perform the logical operation, not counting overhead of instruction decode, load data from registers, etc.

Stanford CS348V, Winter 2018
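(Sanity check on the ~1.6 watt figure: 10 GB/sec ÷ 8 bytes per 64-bit read ≈ 1.25 × 10^9 reads/sec; at ~1200 pJ per read that is ~1.5 watts, in line with the slide’s estimate.)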
Welcome to cs348v!
▪ Make sure you are signed up on Piazza so you get
announcements

▪ Tonight’s reading:
- “The Compute Architecture of Intel Processor Graphics Gen9” - Intel Technical
Report, 2015
- “The Rise of Mobile Visual Computing Systems”, Fatahalian, IEEE Mobile
Computing 2016

Stanford CS348V, Winter 2018


More review

Stanford CS348V, Winter 2018


For the rest of this class, know these terms
▪ Multi-core processor
▪ SIMD execution
▪ Coherent control flow
▪ Hardware multi-threading
- Interleaved multi-threading
- Simultaneous multi-threading
▪ Memory latency
▪ Memory bandwidth
▪ Bandwidth bound application
▪ Arithmetic intensity

Stanford CS348V, Winter 2018


Which program performs better?
(Note: an answer probably needs to state its assumptions.)

Program 1
void add(int n, float* A, float* B, float* C) {
   for (int i=0; i<n; i++)
      C[i] = A[i] + B[i];
}

void mul(int n, float* A, float* B, float* C) {
   for (int i=0; i<n; i++)
      C[i] = A[i] * B[i];
}

float *A, *B, *C, *D, *E, *tmp1, *tmp2;

// assume arrays are allocated here

// compute E = D + ((A + B) * C)
add(n, A, B, tmp1);
mul(n, tmp1, C, tmp2);
add(n, tmp2, D, E);

Program 2
void fused(int n, float* A, float* B, float* C, float* D, float* E) {
   for (int i=0; i<n; i++)
      E[i] = D[i] + (A[i] + B[i]) * C[i];
}

// compute E = D + (A + B) * C
fused(n, A, B, C, D, E);

Stanford CS348V, Winter 2018


More thought questions
(Same Program 1 and Program 2 as on the previous slide.)

Which code structuring style would you rather write?

Consider running either of these programs: would CPU support for hardware multi-threading help performance?

Stanford CS348V, Winter 2018


Visualizing interleaved and simultaneous
multi-threading
(and combinations thereof)

Stanford CS348V, Winter 2018


Interleaved multi-threading
Consider a processor with:
▪ Two execution contexts
▪ One fetch and decode unit (one instruction per clock)
▪ One ALU (to execute the instruction)
time (clocks)

Thread 0

Thread 1
= ALU executing T0 at this time
= ALU executing T1 at this time

In an interleaved multi-threading scenario, the threads share the processor.


(This is a visualization of when threads are having their instructions executed by the ALU.)
Stanford CS348V, Winter 2018
Interleaved multi-threading
Consider a processor with:
▪ Two execution contexts
▪ One fetch and decode unit (one instruction per clock)
▪ One ALU (to execute the instruction)

time (clocks)

Thread 0

Thread 1
= ALU executing T0 at this time
= ALU executing T1 at this time

Same as previous slide, but now just a different scheduling order of the threads
(fine-grained interleaving)

Stanford CS348V, Winter 2018


Simultaneous multi-threading
Consider a processor with:
▪ Two execution contexts
▪ Two fetch and decode units (two instructions per clock)
▪ Two ALUs (to execute the two instructions)
time (clocks)

Thread 0

Thread 1
= ALU executing T0 at this time
= ALU executing T1 at this time

In a simultaneous multi-threading scenario, the threads execute simultaneously on the two ALUs. (Note: no ILP within a thread, since each thread runs sequentially on one ALU.)

Stanford CS348V, Winter 2018


Combining simultaneous and interleaved multi-threading
Consider a processor with:
▪ Four execution contexts
▪ Two fetch and decode units (two instructions per clock, choose two of four threads)
▪ Two ALUs (to execute the two instructions)
time (clocks)

Thread 0

Thread 1

Thread 2

Thread 3

= some ALU executing T0 at this time = some ALU executing T2 at this time
= some ALU executing T1 at this time = some ALU executing T3 at this time
Stanford CS348V, Winter 2018
Another way to visualize execution (ALU-centric view)
Consider a processor with:
▪ Four execution contexts
▪ Two fetch and decode units (two instructions per clock, choose two of four threads)
▪ Two ALUs (to execute the two instructions)

Now the graph is visualizing what each ALU is doing each clock:
time (clocks)

ALU 0

ALU 1

= executing T0 at this time = executing T2 at this time


= executing T1 at this time = executing T3 at this time

Stanford CS348V, Winter 2018


Instructions can be drawn from same thread (ILP)
Consider a processor with:
▪ Four execution contexts
▪ Two fetch and decode units (two instructions per clock, choose any two
independent instructions from the four threads)
▪ Two ALUs (to execute the two instructions)

time (clocks)

ALU 0

ALU 1

Two instructions from same thread executing simultaneously.

= executing T0 at this time = executing T2 at this time


= executing T1 at this time = executing T3 at this time
Stanford CS348V, Winter 2018
