
Lecture 1:

Course Introduction +
Review of Throughput Hardware Concepts

Visual Computing Systems


Stanford CS348V : Winter 2018
Hi!

Me: Prof. Kayvon Fatahalian
Your TA: Raj Setaluri

Stanford CS348V, Winter 2018


Visual computing applications

2D/3D graphics
Computational photography and image processing
Understanding the contents of images and videos

Stanford CS348V, Winter 2018
Visual Computing Systems
— Some History

(why I get so excited about this topic)

Stanford CS348V, Winter 2018


Ivan Sutherland’s Sketchpad on MIT TX-2 (1962)
Shoup’s SuperPaint (PARC 1972-73)
The frame buffer: 16 2K shift registers (640 x 486 x 8 bits)
Xerox Alto (1973)

Bravo (WYSIWYG) TI 74181 ALU


Goal: render everything you’ve ever seen

“Road to Pt. Reyes”


LucasFilm (1983)
Pixar’s Toy Story (1995)

“We take an average of three hours to draw a single frame on the fastest computer money can buy.”
- Steve Jobs
UNC Pixel Planes (1981), computation-enhanced frame buffer
Jim Clark’s Geometry Engine (1982)

ASIC for geometric transforms used in real-time graphics.
Real-time (30 fps) on an NVIDIA Titan X

Unreal Engine Kite Demo (Epic Games 2015)


NVIDIA Titan X Pascal GPU (2017)
(~ 12 TFLOPs fp32)
~ ASCI Q (top US supercomputer circa 2002)
Modern GPU: heterogeneous multi-core
(Block diagram: many SIMD execution cores, each with its own fetch/exec units and cache; fixed-function units for Texture, Tessellate, Clip/Cull, Rasterize, and Z-buffer/Blend; a GDDR5 memory interface; and a chip-wide Scheduler / Work Distributor.)

Multi-threaded, SIMD cores


Custom circuits for key graphics arithmetic
Custom circuits for HW-assisted graphics-specific DRAM compression
HW logic for scheduling work onto these resources
Domain-specific languages for heterogeneous computing
OpenGL Graphics Pipeline (circa 2007)

Input vertex buffer
  ↓
Vertex Generation
  ↓  3D vertex stream
Vertex Processing
  ↓  Projected vertex stream
Primitive Generation
  ↓  Primitive stream
Fragment Generation (“Rasterization”)
  ↓  Fragment stream
Fragment Processing
  ↓  Fragment stream
Pixel Operations
  ↓
Output image buffer (pixels)
Domain-specific languages for heterogeneous computing
OpenGL Graphics Pipeline (circa 2007)

(Same pipeline as above; the programmer supplies a “fragment shader”, i.e., a kernel function mapped onto the input fragment stream by the Fragment Processing stage.)

uniform sampler2D myTexture;   // read-only global variables
uniform vec3 lightDir;

varying vec3 norm;             // “per-element” inputs
varying vec2 uv;

void myFragmentShader()
{
    vec3 kd = texture2D(myTexture, uv).rgb;
    kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
    gl_FragColor = vec4(kd, 1.0);   // per-element output: RGBA surface color at pixel
}
Emerging state-of-the-art visual computing
systems today…
▪ Intelligent cameras in smartphones
▪ Cloud servers (“infinite” computing and storage at your disposal as a service)
▪ Proliferation of specialized compute accelerators
  - For image processing, machine learning
▪ Proliferation of high-resolution image sensors…
Capturing pixels to communicate
Ingesting/serving the world’s photos
2B photo uploads and shares per day across Facebook sites (incl. Instagram + WhatsApp) [FB 2015]

Ingesting/streaming the world’s video
YouTube 2015: 300 hours uploaded per minute [YouTube]
Cisco VNI projection: 80-90% of 2019 internet traffic will be video (64% in 2014).
Richer content: beyond a single image
■ Example: Apple’s “Live Photos”
■ Each photo is not only a single frame, but a few seconds of video before and after the
shutter is clicked
Facebook Live
VR output
Example: Google’s Jump VR video
Input stream: 16 4K GoPro cameras
Register + 3D align video stream (on edge device)
Broadcast encoded video stream across
the country to millions of viewers
High resolution, multi-camera
Facebook Surround 360 VR video

2048 x 2048 PointGrey Camera @ 30 FPS

14 cameras
8K x 8K stereo panorama output
VR: high resolution requirements

180°
~5°

Future “retina” VR display:
57 ppd covering 180°
= 10K x 10K display per eye
= 200 MPixel

RAW data rate @ 120Hz ≈ 72 GB/sec

iPhone 6: 4.7 in “retina” display:
1.3 MPixel
326 ppi → 57 ppd
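A quick check of these numbers (assuming the 200 MPixel figure counts both eyes and ~3 bytes of RAW RGB per pixel):
57 ppd × 180° ≈ 10,260 ≈ 10K pixels per dimension
10K × 10K per eye × 2 eyes ≈ 200 MPixel
200 MPixel × 3 bytes × 120 Hz ≈ 72 GB/sec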
Enhancing communication: understanding
images to improve acquired content
AutoEnhance: photo “fix up” [Hays 2007]
My bad vacation photo → part to fix → similar photos others have taken → fixed!

Portrait Mode
On every vehicle: analyzing images for robot navigation
High-resolution video (moving camera)

[Image Credit: Kundu et al. 2016]
NVIDIA Drive PX

Tegra X1 (1 TFlop fp16 at 1GHz)


On every corner: analyzing images for urban efficiency

“Managing urban areas has become one of the most


important development challenges of the 21st
century. Our success or failure in building sustainable
cities will be a major factor in the success of the
post-2015 UN development agenda.”

- UN Dept. of Economic and Social Affairs


High resolution (static camera)
Sensing human social interactions [Joo 2015]

CMU Panoptic Studio


480 video cameras (640 x 480 @ 25fps)
116 GPixel video sensor
(2.9 TPixel /sec)
Capturing social interactions

[Courtesy Yaser Sheikh, Tomas Simon, Hanbyul Joo]


Capturing social interactions

[Courtesy Yaser Sheikh, Tomas Simon, Hanbyul Joo]


On every human: analyzing egocentric images
to augment humans

What does this say?

What is this?

Gwangjang Market (Seoul)


AR requires low-latency localization and
scene object recognition
[Tamburo 2016]
Smart headlight system

~1000 Hz (1 - 1.5 ms latency)


[Tamburo 2016]
Seeing clearly through precipitation
Future challenge: recording and analyzing the
world’s visual information, so computers can
understand and reason about it
Capturing everything about the visual world
To understand people
To understand the world around vehicles/drones
To understand cities

Mobile
Continuous (always on)
Exceptionally high resolution
Capture for computers to analyze, not humans to watch
What is this course about?
1. Understanding the characteristics of important visual computing workloads
2. Understanding techniques used to achieve efficient system implementations
(Course structure diagram)

VISUAL COMPUTING WORKLOADS
Algorithms for 3D graphics, image processing, compression, etc.

MACHINE ORGANIZATION
High-throughput hardware designs: parallel, heterogeneous, specialized
Parallelism
Exploiting locality
Minimizing communication

DESIGN OF PROGRAMMING ABSTRACTIONS FOR VISUAL COMPUTING
Choice of programming primitives
Level of abstraction
In other words
It is about understanding the fundamental
structure of problems in the visual computing
domain, and then leveraging that
understanding to…

To design more efficient algorithms
To build the most efficient hardware to run these applications
To design the right programming systems to make developing new applications simpler, more productive, and highly performant
Course Logistics

Stanford CS348V, Winter 2018


Logistics
▪ Course web site:
- http://graphics.stanford.edu/courses/cs348v-18-winter

▪ All announcements will go out via Piazza


- https://piazza.com/class/jc1f626cfne6r6

▪ Kayvon’s office hours: Tuesday after class, or by appt.

Stanford CS348V, Winter 2018


Expectations of you
▪ 20% participation
- There will be ~1 assigned paper reading per class
- Everyone is expected to come to class and participate in discussions based on readings
- You are encouraged to discuss papers and/or my lectures on the course discussion board.
- If you form a weekly course reading/study group, I will buy pizza for said group.

▪ 30% mini-assignments (3 short programming assignments)


- Assignment 1: analyze parallel program performance on a multi-core CPU
- Assignment 2: implement and optimize a basic RAW image processing pipeline
- Assignment 3: optimize performance of a modern DNN module

▪ 20% 1 take-home “exam”

▪ 30% self-selected final project


- I suggest you start thinking about projects now (can be teams of up to two)

Stanford CS348V, Winter 2018


Major course themes/topics
Part 1: High Efficiency Image and Video Processing

Overview of a Modern Digital Camera Processing Pipeline


Image Processing Algorithms You Should Know
Efficiently Scheduling Image Processing Algorithms on Parallel Hardware
Specialized Hardware for Image Processing
Lossy Image (JPG) and Video (H.264) Compression
Video Processing/Synthesis for Virtual Reality Display

Part 2: Accelerating Deep Learning for Computer Vision (from a systems perspective)

Workload Characteristics of DNN Inference for Image Analysis


Scheduling and Algorithms for Parallel DNN Training at Scale
A Case Study of Algorithmic Optimizations for Object Detection
Leveraging Task-Specific DNN Structure for Improving Performance and Accuracy
Hardware Accelerators for DNN Inference
Design Space of Dataflow Programming Abstractions for Deep Learning
Enhancing Efficiency Through Model Specialization (in particular for video)
Efficient Inference at Datacenter Scale

Stanford CS348V, Winter 2018


Algorithmic innovation in image classification
Improving accuracy-per-unit cost using better DNN designs?
2014 → 2017: ~25x improvement in cost at similar accuracy

                       ImageNet Top-1 Accuracy   Num Params   Cost/image (MADDs)
VGG-16 [2014]          71.5%                     138M         15B
GoogleNet [2015]       70%                       6.8M         1.5B
ResNet-18 [2016]       73% *                     11.7M        1.8B
MobileNet-224 [2017]   70.5%                     4.2M         0.6B

* 10-crop results (ResNet 1-crop results are similar to other DNNs in this table)

Stanford CS348V, Winter 2018
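(The ~25x figure follows from the cost column: VGG-16 needs roughly 15B multiply-adds per image versus MobileNet-224’s 0.6B, about a 25x reduction at roughly similar top-1 accuracy.)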
Major course themes/topics
Part 3: The GPU Accelerated 3D Graphics Pipeline

Real-Time 3D Graphics Pipeline Architecture


Hardware Acceleration of Z-Buffering and Texturing
Scheduling the Graphics Pipeline onto a GPU
Domain Specific Languages for Shading

Stanford CS348V, Winter 2018


Review:
key principles of modern
throughput computing hardware

Stanford CS348V, Winter 2018


Review concepts
▪ What are these design concepts, and what problem/goals do they
address?
- Multi-core processing
- SIMD processing
- Hardware multi-threading

▪ What is the motivation for specialization via:


- Multiple types of processors (e.g., CPUs, GPUs)
- Custom hardware units (ASIC)

▪ Why is memory bandwidth a major constraint when mapping applications to modern computer systems?
Stanford CS348V, Winter 2018
Let’s crack open a modern smartphone
Samsung Galaxy S7 phone with Qualcomm Snapdragon 820 processor

Multi-core ARM CPU
Multi-core GPU (3D graphics, OpenCL data-parallel compute)
Display engine (compresses pixels for transfer to 4K screen)
Video encode/decode ASIC (H.265 @ 4K)
Image Signal Processor (ISP): ASIC for processing pixels off camera (25MP at 30Hz)
“Hexagon” programmable DSP: data-parallel multi-media processing
Stanford CS348V, Winter 2018
Multi-core processing

Stanford CS348V, Winter 2018


Review: what does a processor do?

It runs programs!

The processor executes the instruction referenced by the program counter (PC).
(Executing the instruction will modify machine state: contents of registers, memory, CPU state, etc.)
Then it moves to the next instruction … and executes it … and so on …

_main:
100000f10:  pushq  %rbp
100000f11:  movq   %rsp, %rbp
100000f14:  subq   $32, %rsp
100000f18:  movl   $0, -4(%rbp)
100000f1f:  movl   %edi, -8(%rbp)
100000f22:  movq   %rsi, -16(%rbp)
100000f26:  movl   $1, -20(%rbp)
100000f2d:  movl   $0, -24(%rbp)
100000f34:  cmpl   $10, -24(%rbp)
100000f38:  jge    23 <_main+0x45>
100000f3e:  movl   -20(%rbp), %eax
100000f41:  addl   -20(%rbp), %eax
100000f44:  movl   %eax, -20(%rbp)
100000f47:  movl   -24(%rbp), %eax
100000f4a:  addl   $1, %eax
100000f4d:  movl   %eax, -24(%rbp)
100000f50:  jmp    -33 <_main+0x24>
100000f55:  leaq   58(%rip), %rdi
100000f5c:  movl   -20(%rbp), %esi
100000f5f:  movb   $0, %al
100000f61:  callq  14
100000f66:  xorl   %esi, %esi
100000f68:  movl   %eax, -28(%rbp)
100000f6b:  movl   %esi, %eax
100000f6d:  addq   $32, %rsp
100000f71:  popq   %rbp
100000f72:  retq

Stanford CS348V, Winter 2018


Executing an instruction stream

x[i]

Fetch/
Decode
ld r0, addr[r1]
mul r1, r0, r0
ALU mul r1, r1, r0
(Execute) ...
...
...
Execution ...
Context ...
...
st addr[r2], r0

result[i]

Stanford CS348V, Winter 2018


Executing an instruction stream
My very simple processor: executes one instruction per clock
x[i]

Fetch/
Decode
PC ld r0, addr[r1]
mul r1, r0, r0
ALU mul r1, r1, r0
(Execute) ...
...
...
Execution ...
Context ...
...
st addr[r2], r0

result[i]

Stanford CS348V, Winter 2018


Executing an instruction stream
My very simple processor: executes one instruction per clock
x[i]

Fetch/
Decode
ld r0, addr[r1]
PC mul r1, r0, r0
ALU mul r1, r1, r0
(Execute) ...
...
...
Execution ...
Context ...
...
st addr[r2], r0

result[i]

Stanford CS348V, Winter 2018


Executing an instruction stream
My very simple processor: executes one instruction per clock
x[i]

Fetch/
Decode
ld r0, addr[r1]
mul r1, r0, r0
ALU PC mul r1, r1, r0
(Execute) ...
...
...
Execution ...
Context ...
...
st addr[r2], r0

result[i]

Stanford CS348V, Winter 2018


Quick aside:
Instruction-level parallelism and
superscalar execution

Stanford CS348V, Winter 2018


Instruction level parallelism (ILP) example
a = x*x + y*y + z*z

Consider the following program:


// assume r0=x, r1=y, r2=z

mul r0, r0, r0


mul r1, r1, r1
mul r2, r2, r2
add r0, r0, r1
add r3, r0, r2

// now r3 stores value of program variable ‘a’

This program has five instructions, so it will take five clocks to execute, correct?
Can we do better?

Stanford CS348V, Winter 2018


ILP example
a = x*x + y*y + z*z

Dependency graph: the three multiplies (x*x, y*y, z*z) are independent and can execute in parallel (ILP = 3); the first add depends on two of the multiplies (ILP = 1), and the final add depends on the first add (ILP = 1).

Stanford CS348V, Winter 2018
Superscalar execution
a = x*x + y*y + z*z
// assume r0=x, r1=y, r2=z

1. mul r0, r0, r0


2. mul r1, r1, r1
3. mul r2, r2, r2
4. add r0, r0, r1
5. add r3, r0, r2

// r3 stores value of variable ‘a’

Superscalar execution: processor automatically finds independent instructions in an


instruction sequence and executes them in parallel on multiple execution units!

In this example: instructions 1, 2, and 3 can be executed in parallel
(on a superscalar processor that detects the lack of dependencies at run-time)
But instruction 4 must come after instructions 1 and 2
And instruction 5 must come after instruction 4
Stanford CS348V, Winter 2018
Superscalar execution
Program: computes sin of input x via Taylor expansion

float sinx(int N, int terms, float x)
{
   float value = x;
   float numer = x * x * x;
   int denom = 6; // 3!
   int sign = -1;

   for (int j=1; j<=terms; j++)
   {
      value += sign * numer / denom;
      numer *= x * x;
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
   }

   return value;
}

My single-core, superscalar processor: executes up to two instructions per clock from a single instruction stream.
(Fetch/Decode ×2; Exec 1, Exec 2; Execution Context)

Independent operations in the instruction stream are detected by the processor at run-time and may be executed in parallel on execution units 1 and 2.

Stanford CS348V, Winter 2018


Now consider a program that computes
the sine of many numbers…

Stanford CS348V, Winter 2018


Example program
Compute sin(x) using Taylor expansion: sin(x) = x - x3/3! + x5/5! - x7/7! + ...
for each element of an array of N floating-point numbers
void sinx(int N, int terms, float* x, float* result)
{
for (int i=0; i<N; i++)
{
float value = x[i];
float numer = x[i] * x[i] * x[i];
int denom = 6; // 3!
int sign = -1;

for (int j=1; j<=terms; j++)


{
value += sign * numer / denom;
numer *= x[i] * x[i];
denom *= (2*j+2) * (2*j+3);
sign *= -1;
}

result[i] = value;
}
}
Stanford CS348V, Winter 2018
Multi-core: process multiple instruction streams in parallel

Sixteen cores, sixteen simultaneous instruction streams


Stanford CS348V, Winter 2018
Multi-core examples

Core 1 Core 2

Shared L3 cache

Core 3 Core 4

Intel “Skylake” Core i7 quad-core CPU (2015)
NVIDIA GP104 (GTX 1080) GPU (2016): 20 replicated (“SM”) cores

Stanford CS348V, Winter 2018


More multi-core examples

Core 1

Core 2

Intel Xeon Phi “Knights Landing” 76-core CPU (2015)
Apple A9 dual-core CPU (2015)

A9 image credit: Chipworks (obtained via Anandtech)


http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/3 Stanford CS348V, Winter 2018
SIMD processing

Stanford CS348V, Winter 2018


Add ALUs to increase compute capability

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs

SIMD processing: single instruction, multiple data
The same instruction is broadcast to all ALUs and executed in parallel on all ALUs.

(Core diagram: Fetch/Decode; ALU 0 … ALU 7; Execution Context)

Stanford CS348V, Winter 2018


Scalar program

Original compiled program: processes one array element using scalar instructions on scalar registers (e.g., 32-bit floats)

void sinx(int N, int terms, float* x, float* result)
{
   for (int i=0; i<N; i++)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}

Compiled instruction stream (per element):
ld r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st addr[r2], r0
Stanford CS348V, Winter 2018


Vector program (using AVX intrinsics)

#include <immintrin.h>

void sinx(int N, int terms, float* x, float* sinx)
{
   float three_fact = 6; // 3!
   for (int i=0; i<N; i+=8)
   {
      __m256 origx = _mm256_load_ps(&x[i]);
      __m256 value = origx;
      __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
      __m256 denom = _mm256_broadcast_ss(&three_fact);
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         // value += sign * numer / denom
         __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
         value = _mm256_add_ps(value, tmp);

         numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
         denom = _mm256_mul_ps(denom, _mm256_set1_ps((float)((2*j+2) * (2*j+3))));
         sign *= -1;
      }

      _mm256_store_ps(&sinx[i], value);
   }
}

Compiled program: processes eight array elements simultaneously using vector instructions on 256-bit vector registers.

vloadps xmm0, addr[r1]
vmulps xmm1, xmm0, xmm0
vmulps xmm1, xmm1, xmm0
...
vstoreps addr[xmm2], xmm0

Stanford CS348V, Winter 2018
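One practical note on this version: _mm256_load_ps / _mm256_store_ps assume 32-byte-aligned addresses, and the loop assumes N is a multiple of 8; general-purpose code would use the unaligned variants (_mm256_loadu_ps / _mm256_storeu_ps) or add a scalar loop for the remainder.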


16 SIMD cores: 128 elements in parallel


16 cores, 128 ALUs, 16 simultaneous instruction streams


Stanford CS348V, Winter 2018
Data-parallel expression
(in Kayvon’s fictitious data-parallel language)

void sinx(int N, int terms, float* x, float* result)
{
   // declare independent loop iterations
   forall (int i from 0 to N-1)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}

The compiler understands that loop iterations are independent and that the same loop body will be executed on a large number of data elements.

This abstraction facilitates automatic generation of both multi-core parallel code and vector instructions that make use of SIMD processing capabilities within a core.

Stanford CS348V, Winter 2018


What about conditional execution?

(Assume the logic below is to be executed for each element in input array ‘A’, producing output into the array ‘result’. Time in clocks runs down the page; one column per ALU lane, 1 … 8.)

<unconditional code>
float x = A[i];

if (x > 0) {
   float tmp = exp(x, 5.f);
   tmp *= kMyConst1;
   x = tmp + kMyConst2;
} else {
   float tmp = kMyConst1;
   x = 2.f * tmp;
}

<resume unconditional code>
result[i] = x;

Stanford CS348V, Winter 2018


What about conditional execution? (continued)
Each ALU lane evaluates the branch condition for its own element; in this example the eight lanes produce the mask T T F T F F F F for (x > 0).

Mask (discard) output of ALU
While the “if” clause executes, lanes whose mask is F have their results discarded; the roles flip for the “else” clause.
Not all ALUs do useful work!
Worst case: 1/8 peak performance

After branch: continue at full performance
Once both sides of the branch have executed, all lanes rejoin and run the unconditional code at full rate.

Stanford CS348V, Winter 2018
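Below is a minimal sketch (not from the lecture) of how this if/else could be expressed with explicit SIMD masking using AVX intrinsics: both sides of the branch are computed for all eight lanes, and a per-lane mask selects which result to keep. The call exp(x, 5.f) from the slide is replaced by a simple x*x stand-in so the sketch stays self-contained; kMyConst1 and kMyConst2 are passed as parameters.

#include <immintrin.h>

// Illustrative sketch of masked SIMD execution (assumes N is a multiple of 8).
void branch_example(int N, const float* A, float* result,
                    float kMyConst1, float kMyConst2)
{
    for (int i = 0; i < N; i += 8) {
        __m256 x    = _mm256_loadu_ps(&A[i]);
        __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ); // per-lane (x > 0)

        // "if" side (stand-in for exp(x, 5.f): x*x), then scale and offset
        __m256 tmp_if = _mm256_mul_ps(x, x);
        tmp_if        = _mm256_mul_ps(tmp_if, _mm256_set1_ps(kMyConst1));
        __m256 x_if   = _mm256_add_ps(tmp_if, _mm256_set1_ps(kMyConst2));

        // "else" side: x = 2 * kMyConst1
        __m256 x_else = _mm256_set1_ps(2.f * kMyConst1);

        // keep the "if" result where the mask is true, the "else" result elsewhere
        x = _mm256_blendv_ps(x_else, x_if, mask);
        _mm256_storeu_ps(&result[i], x);
    }
}

Note that, just as on the slide, both clauses cost execution time for every lane; the mask only controls which results survive.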


Example: eight-core Intel Xeon E5-1660 v4

8 cores
8 SIMD ALUs per core
(AVX2 instructions)

490 GFLOPs (@3.2 GHz)


(140 Watts)

* Showing only AVX math units, and fetch/decode unit for AVX (additional capability for integer math)
Stanford CS348V, Winter 2018
Example: NVIDIA GTX 1080 GPU

20 cores (“SMs”)
128 SIMD ALUs per core (@1.6 GHz) = 8.1 TFLOPs (180 Watts) Stanford CS348V, Winter 2018
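(Where that number comes from, assuming each ALU retires one fused multiply-add, i.e., 2 flops, per clock: 20 SMs × 128 ALUs × 2 flops × 1.6 GHz ≈ 8.2 TFLOPs, in line with the 8.1 TFLOPs figure.)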
Part 2:
accessing memory

Memory

Stanford CS348V, Winter 2018


Hardware multi-threading

Stanford CS348V, Winter 2018


Terminology
▪ Memory latency
- The amount of time for a memory request (e.g., load, store) from a
processor to be serviced by the memory system
- Example: 100 cycles, 100 nsec

▪ Memory bandwidth
- The rate at which the memory system can provide data to a processor
- Example: 20 GB/s

Stanford CS348V, Winter 2018


Stalls
▪ A processor “stalls” when it cannot run the next instruction in
an instruction stream because of a dependency on a previous
instruction.

▪ Accessing memory is a major source of stalls


ld  r0 mem[r2]
ld  r1 mem[r3]
add r0, r0, r1

Dependency: the ‘add’ instruction cannot execute until the data at mem[r2] and mem[r3] has been loaded from memory.

▪ Memory access times ~ 100’s of cycles


- Memory “access time” is a measure of latency

Stanford CS348V, Winter 2018


Review: why do modern processors have caches?

Core 1: L1 cache (32 KB) → L2 cache (256 KB)
   ...
Core N: L1 cache (32 KB) → L2 cache (256 KB)
All cores share an L3 cache (8 MB), connected over a 25 GB/sec bus to memory (DDR3 DRAM, gigabytes).
Stanford CS348V, Winter 2018


Caches reduce length of stalls (reduce latency)
Processors run efficiently when data is resident in caches.
Caches reduce memory access latency. *

(Same memory hierarchy as the previous slide: per-core L1 (32 KB) and L2 (256 KB) caches, a shared L3 (8 MB), and a 25 GB/sec bus to DDR3 DRAM.)

* Caches also provide high bandwidth data transfer to the CPU.

Stanford CS348V, Winter 2018
Prefetching reduces stalls (hides latency)
▪ All modern CPUs have logic for prefetching data into caches
  - Dynamically analyze the program’s access patterns, predict what it will access soon
▪ Reduces stalls since data is resident in cache when accessed

predict value of r2, initiate load
predict value of r3, initiate load
...
data arrives in cache
...
data arrives in cache
...
ld  r0 mem[r2]      ← these loads are cache hits
ld  r1 mem[r3]
add r0, r0, r1

Note: prefetching can also reduce performance if the guess is wrong
(hogs bandwidth, pollutes caches)
(more detail later in the course)

Stanford CS348V, Winter 2018
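As an aside, the same idea can be expressed explicitly in software. Here is a minimal sketch (not from the lecture) using the x86 _mm_prefetch intrinsic, with a prefetch distance of 16 elements chosen arbitrarily for illustration:

#include <immintrin.h>

// While processing element i, request the cache line holding element i+16 so it
// (hopefully) arrives before it is needed. Hardware prefetchers do this
// automatically for simple streaming patterns like this one.
float sum_with_prefetch(int N, const float* a)
{
    float sum = 0.f;
    for (int i = 0; i < N; i++) {
        if (i + 16 < N)
            _mm_prefetch((const char*)&a[i + 16], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}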


Multi-threading reduces stalls
▪ Idea: interleave processing of multiple threads on the same
core to hide stalls

▪ Like prefetching, multi-threading is a latency hiding, not a


latency reducing technique

Stanford CS348V, Winter 2018


Hiding stalls with multi-threading
Thread 1
Elements 0 … 7
Time

1 Core (1 thread)

Fetch/
Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

Exec Ctx

Stanford CS348V, Winter 2018


Hiding stalls with multi-threading
Thread 1 Thread 2 Thread 3 Thread 4
Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31
Time
1 2 3 4

1 Core (4 hardware threads)


Fetch/
Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

1 2

3 4

Stanford CS348V, Winter 2018


Hiding stalls with multi-threading
Thread 1 Thread 2 Thread 3 Thread 4
Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31
Time
1 2 3 4

1 Core (4 hardware threads)


Stall
Fetch/
Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

Runnable
1 2

3 4

Stanford CS348V, Winter 2018


Hiding stalls with multi-threading
Thread 1 Thread 2 Thread 3 Thread 4
Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31
Time
1 2 3 4

1 Core (4 hardware threads)


(Timeline: each thread alternates between executing, stalling, and waiting while runnable; when one thread stalls, the core switches to another runnable thread, until each thread completes: Done!)

(Core diagram: Fetch/Decode; ALU 0 … ALU 7; four thread contexts 1-4.)
Stanford CS348V, Winter 2018
Throughput computing trade-off
Thread 1 Thread 2 Thread 3 Thread 4
Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31
Time

Key idea of throughput-oriented systems:
Potentially increase the time to complete work by any one thread, in order to increase overall system throughput when running multiple threads.

(During the period marked “Runnable” in the diagram, the thread is runnable, but it is not being executed by the processor. The core is running some other thread. Eventually the thread completes: Done!)

Stanford CS348V, Winter 2018


Kayvon’s fictitious multi-core chip
16 cores
8 SIMD ALUs per core (128 total)
4 threads per core
16 simultaneous instruction streams
64 total concurrent instruction streams
512 independent pieces of work are needed to run the chip with maximal latency-hiding ability

Stanford CS348V, Winter 2018
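(Arithmetic check: 16 cores × 4 threads per core = 64 concurrent instruction streams, and each stream issues 8-wide SIMD instructions, so 64 × 8 = 512 independent data elements are needed to fill the machine.)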


GPUs: extreme throughput-oriented processors

NVIDIA GTX 1080 core (“SM”)
(Diagram: eight Fetch/Decode units; SIMD function units with control shared across 32 units, 1 MUL-ADD per clock; execution contexts (registers): 256 KB; “shared” memory: 96 KB)

▪ Instructions operate on 32 pieces of data at a time (instruction streams called “warps”).
▪ Think: warp = thread issuing 32-wide vector instructions
▪ Different instructions from up to four warps can be executed simultaneously (simultaneous multi-threading)
▪ Up to 64 warps are interleaved on the SM (interleaved multi-threading)
▪ Over 2,048 elements can be processed concurrently by a core

Source: NVIDIA Pascal Tuning Guide    Stanford CS348V, Winter 2018
NVIDIA GTX 1080

There are 20 SM cores on the GTX 1080:


That’s 40,960 pieces of data being processed concurrently to get maximal latency hiding!
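(That is 20 SMs × 2,048 concurrently-processed elements per SM = 40,960.)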
Stanford CS348V, Winter 2018
Another example:
for review and to check your understanding
(if you understand the following sequence you understand this lecture)

Stanford CS348V, Winter 2018


Running code on a simple processor

My very simple program: compute sin(x) using Taylor expansion
My very simple processor: completes one instruction per clock
(Fetch/Decode; ALU (Execute); Execution Context)

void sinx(int N, int terms, float* x, float* result)
{
   for (int i=0; i<N; i++)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}
Stanford CS348V, Winter 2018
Review: superscalar execution
Unmodified program (the same sinx code as on the previous slide)

My single-core, superscalar processor: executes up to two instructions per clock from a single instruction stream.
(Fetch/Decode ×2; Exec 1, Exec 2; Execution Context)

Independent operations in the instruction stream are detected by the processor at run-time and may be executed in parallel on execution units 1 and 2.
Stanford CS348V, Winter 2018
Review: multi-core execution (two cores)
Modify the program to create two threads of control (two instruction streams)

typedef struct {
   int N;
   int terms;
   float* x;
   float* result;
} my_args;

void* my_thread_start(void* thread_arg) {
   my_args* thread_args = (my_args*)thread_arg;
   sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result); // do work
   return NULL;
}

void parallel_sinx(int N, int terms, float* x, float* result) {
   pthread_t thread_id;
   my_args args;

   args.N = N/2;
   args.terms = terms;
   args.x = x;
   args.result = result;

   // launch thread
   pthread_create(&thread_id, NULL, my_thread_start, &args);
   sinx(N - args.N, terms, x + args.N, result + args.N); // do work
   pthread_join(thread_id, NULL);
}

My dual-core processor: executes one instruction per clock from an instruction stream on each core.
(Per core: Fetch/Decode; ALU (Execute); Execution Context)

Stanford CS348V, Winter 2018


Review: multi-core + superscalar execution
Same two-thread program as on the previous slide.

My superscalar dual-core processor: executes up to two instructions per clock from an instruction stream on each core.
(Per core: Fetch/Decode ×2; Exec 1, Exec 2; Execution Context)

Stanford CS348V, Winter 2018


Review: multi-core (four cores)
Modify the program to create many threads of control
(the forall-based sinx code written in Kayvon’s fictitious data-parallel language, shown earlier)

My quad-core processor: executes one instruction per clock from an instruction stream on each core.
(Per core: Fetch/Decode; ALU (Execute); Execution Context)

Stanford CS348V, Winter 2018


Review: four, 8-wide SIMD cores
Observation: the program must execute many iterations of the same loop body.
Optimization: share an instruction stream across the execution of multiple iterations (single instruction, multiple data = SIMD).
(Same forall-based sinx code as before.)

My SIMD quad-core processor: executes one 8-wide SIMD instruction per clock from an instruction stream on each core.
(Per core: Fetch/Decode; 8-wide SIMD execution; Execution Context)
Stanford CS348V, Winter 2018
Review: four SIMD, multi-threaded cores
Observation: memory operations (the load of x[i] and the store of result[i]) have very long latency.
Solution: hide the latency of loading data for one iteration by executing arithmetic instructions from other iterations.
(Same forall-based sinx code as before.)

My multi-threaded, SIMD quad-core processor: executes one SIMD instruction per clock from one instruction stream on each core, but can switch to processing the other instruction stream when faced with a stall.
(Per core: Fetch/Decode; 8-wide SIMD execution; multiple Execution Contexts)
Stanford CS348V, Winter 2018
Summary: four superscalar, SIMD, multi-threaded cores
My multi-threaded, superscalar, SIMD quad-core processor:
executes up to two instructions per clock from one instruction stream on each core
(in this example: one SIMD instruction + one scalar instruction).
Processor can switch to execute the other instruction stream when faced with stall.
(Per core: two Fetch/Decode units; Exec 1 and SIMD Exec 2; two Execution Contexts. Four such cores.)
Stanford CS348V, Winter 2018


Connecting it all together
Kayvon’s simple quad-core processor:
Four cores, two-way multi-threading per core (max eight threads active on chip at once), up to two
instructions per clock per core (one of those instructions is 8-wide SIMD)

(Diagram: four cores; per core: two Fetch/Decode units, Exec 1 and SIMD Exec 2, two Execution Contexts, an L1 Cache, and an L2 Cache. The cores connect via an on-chip interconnect to a shared L3 Cache and a Memory Controller, which drives the memory bus to DRAM.)

Stanford CS348V, Winter 2018


Thought experiment
▪ You write a C application that spawns two pthreads
▪ The application runs on the processor shown below
- Two cores, two execution contexts per core, up to two instructions per clock; one instruction is an 8-wide SIMD instruction.

▪ Question: “who” is responsible for mapping your pthreads to the


processor’s thread execution contexts?
Answer: the operating system

▪ Question: If you were the OS, how would you assign the two threads to the four available execution contexts?

▪ Another question: How would you assign threads to execution contexts if your C program spawned five pthreads?

(Processor diagram: two cores; per core: two Fetch/Decode units, Exec 1 and SIMD Exec 2, two Execution Contexts.)
Stanford CS348V, Winter 2018
Another thought experiment
Task: element-wise multiplication of two vectors A and B (A × B = C)
Assume the vectors contain millions of elements
- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Three memory operations (12 bytes) for every MUL
NVIDIA GTX 1080 GPU can do 2560 MULs per clock (@ 1.6 GHz)
Need ~50 TB/sec of bandwidth to keep the functional units busy (only have 320 GB/sec)

<1% GPU efficiency… but 4.2x faster than an eight-core CPU!
(A 3.2 GHz Xeon E5v4 eight-core CPU connected to a 76 GB/sec memory bus will exhibit ~3% efficiency on this computation.)
Stanford CS348V, Winter 2018
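(Where the ~50 TB/sec comes from: 2560 MULs/clock × 1.6 GHz ≈ 4.1 × 10^12 MULs/sec, and each MUL needs 12 bytes of memory traffic, giving ≈ 49 TB/sec of demand against the 320 GB/sec the GPU provides, roughly a 150x shortfall, hence the <1% efficiency.)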
Bandwidth limited!

If processors request data at too high a rate, the memory system cannot keep up.
No amount of latency hiding helps this.

Bandwidth is a critical resource.

Overcoming bandwidth limits is a common challenge for application developers on throughput-optimized systems.

Stanford CS348V, Winter 2018


Hardware specialization

Stanford CS348V, Winter 2018


Why does energy efficiency matter?
▪ General mobile processing rule: the longer a task runs the less power it can use
- Processor’s power consumption is limited by heat generated (efficiency is
required for more than just maximizing battery life)

Electrical limit: max power that can be supplied to chip

Die temp: (junction temp -- Tj): chip becomes unreliable above this temp
(chip can run at high power for short period of time until chip heats to Tj)

Case temp: mobile device gets too hot for user to comfortably hold
(chip is at suitable operating temp, but heat is dissipating into case)
Power

Battery life: chip and case are cool, but want to reduce power
consumption to sustain long battery life for given task

iPhone 6 battery: 7 watt-hours


9.7in iPad Pro battery: 28 watt-hours
15in Macbook Pro: 99 watt-hours

Time

Slide credit: adapted from an original slide by M. Shebanow, HPG 2013 keynote. Stanford CS348V, Winter 2018
Efficiency benefits of compute specialization
▪ Rules of thumb: compared to high-quality C code on CPU...

▪ Throughput-maximized processor architectures: e.g., GPU cores


- Approximately 10x improvement in perf / watt
- Assuming code maps well to wide data-parallel execution and is compute bound

▪ Fixed-function ASIC (“application-specific integrated circuit”)


- Can approach 100-1000x or greater improvement in perf/watt
- Assuming code is compute bound and is not floating-point math

[Source: Chung et al. 2010 , Dally 08] [Figure credit Eric Chung]
Stanford CS348V, Winter 2018
Hardware specialization increases efficiency

(Plots from Chung et al.: efficiency vs. lg2(N), the data set size, for FPGA, GPUs, and ASIC relative to a CPU.)

ASIC delivers the same performance as one CPU core with ~1/1000th the chip area.
GPU cores: ~5-7 times more area efficient than CPU cores.
ASIC delivers the same performance as one CPU core with only ~1/100th the power.
[Chung et al. MICRO 2010] Stanford CS348V, Winter 2018
Modern systems use ASICs for…
▪ Image/video encode/decode (e.g., H.264, JPG)
▪ Audio recording/playback
▪ Voice “wake up” (e.g., Ok Google)
▪ Camera “RAW” processing: processing data acquired by image
sensor into images that are pleasing to humans
▪ Many 3D graphics tasks (rasterization, texture mapping,
occlusion using the Z-buffer)
▪ Significant modern interest in ASICs for deep network
evaluation (e.g., Google’s Tensor Processing Unit)

Stanford CS348V, Winter 2018


Qualcomm Hexagon DSP
▪ Originally used for audio/LTE support on Qualcomm SoCs
▪ Multi-threaded, VLIW DSP
▪ Third major programmable unit on modern Qualcomm SoCs
  - Multi-core CPU
  - Multi-core GPU (Adreno)
  - Hexagon DSP

VLIW: area & power efficient multi-issue
• Variable sized instruction packets (1 to 4 instructions per packet)
• Dual 64-bit execution units
• Standard 8/16/32/64-bit data types
• SIMD vectorized MPY / ALU / SHIFT, Permute, BitOps
• Up to 8 16-bit MAC/cycle
• 2 SP FMA/cycle
• Dual 64-bit load/store units (also 32-bit ALU)
• Unified 32x32-bit general register file (best for the compiler); no separate address register file or accumulator registers; per-thread register file

(Block diagram: Instruction Cache → Instruction Unit → two data units (load/store/ALU) and two execution units (64-bit vector) → Data Cache → L2 Cache / TCM → DDR memory)

[Diagram source: Qualcomm Technologies, Inc.]

Stanford CS348V, Winter 2018
Summary: choosing the right tool for the job

Energy-optimized CPU → Throughput-oriented processor (GPU) → Programmable DSP → FPGA / future reconfigurable logic → ASIC

Efficiency (rules of thumb): GPU ~10x more efficient than the CPU; FPGA ~100x??? (jury still out); ASIC ~100-1000x more efficient (video encode/decode, audio playback, camera RAW processing, neural nets (future?)).

Programmability: the energy-optimized CPU is easiest to program; the middle of the spectrum is difficult to program (making it easier is an active area of research); an ASIC is not programmable and costs 10-100’s of millions of dollars to design / verify / create.

Credit: Pat Hanrahan for this taxonomy

Stanford CS348V, Winter 2018
Data movement has high energy cost
▪ Rule of thumb in mobile system design: always seek to reduce amount of
data transferred from memory
- Earlier in class we discussed minimizing communication to reduce stalls (poor performance).
Now, we wish to reduce communication to reduce energy consumption

▪ “Ballpark” numbers [Sources: Bill Dally (NVIDIA), Tom Olson (ARM)]
  - Integer op: ~1 pJ *
  - Floating point op: ~20 pJ *
  - Reading 64 bits from small local SRAM (1mm away on chip): ~26 pJ
  - Reading 64 bits from low power mobile DRAM (LPDDR): ~1200 pJ
  Suggests that recomputing values, rather than storing and reloading them, is a better answer when optimizing code for energy efficiency!

▪ Implications
  - Reading 10 GB/sec from memory: ~1.6 watts
  - Entire power budget for mobile GPU: ~1 watt (remember the phone is also running the CPU, display, radios, etc.)
  - iPhone 6 battery: ~7 watt-hours (note: my Macbook Pro laptop: 99 watt-hour battery)
  - Exploiting locality matters!!!

* Cost to just perform the logical operation, not counting overhead of instruction decode, load data from registers, etc.

Stanford CS348V, Winter 2018
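(Sanity check on the ~1.6 watt figure: 10 GB/sec ÷ 8 bytes per 64-bit read ≈ 1.25 × 10^9 reads/sec; at ~1200 pJ per read that is ~1.5 watts, in line with the slide’s estimate.)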
Welcome to cs348v!
▪ Make sure you are signed up on Piazza so you get
announcements

▪ Tonight’s reading:
- “The Compute Architecture of Intel Processor Graphics Gen9” - Intel Technical
Report, 2015
- “The Rise of Mobile Visual Computing Systems”, Fatahalian, IEEE Mobile
Computing 2016

Stanford CS348V, Winter 2018


More review

Stanford CS348V, Winter 2018


For the rest of this class, know these terms
▪ Multi-core processor
▪ SIMD execution
▪ Coherent control flow
▪ Hardware multi-threading
- Interleaved multi-threading
- Simultaneous multi-threading
▪ Memory latency
▪ Memory bandwidth
▪ Bandwidth bound application
▪ Arithmetic intensity

Stanford CS348V, Winter 2018


Which program performs better?
(Note: an answer probably needs to state its assumptions.)

Program 1
void add(int n, float* A, float* B, float* C) {
   for (int i=0; i<n; i++)
      C[i] = A[i] + B[i];
}

void mul(int n, float* A, float* B, float* C) {
   for (int i=0; i<n; i++)
      C[i] = A[i] * B[i];
}

float *A, *B, *C, *D, *E, *tmp1, *tmp2;

// assume arrays are allocated here

// compute E = D + ((A + B) * C)
add(n, A, B, tmp1);
mul(n, tmp1, C, tmp2);
add(n, tmp2, D, E);

Program 2
void fused(int n, float* A, float* B, float* C, float* D, float* E) {
   for (int i=0; i<n; i++)
      E[i] = D[i] + (A[i] + B[i]) * C[i];
}

// compute E = D + (A + B) * C
fused(n, A, B, C, D, E);

Stanford CS348V, Winter 2018


More thought questions
(Same Program 1 and Program 2 as on the previous slide.)

Which code structuring style would you rather write?

Consider running either of these programs: would CPU support for hardware multi-threading help performance?

Stanford CS348V, Winter 2018


Visualizing interleaved and simultaneous
multi-threading
(and combinations thereof)

Stanford CS348V, Winter 2018


Interleaved multi-threading
Consider a processor with:
▪ Two execution contexts
▪ One fetch and decode unit (one instruction per clock)
▪ One ALU (to execute the instruction)
time (clocks)

Thread 0

Thread 1
= ALU executing T0 at this time
= ALU executing T1 at this time

In an interleaved multi-threading scenario, the threads share the processor.


(This is a visualization of when threads are having their instructions executed by the ALU.)
Stanford CS348V, Winter 2018
Interleaved multi-threading
Consider a processor with:
▪ Two execution contexts
▪ One fetch and decode unit (one instruction per clock)
▪ One ALU (to execute the instruction)

time (clocks)

Thread 0

Thread 1
= ALU executing T0 at this time
= ALU executing T1 at this time

Same as previous slide, but now just a different scheduling order of the threads
(fine-grained interleaving)

Stanford CS348V, Winter 2018


Simultaneous multi-threading
Consider a processor with:
▪ Two execution contexts
▪ Two fetch and decode units (two instructions per clock)
▪ Two ALUs (to execute the two instructions)
time (clocks)

Thread 0

Thread 1
= ALU executing T0 at this time
= ALU executing T1 at this time

In a simultaneous multi-threading scenario, the threads execute simultaneously on the two ALUs. (Note: no ILP within a thread, since each thread runs sequentially on one ALU.)

Stanford CS348V, Winter 2018


Combining simultaneous and interleaved multi-threading
Consider a processor with:
▪ Four execution contexts
▪ Two fetch and decode units (two instructions per clock, choose two of four threads)
▪ Two ALUs (to execute the two instructions)
time (clocks)

Thread 0

Thread 1

Thread 2

Thread 3

= some ALU executing T0 at this time = some ALU executing T2 at this time
= some ALU executing T1 at this time = some ALU executing T3 at this time
Stanford CS348V, Winter 2018
Another way to visualize execution (ALU-centric view)
Consider a processor with:
▪ Four execution contexts
▪ Two fetch and decode units (two instructions per clock, choose two of four threads)
▪ Two ALUs (to execute the two instructions)

Now the graph is visualizing what each ALU is doing each clock:
time (clocks)

ALU 0

ALU 1

= executing T0 at this time = executing T2 at this time


= executing T1 at this time = executing T3 at this time

Stanford CS348V, Winter 2018


Instructions can be drawn from same thread (ILP)
Consider a processor with:
▪ Four execution contexts
▪ Two fetch and decode units (two instructions per clock, choose any two
independent instructions from the four threads)
▪ Two ALUs (to execute the two instructions)

time (clocks)

ALU 0

ALU 1

Two instructions from same thread executing simultaneously.

= executing T0 at this time = executing T2 at this time


= executing T1 at this time = executing T3 at this time
Stanford CS348V, Winter 2018
