CIS 501: Computer Architecture
Unit 11: Data-Level Parallelism: Vectors & GPUs

Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania
with sources that included University of Wisconsin slides
by Mark Hill, Guri Sohi, Jim Smith, and David Wood

CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs

How to Compute This Fast?
• Performing the same operations on many data items
• Example: SAXPY

  for (I = 0; I < 1024; I++) {
      Z[I] = A*X[I] + Y[I];
  }

  L1: ldf [X+r1]->f1     // I is in r1
      mulf f0,f1->f2     // A is in f0
      ldf [Y+r1]->f3
      addf f2,f3->f4
      stf f4->[Z+r1]
      addi r1,4->r1
      blti r1,4096,L1

• Instruction-level parallelism (ILP) - fine grained
  • Loop unrolling with static scheduling –or– dynamic scheduling
  • Wide-issue superscalar (non-)scaling limits benefits
• Thread-level parallelism (TLP) - coarse grained
  • Multicore
• Can we do some “medium grained” parallelism?
Data-Level Parallelism
• Data-level parallelism (DLP)
  • Single operation repeated on multiple data elements
    • SIMD (Single-Instruction, Multiple-Data)
  • Less general than ILP: parallel insns are all same operation
  • Exploit with vectors
• Old idea: Cray-1 supercomputer from late 1970s
  • Eight 64-entry x 64-bit floating point “vector registers”
    • 4096 bits (0.5KB) in each register! 4KB for vector register file
  • Special vector instructions to perform vector operations
    • Load vector, store vector (wide memory operation)
    • Vector+Vector or Vector+Scalar
      • addition, subtraction, multiply, etc.
  • In Cray-1, each instruction specifies 64 operations!
    • ALUs were expensive, so one operation per cycle (not parallel)

Example Vector ISA Extensions (SIMD)
• Extend ISA with floating point (FP) vector storage …
  • Vector register: fixed-size array of 32- or 64-bit FP elements
  • Vector length: For example: 4, 8, 16, 64, …
• … and example operations for vector length of 4
  • Load vector: ldf.v [X+r1]->v1
      ldf [X+r1+0]->v1_0
      ldf [X+r1+1]->v1_1
      ldf [X+r1+2]->v1_2
      ldf [X+r1+3]->v1_3
  • Add two vectors: addf.vv v1,v2->v3
      addf v1_i,v2_i->v3_i (where i is 0,1,2,3)
  • Add vector to scalar: addf.vs v1,f2->v3
      addf v1_i,f2->v3_i (where i is 0,1,2,3)
• Today’s vectors: short (128 or 256 bits), but fully parallel
Example Use of Vectors – 4-wide

  ldf [X+r1]->f1          ldf.v [X+r1]->v1
  mulf f0,f1->f2          mulf.vs v1,f0->v2
  ldf [Y+r1]->f3          ldf.v [Y+r1]->v3
  addf f2,f3->f4          addf.vv v2,v3->v4
  stf f4->[Z+r1]          stf.v v4,[Z+r1]
  addi r1,4->r1           addi r1,16->r1
  blti r1,4096,L1         blti r1,4096,L1
  7x1024 instructions     7x256 instructions

• Operations (4x fewer instructions)
  • Load vector: ldf.v [X+r1]->v1
  • Multiply vector to scalar: mulf.vs v1,f2->v3
  • Add two vectors: addf.vv v1,v2->v3
  • Store vector: stf.v v1->[X+r1]
• Performance?
  • Best case: 4x speedup
  • But, vector instructions don’t always have single-cycle throughput
    • Execution width (implementation) vs vector width (ISA)

Vector Datapath & Implementation
• Vector insns. are just like normal insns… only “wider”
  • Single instruction fetch (no extra N² checks)
  • Wide register read & write (not multiple ports)
  • Wide execute: replicate floating point unit (same as superscalar)
  • Wide bypass (avoid N² bypass problem)
  • Wide cache read & write (single cache tag check)
• Execution width (implementation) vs vector width (ISA)
  • Example: Pentium 4 and “Core 1” execute vector ops at half width
  • “Core 2” executes them at full width
• Because they are just instructions…
  • …superscalar execution of vector instructions
  • Multiple n-wide vector instructions per cycle
Intel’s SSE2/SSE3/SSE4/AVX…
• Intel SSE2 (Streaming SIMD Extensions 2) - 2001
  • 16 128-bit floating point registers (xmm0–xmm15)
  • Each can be treated as 2x64b FP or 4x32b FP (“packed FP”)
  • Or 2x64b or 4x32b or 8x16b or 16x8b ints (“packed integer”)
  • Or 1x64b or 1x32b FP (just normal scalar floating point)
  • Original SSE: only 8 registers, no packed integer support
• Other vector extensions
  • AMD 3DNow!: 64b (2x32b)
  • PowerPC AltiVEC/VMX: 128b (2x64b or 4x32b)
• Looking forward for x86
  • Intel’s “Sandy Bridge” brings 256-bit vectors to x86
  • Intel’s “Xeon Phi” multicore will bring 512-bit vectors to x86

Other Vector Instructions
• These target specific domains: e.g., image processing, crypto
  • Vector reduction (sum all elements of a vector)
  • Geometry processing: 4x4 translation/rotation matrices
  • Saturating (non-overflowing) subword add/sub: image processing
  • Byte asymmetric operations: blending and composition in graphics
  • Byte shuffle/permute: crypto
  • Population (bit) count: crypto
  • Max/min/argmax/argmin: video codec
  • Absolute differences: video codec
  • Multiply-accumulate: digital-signal processing
  • Special instructions for AES encryption
• More advanced (but in Intel’s Xeon Phi)
  • Scatter/gather loads: indirect store (or load) from a vector of pointers
  • Vector mask: predication (conditional execution) of specific elements
Using Vectors in Your Code
• Write in assembly
  • Ugh
• Use “intrinsic” functions and data types
  • For example: _mm_mul_ps() and “__m128” datatype
• Use vector data types
  • typedef double v2df __attribute__ ((vector_size (16)));
• Use a library someone else wrote
  • Let them do the hard work
  • Matrix and linear algebra packages
• Let the compiler do it (automatic vectorization, with feedback)
  • GCC’s “-ftree-vectorize” option, -ftree-vectorizer-verbose=n
  • Limited impact for C/C++ code (old, hard problem)

Recap: Vectors for Exploiting DLP
• Vectors are an efficient way of capturing parallelism
  • Data-level parallelism
  • Avoid the N² problems of superscalar
  • Avoid the difficult fetch problem of superscalar
  • Area efficient, power efficient
• The catch?
  • Need code that is “vector-izable”
  • Need to modify program (unlike dynamic-scheduled superscalar)
  • Requires some help from the programmer
• Looking forward: Intel “Xeon Phi” (aka Larrabee) vectors
  • More flexible (vector “masks”, scatter, gather) and wider
  • Should be easier to exploit, more bang for the buck
Graphics Processing Units (GPU)
• Killer app for parallelism: graphics (3D games)
  [Tesla S870]

GPUs and SIMD/Vector Data Parallelism
• How do GPUs have such high peak FLOPS & FLOPS/Joule?
  • Exploit massive data parallelism – focus on total throughput
  • Remove hardware structures that accelerate single threads
  • Specialized for graphics: e.g., data-types & dedicated texture units
• “SIMT” execution model
  • Single instruction multiple threads
  • Similar to both “vectors” and “SIMD”
  • A key difference: better support for conditional control flow
• Program it with CUDA or OpenCL
  • Extensions to C
  • Perform a “shader task” (a snippet of scalar computation) over many elements
  • Internally, GPU uses scatter/gather and vector mask operations
Slides 13–20 by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
Data Parallelism Summary
• Data Level Parallelism
• “medium-grained” parallelism between ILP and TLP
• Still one flow of execution (unlike TLP)
• Compiler/programmer must explicitly express it (unlike ILP)
• Hardware support: new “wide” instructions (SIMD)
• Wide registers, perform multiple operations in parallel
• Trends
• Wider: 64-bit (MMX, 1996), 128-bit (SSE2, 2000),
256-bit (AVX, 2011), 512-bit (Xeon Phi, 2012?)
• More advanced and specialized instructions
• GPUs
• Embrace data parallelism via “SIMT” execution model
• Becoming more programmable all the time
• Today’s chips exploit parallelism at all levels: ILP, DLP, TLP