CIS 501: Computer Architecture
Unit 11: Data-Level Parallelism: Vectors & GPUs

Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania
with sources that included University of Wisconsin slides
by Mark Hill, Guri Sohi, Jim Smith, and David Wood

CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs

How to Compute This Fast?
• Performing the same operations on many data items
• Example: SAXPY

  for (I = 0; I < 1024; I++) {
      Z[I] = A*X[I] + Y[I];
  }

  L1: ldf [X+r1]->f1     // I is in r1
      mulf f0,f1->f2     // A is in f0
      ldf [Y+r1]->f3
      addf f2,f3->f4
      stf f4->[Z+r1]
      addi r1,4->r1
      blti r1,4096,L1

• Instruction-level parallelism (ILP) - fine grained
  • Loop unrolling with static scheduling –or– dynamic scheduling
  • Wide-issue superscalar (non-)scaling limits benefits
• Thread-level parallelism (TLP) - coarse grained
  • Multicore
• Can we do some “medium grained” parallelism?
Data-Level Parallelism
• Data-level parallelism (DLP)
  • Single operation repeated on multiple data elements
    • SIMD (Single-Instruction, Multiple-Data)
  • Less general than ILP: parallel insns are all same operation
  • Exploit with vectors
• Old idea: Cray-1 supercomputer from late 1970s
  • Eight 64-entry x 64-bit floating point “vector registers”
    • 4096 bits (0.5KB) in each register! 4KB for vector register file
  • Special vector instructions to perform vector operations
    • Load vector, store vector (wide memory operation)
    • Vector+Vector or Vector+Scalar
      • addition, subtraction, multiply, etc.
  • In Cray-1, each instruction specifies 64 operations!
    • ALUs were expensive, so one operation per cycle (not parallel)

Example Vector ISA Extensions (SIMD)
• Extend ISA with floating point (FP) vector storage …
  • Vector register: fixed-size array of 32- or 64-bit FP elements
  • Vector length: For example: 4, 8, 16, 64, …
• … and example operations for vector length of 4
  • Load vector: ldf.v [X+r1]->v1
      ldf [X+r1+0]->v1_0
      ldf [X+r1+1]->v1_1
      ldf [X+r1+2]->v1_2
      ldf [X+r1+3]->v1_3
  • Add two vectors: addf.vv v1,v2->v3
      addf v1_i,v2_i->v3_i (where i is 0,1,2,3)
  • Add vector to scalar: addf.vs v1,f2->v3
      addf v1_i,f2->v3_i (where i is 0,1,2,3)
• Today’s vectors: short (128 or 256 bits), but fully parallel
Example Use of Vectors – 4-wide

  ldf [X+r1]->f1          ldf.v [X+r1]->v1
  mulf f0,f1->f2          mulf.vs v1,f0->v2
  ldf [Y+r1]->f3          ldf.v [Y+r1]->v3
  addf f2,f3->f4          addf.vv v2,v3->v4
  stf f4->[Z+r1]          stf.v v4,[Z+r1]
  addi r1,4->r1           addi r1,16->r1
  blti r1,4096,L1         blti r1,4096,L1
  7x1024 instructions     7x256 instructions

• Operations (4x fewer instructions)
  • Load vector: ldf.v [X+r1]->v1
  • Multiply vector to scalar: mulf.vs v1,f2->v3
  • Add two vectors: addf.vv v1,v2->v3
  • Store vector: stf.v v1->[X+r1]
• Performance?
  • Best case: 4x speedup
  • But, vector instructions don’t always have single-cycle throughput
    • Execution width (implementation) vs vector width (ISA)

Vector Datapath & Implementation
• Vector insns. are just like normal insns… only “wider”
  • Single instruction fetch (no extra N² checks)
  • Wide register read & write (not multiple ports)
  • Wide execute: replicate floating point unit (same as superscalar)
  • Wide bypass (avoid N² bypass problem)
  • Wide cache read & write (single cache tag check)
• Execution width (implementation) vs vector width (ISA)
  • Example: Pentium 4 and “Core 1” execute vector ops at half width
  • “Core 2” executes them at full width
• Because they are just instructions…
  • …superscalar execution of vector instructions
  • Multiple n-wide vector instructions per cycle
Intel’s SSE2/SSE3/SSE4/AVX…
• Intel SSE2 (Streaming SIMD Extensions 2) - 2001
  • 16 128-bit floating point registers (xmm0–xmm15)
  • Each can be treated as 2x64b FP or 4x32b FP (“packed FP”)
  • Or 2x64b or 4x32b or 8x16b or 16x8b ints (“packed integer”)
  • Or 1x64b or 1x32b FP (just normal scalar floating point)
  • Original SSE: only 8 registers, no packed integer support
• Other vector extensions
  • AMD 3DNow!: 64b (2x32b)
  • PowerPC AltiVEC/VMX: 128b (2x64b or 4x32b)
• Looking forward for x86
  • Intel’s “Sandy Bridge” brings 256-bit vectors to x86
  • Intel’s “Xeon Phi” multicore will bring 512-bit vectors to x86

Other Vector Instructions
• These target specific domains: e.g., image processing, crypto
  • Vector reduction (sum all elements of a vector)
  • Geometry processing: 4x4 translation/rotation matrices
  • Saturating (non-overflowing) subword add/sub: image processing
  • Byte asymmetric operations: blending and composition in graphics
  • Byte shuffle/permute: crypto
  • Population (bit) count: crypto
  • Max/min/argmax/argmin: video codec
  • Absolute differences: video codec
  • Multiply-accumulate: digital-signal processing
  • Special instructions for AES encryption
• More advanced (but in Intel’s Xeon Phi)
  • Scatter/gather loads: indirect store (or load) from a vector of pointers
  • Vector mask: predication (conditional execution) of specific elements
Using Vectors in Your Code
• Write in assembly
  • Ugh
• Use “intrinsic” functions and data types
  • For example: _mm_mul_ps() and “__m128” datatype
• Use vector data types
  • typedef double v2df __attribute__ ((vector_size (16)));
• Use a library someone else wrote
  • Let them do the hard work
  • Matrix and linear algebra packages
• Let the compiler do it (automatic vectorization, with feedback)
  • GCC’s “-ftree-vectorize” option, -ftree-vectorizer-verbose=n
  • Limited impact for C/C++ code (old, hard problem)

Recap: Vectors for Exploiting DLP
• Vectors are an efficient way of capturing parallelism
  • Data-level parallelism
  • Avoid the N² problems of superscalar
  • Avoid the difficult fetch problem of superscalar
  • Area efficient, power efficient
• The catch?
  • Need code that is “vector-izable”
  • Need to modify program (unlike dynamic-scheduled superscalar)
  • Requires some help from the programmer
• Looking forward: Intel “Xeon Phi” (aka Larrabee) vectors
  • More flexible (vector “masks”, scatter, gather) and wider
  • Should be easier to exploit, more bang for the buck
Graphics Processing Units (GPU)
• Killer app for parallelism: graphics (3D games)
  [Tesla S870]

GPUs and SIMD/Vector Data Parallelism
• How do GPUs have such high peak FLOPS & FLOPS/Joule?
  • Exploit massive data parallelism – focus on total throughput
  • Remove hardware structures that accelerate single threads
  • Specialized for graphics: e.g., data-types & dedicated texture units
• “SIMT” execution model
  • Single instruction multiple threads
  • Similar to both “vectors” and “SIMD”
  • A key difference: better support for conditional control flow
• Program it with CUDA or OpenCL
  • Extensions to C
  • Perform a “shader task” (a snippet of scalar computation) over many elements
  • Internally, GPU uses scatter/gather and vector mask operations
Slides 13–20 by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
Data Parallelism Summary
• Data Level Parallelism
• “medium-grained” parallelism between ILP and TLP
• Still one flow of execution (unlike TLP)
• Compiler/programmer must explicitly express it (unlike ILP)
• Hardware support: new “wide” instructions (SIMD)
• Wide registers, perform multiple operations in parallel
• Trends
• Wider: 64-bit (MMX, 1996), 128-bit (SSE2, 2000),
256-bit (AVX, 2011), 512-bit (Xeon Phi, 2012?)
• More advanced and specialized instructions
• GPUs
• Embrace data parallelism via “SIMT” execution model
• Becoming more programmable all the time
• Today’s chips exploit parallelism at all levels: ILP, DLP, TLP