AMD Gem5 APU Simulator Micro 2015 Final PDF


THE AMD gem5 APU SIMULATOR: MODELING HETEROGENEOUS SYSTEMS IN gem5
AMD RESEARCH
DECEMBER 6, 2015

OBJECTIVES AND SCOPE


Objectives
Introduce the Heterogeneous System Architecture (HSA) and AMD's GCN GPUs
Describe the gem5-based APU simulator

Scope
Emphasis on the GPU side of the simulator
APU (CPU+GPU) model, not discrete GPU

Limitations and comparison to other GPU simulators

Why are we releasing our code?


Encourage HSA-relevant (and AMD-relevant) research
Improve academic collaborations
Enable intern candidates to get experience before arriving
Enable interns to take their experience back to school

Acknowledgement
AMD Research's gem5 Team
2 | THE AMD gem5 APU SIMULATOR | DECEMBER 6, 2015 | MICRO 2015 TUTORIAL


ACKNOWLEDGEMENTS
MANY CONTRIBUTORS OVER THE PAST 5+ YEARS
Alex Dutu, David Roberts, Kunal Korgaonkar, Nagesh Lakshminarayana,
Ali Jafri, Derek Hower, Lisa Hsu, Nilay Vaish,
Arka Basu, Dmitri Yudanov, Manish Arora, Onur Kayiran,
Ayse Yilmazer, Eric Van Tassell, Marc Orr, Si Li,
Binh Pham, Gagan Sachdev, Mario Mendez-Lojo, Sooraj Puthoor,
Blake Hechtman, Jason Power, Mark Leather, Steve Reinhardt,
Brad Beckmann, Joel Hestness, Mark Wilkening, Tanmay Gangwani,
Brandon Potter, Jieming Yin, Martin Brown, Tim Rogers,
Can Hankendi, John Alsop, Matt Poremba, Tony Gutierrez,
James Wang, Joe Gross, Mike Chu, Tushar Krishna,
David Hashe, John Kalamatianos, Myrto Papadopoulou, Monir Mozumder,
Yatin Manerkar, Yasuko Eckert

QUICK SURVEY
How many of you are:
Graduate students?
Faculty members?
Working for government research labs?
Working for industry?

Have you written a GPU program (CUDA, OpenCL™, other languages)?


Have you used the following simulators:
GPGPU-Sim?
Multi2Sim?
gem5?

Do you know anything about HSA?

OUTLINE
Topic                                  Presenter   Time
Background                             Brad        8:45-9:05
Compilation and Simulation flow        Tony        9:05-9:30
GPU Core Model                         Tony        9:30-10:00
Break                                              10:00-10:30
Ruby Memory Contributions              Brad        10:30-11:00
Demo                                   Tony        11:00-11:20
Comparisons/Limitations/Future Work    Brad        11:20-11:45
Questions                              Both        11:45-12:00

BACKGROUND
Terminology and system overview
HSA Features
Coherent shared virtual memory
HSAIL: HSA Intermediate Language

HSA software stack


GPU TERMINOLOGY
[Figure: a GPU with four GPU cores, each with its own L1D, sharing one GPU I-Cache and one L2. Each block carries its generic name and its AMD name.]

Generic name    AMD terminology
GPU I-Cache     SQC
GPU Core        CU
L1D             TCP
L2              TCC

CU: Compute Unit (SM in Nvidia terminology), TCP: Texture Cache per Pipe,
TCC: Texture Cache per Channel, SQC: Sequencer Cache

EXAMPLE APU SYSTEM
GPU + CPU CORE-PAIR WITH A SHARED DIRECTORY
[Figure: block diagram of the example APU.
 GPU: four GPU cores, each with an L1D, sharing the Sequencer Cache (SQC) and an L2.
 CPU: core pair CPU0 and CPU1, each with an L1D, sharing the CPU I-Cache and an L2.
 Both L2s connect to a shared Directory, which connects to the Memory Controller and Memory.]

TRADITIONAL DISCRETE GPU
[Figure: CPUs 1..N attached to Coherent System Memory and CUs 1..M attached to GPU Memory, connected over PCIe.]
Separate memory
Separate addr space
No pointer-based data structures
Explicit data copying across PCIe
  High latency
  Low bandwidth
Need lots of compute on GPU to amortize copy overhead
Very limited GPU memory capacity

hUMA UNIFIED MEMORY
[Figure: CPUs 1..N and CUs 1..M attached to a single Unified Coherent Memory.]
Unified address space
  GPU uses user virtual addrs
  Pointer-based data structures shared across CPU & GPU
Fully coherent
  No explicit copying
  Data move on demand
Pageable virtual addresses
  No GPU capacity constraints

HSAIL: THE HSA INTERMEDIATE LANGUAGE


A portable virtual ISA for vendor-independent compilation and distribution
Like Java bytecode for GPUs, similar to Nvidia PTX

Generated by a language-specific compiler (LLVM, GCC, Javac, etc.)


Application binaries may ship with embedded HSAIL (text) or BRIG (binary)

Low-level IR, close to machine ISA level


Most optimizations (including register allocation) performed before HSAIL

Compiled down to target ISA by a vendor-specific finalizer


Finalizer may execute at run time, install time, or build time


PROGRAMMING LANGUAGES PROLIFERATING ON HSA
High-level compilers target the HSA specification
  Examples: shared virtual memory, scoped synchronization
Runtime infrastructure provides the necessary software support to meet the specification
  Examples: address translation
  Works in concert with GPU hardware
The HSA hardware specifications are publicly available
  http://hsafoundation.com
The HSA software stack is open sourced
  http://github.com/HSAFoundation

[Figure: HSA software stack. Each application type sits on its own runtime or compiler; HSAIL and the lower layers are shared.]
Application    Runtime / compiler
OpenCL App     OpenCL Runtime
Java App       Java JVM (Sumatra)
C++ App        HCC: C++ for Heterogeneous Computing
Python App     Continuum's Numba Compiler
Shared layers: HSA Helper Libraries, HSA Core Runtime, HSA Finalizer, Kernel Fusion Driver (KFD)

HSA BUILDING BLOCKS
http://hsafoundation.com
http://github.com/HSAFoundation

HSA Hardware Building Blocks
  Shared Virtual Memory
    Single address space
    Coherent
    Pageable
    Fast access from all components
    Can share pointers
  Architected User-Level Queues
  Signals
  Platform Atomics
  Defined Memory Model
  Context Switching
  Industry standard, architected requirements for how devices share memory and communicate with each other

HSA Software Building Blocks
  HSAIL
    Portable, parallel, compiler IR
    Instruction definition
  HSA Runtime
    Create queues
    Allocate memory
    Device discovery
  Multiple high level compilers
    CLANG/LLVM/HSAIL
    C++, OpenMP, OpenACC, Python, OpenCL, etc.
  Industry standard compiler IR and runtime to enable existing programming languages to target the GPU

Specification documents shown on the slide
  HSA Platform System Arch Specification
  HSA Programmer's Reference Manual
  HSA System Runtime Specification

[Figure: several of the software components are tagged "OpenSource" on the slide.]

APU SIMULATION SUPPORT
[Figure: the HSA building blocks, color-coded by simulator support status.]
Legend: Included in first release; Work-in-progress / may be released; Longer term work

HSA Hardware Building Blocks
  Shared Virtual Memory
    Single address space
    Coherent
    Pageable
    Fast access from all components
    Can share pointers
  Architected User-Level Queues
  Signals
  Context Switching
  Platform Atomics
  Defined Memory Model
    Basic Acquire and Release operations
    No SC data
    Coarse-grain memory

HSA Software Building Blocks
  HSAIL
    Portable, parallel, compiler IR
    Instruction definition
  HSA Runtime (OpenCL™ Runtime)
    Create queues
    Allocate memory
    Device discovery
  Multiple high-level compilers
    CLANG/LLVM/HSAIL
    C++, OpenMP, OpenACC, Python, OpenCL, etc.

HSA TERMINOLOGY IN A NUTSHELL
HSA Programming Abstraction
  Light abstractions of parallel physical hardware
  Captures basic OpenCL constructs and much more

[Figure: the GPU architecture (a GPU containing GPU cores) shown next to the HSA model (an NDRange containing workgroups, wavefronts, and work-items).]

HSA model     CUDA equivalent
NDRange       Grid
Workgroup     Thread block
Wavefront     Warp
Work-item     Thread

NDRange: N-Dimensional (N = 1, 2, or 3) index space
  Partitioned into workgroups, wavefronts, and work-items
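
The partitioning of an NDRange can be illustrated with a short Python sketch. The function name partition_ndrange is hypothetical, and the default wavefront size of 64 work-items is an assumption taken from the GCN model described later in this tutorial; this is not simulator code.

```python
import math

def partition_ndrange(grid_size, wg_size, wf_size=64):
    """Count workgroups, wavefronts per workgroup, and work-items.

    grid_size and wg_size are tuples with 1 to 3 dimensions, in work-items.
    wf_size is an assumed wavefront width.
    """
    if len(grid_size) != len(wg_size) or not 1 <= len(grid_size) <= 3:
        raise ValueError("NDRange needs 1, 2, or 3 matching dimensions")
    # Workgroups per dimension; a partially filled workgroup still counts.
    wgs_per_dim = [math.ceil(g / w) for g, w in zip(grid_size, wg_size)]
    items_per_wg = math.prod(wg_size)
    return {
        "workgroups": math.prod(wgs_per_dim),
        # Each workgroup is split into wavefronts of wf_size work-items.
        "wavefronts_per_workgroup": math.ceil(items_per_wg / wf_size),
        "work_items": math.prod(grid_size),
    }

# A 1024 x 1024 NDRange with 16 x 16 workgroups.
result = partition_ndrange((1024, 1024), (16, 16))
```

With these inputs each 256-work-item workgroup splits into four wavefronts.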


OVERVIEW OF gem5
Open-source modular platform for system architecture research
Integration of M5 (Univ. of Michigan) and GEMS (Univ. of Wisconsin)
Actively used in academia and industry

Discrete-event simulation platform with numerous models


CPU models at various performance/accuracy trade-off points
Multiple ISAs: x86, ARM, Alpha, Power, SPARC, MIPS

Two memory system models: Ruby and classic (M5)


Including caches, DRAM controllers, interconnect, coherence protocols, etc.

I/O devices: disk, Ethernet, video, etc.


Full system or app-only (system-call emulation)

Cycle-level modeling (not cycle accurate)


Accurate enough to capture first-order performance effects
Flexible enough to allow prototyping new ideas reasonably quickly

See http://www.gem5.org

HIGH-LEVEL COMPILATION FLOW
CPU handles x86 host binary
HSA 1.0F OpenCL compilation flow generates kernels
  CL Offline Compiler (CLOC)
  "F" stands for final
  Combination of 4 different tools
  Available on GitHub
System call emulation simulation
  No OS or device driver
  Execution-driven evaluation
GPU microarchitecture model directly executes HSAIL instructions

[Figure: benchmarks are compiled by GCC into an x86 host binary and by the HSA 1.0F OpenCL compilation flow into GPU kernel binary objects. Inside the simulator, the emulated OpenCL, OS, and driver ops. and the HSAIL BRIG loader are labeled as APU extensions.]

HSA 1.0F COMPILATION FLOW
EXAMPLE MAKEFILE COMMANDS

# Step 1: frontend CL compiler converts CL to LLVM bytecode
clc2 --enable-hsail-extensions --cl-std=CL2.0 my_kernel.cl -o my_kernel.fe.bc

# Step 2: link HSAIL builtins (available with compiler) with bytecode
llvm-link my_kernel.fe.bc -l builtins-hsail.bc -o my_kernel.linked.bc

# Step 3: optimize bytecode for GPU
opt -O3 -gpu -whole -verify my_kernel.linked.bc -o my_kernel.opt.bc

# Step 4: compile optimized bytecode to HSAIL
llc -filetype=obj -hsail-enable-gcn=0 -march=hsail-64 my_kernel.opt.bc -o my_kernel.asm

# Step 5 (optional): disassemble BRIG-HSAIL binary into human-readable disassembly
hsailasm -disassemble my_kernel.asm -o my_kernel.dasm

Compiler toolchain available at:
https://github.com/HSAFoundation/CLOC

[Figure: HSA 1.0F OpenCL compilation flow. CL kernels -> CL compiler -> LLVM linker (with HSAIL builtins) -> optimizer -> low-level compiler -> BRIG HSAIL binary.]

EMULATED CL RUNTIME
Our implementation of OpenCL 2.0 runtime API
Simplifies OpenCL runtime for use with simulator
No OS kernel driver in SE mode, all driver calls captured by emulated driver
open()
Standard Unix system call for opening a device
Returns file descriptor for open device

ioctl()
(I/O control) standard Unix system call for sending commands to a device
Sends device-specific request codes, which are provided by the driver

Emulates kernel launch API, memory allocation, etc.


Maintains in-memory HSA tasks and launch doorbell visible to GPU HW
HSAIL is directly executed, no need to build/compile kernel through OpenCL

Available at: http://gem5.org/GPU_Models


HIGH-LEVEL APU SIMULATION FLOW
APPLICATION EXECUTION
Supports simultaneous execution of multiple kernels
Details
  GCN GPU model
  Flexible memory system
CPU-GPU communication via coherent shared virtual memory

[Figure: the x86 host binary and the GPU kernel binary objects feed the simulator. Blocks shown: emulated OpenCL, OS, and driver ops.; HSAIL BRIG loader; x86 CPUs; CUs; GPU cache hierarchy; Ruby. The slide highlights the AMD-added components.]

DETAILED VIEW OF KERNEL LAUNCH
[Figure: kernel launch path through gem5.]
Application
  Runs on the simulated CPU
  Calls clEnqueueNDRangeKernel()
Emulated OpenCL runtime
  memcpy(HsaQueueEntry, KernelObject);
  *Doorbell = 0; // MEM[Addr B] = 0;
  Reaches the emulated driver through open()/ioctl()
BRIG loader
  Reads kernels.asm (BRIG HSAIL kernel binary)
  Instructions are pre-decoded by the decoder
  HSA kernels are then stored in the kernel object
Simulated memory
  HsaQueueEntry at Addr A
  Doorbell at Addr B
Shader
  write(Addr B); reaches the dispatcher
  Dispatcher launches the kernel on the CUs

DETAILED VIEW OF KERNEL LAUNCH

// Doorbell mapped to GPU I/O space
volatile uint32_t *dispatcherDoorbell = (uint32_t*)0x10000000;

// Task mapped to GPU I/O space
HsaQueueEntry *hsaTaskPtr = (HsaQueueEntry*)0x10000008;

clEnqueueNDRangeKernel()
{
    /* Reads/writes to these data structures will trigger
       I/O reads/writes to dispatcher. Copies newly
       created kernel to GPU. */
    memcpy(hsaTaskPtr, &curHsaTask, sizeof(HsaQueueEntry));

    // Notify dispatcher that kernel copy is finished;
    // launch kernel.
    *dispatcherDoorbell = 0;
    return CL_SUCCESS;
}

Dispatcher I/O space mapped to 0x10000000-0x10000FFF

[Figure: GPU dispatcher holding an NDRangeMap (ID -> NDRange) and curTask. The memcpy copies the HSA task to the dispatcher; the doorbell write makes the dispatcher create and store a new NDRange and schedule workgroup dispatch.]

DISPATCHER WORKGROUP ASSIGNMENT
[Figure: the shader contains the GPU dispatcher and the CUs. The dispatcher holds the NDRangeMap (ID -> NDRange) and curTask. An NDRange is drawn as a grid of workgroups, wg(0, 0, 0), wg(1, 0, 0), wg(0, 1, 0), wg(1, 1, 0), each made of work-items.]
1) Try to dispatch queued WGs on every tick
2) Pick oldest NDRange in queue; if it has unexecuted WGs, try to schedule them on a CU
3) Dispatch WG to CU if there are enough WF slots, enough VGPRs, and enough LDS space.
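
The three dispatch steps can be sketched in Python as follows. This is a minimal illustration, not the simulator's dispatcher: the class names, field names, and the VGPR and LDS capacities are assumptions; only the 40 WF slots per CU and the three resource checks come from the slides.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Workgroup:
    wg_id: tuple
    wavefronts: int
    vgprs: int
    lds_bytes: int

@dataclass
class ComputeUnit:
    free_wf_slots: int = 40       # 10 WFs per SIMD x 4 SIMDs
    free_vgprs: int = 8192        # assumed capacity, illustrative only
    free_lds_bytes: int = 65536   # assumed capacity, illustrative only

    def can_host(self, wg):
        # Step 3: enough WF slots, enough VGPRs, and enough LDS space.
        return (self.free_wf_slots >= wg.wavefronts
                and self.free_vgprs >= wg.vgprs
                and self.free_lds_bytes >= wg.lds_bytes)

    def reserve(self, wg):
        self.free_wf_slots -= wg.wavefronts
        self.free_vgprs -= wg.vgprs
        self.free_lds_bytes -= wg.lds_bytes

@dataclass
class NDRange:
    pending: deque = field(default_factory=deque)  # unexecuted workgroups

def dispatch_tick(ndrange_queue, cus):
    """Step 1: called once per tick. Returns [(wg_id, cu_index), ...]."""
    dispatched = []
    if not ndrange_queue:
        return dispatched
    ndr = ndrange_queue[0]            # Step 2: oldest NDRange in the queue
    while ndr.pending:
        wg = ndr.pending[0]
        cu_index = next(
            (i for i, cu in enumerate(cus) if cu.can_host(wg)), None)
        if cu_index is None:
            break                     # no CU has room; retry next tick
        cus[cu_index].reserve(wg)
        dispatched.append((wg.wg_id, cu_index))
        ndr.pending.popleft()
    return dispatched

# Two small CUs, three workgroups that each need 4 wavefront slots.
cus = [ComputeUnit(free_wf_slots=4), ComputeUnit(free_wf_slots=4)]
queue = deque([NDRange(deque(
    Workgroup((x, 0, 0), wavefronts=4, vgprs=256, lds_bytes=1024)
    for x in range(3)))])
first_tick = dispatch_tick(queue, cus)
```

In this example the third workgroup stays queued until a CU frees wavefront slots.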


GPU CORE BASED ON GCN ARCHITECTURE
GPU core in the APU simulator modeled after Graphics Core Next (GCN) Architecture
More details available here: GCN Architecture Whitepaper www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

[Figure: GCN compute unit block diagram annotated with the simulator source files that model each block.]
Source files named on the slide:
  fetch_unit.hh/cc
  fetch_stage.hh/cc
  wavefront.hh/cc
  brig_object.hh/cc
  lds_state.hh/cc
  vector_register_file.hh/cc
  hsail_code.hh/cc
  local_memory_pipeline.hh/cc
  vector_register_state.hh/cc
  arch/hsail/decoder.hh
  global_memory_pipeline.hh/cc
  arch/hsail/decoder.cc (auto-generated)
  gpu_static_inst.hh/cc
  compute_unit.hh/cc
  auto-generated ISA-specific instruction classes
  shader.hh/cc

GPU CORE MODULES
GPU CORE MODULES VS. RUBY MODULES
[Figure: hardware building blocks (four GPU cores with L1Ds, a shared SQC (I-Cache), and an L2) mapped to simulator software modules. The GPU cores map to the GPU core modules and the caches map to the Ruby modules; together they form the APU simulator.]
GPU (shader unit) contains multiple CUs
GPU core is the compute unit (compute_unit.[cc|hh])
  Resources inside GPU Core
    Instruction buffering, Registers, Vector ALUs
  Resources outside GPU Core
    TCP, TCC, I-Cache (Ruby based)
Shader (shader.[cc|hh]): Object containing all GPU cores along with other misc. components

GPU CORE MODULE INTERNALS
SHARED VS. PRIVATE STRUCTURES
[Figure: compute unit internals. Instruction Fetch and Instruction Decode feed four SIMD units (SIMD 0-3). Each SIMD has its own wavefront contexts (WF 0-9, 10-19, 20-29, 30-39), vector registers, and vector ALU. The Local Data Share (LDS) and TCP (L1D) are shared by the SIMDs.]
Compute unit (CU)
  Four 16-wide SIMD units for vector processing
  SIMD hosts wavefronts (WF)
Private resources to each SIMD
  Instruction buffering
  Registers
  Vector ALUs
Shared resources
  Fetch and decode
  TCP (L1D)
  Local data share (LDS)
    Similar to Nvidia shared memory

GPU CORE TIMING
DESIGN METHODOLOGY
Functional and timing code separation
  Wherever applicable, and similar to gem5
  Global memory operation timings handled by the Ruby model
  Timing code schedules accesses with some delay and delays responses
  Functional code completes an instruction in a single cycle
Progressively detailed component models
  Examples: register file, TLB operations, etc.
  [Figure: component models range from less detailed to more detailed.]
Register file
  Simple register allocation model available
  Different register organizations and access arbitration policies possible using its API

GPU CORE TIMING
CONCEPTUAL TIMING STAGES
[Figure: Fetch -> (fetched WFs) -> Scoreboard -> (ready WFs) -> Schedule -> (executing WFs) -> Execute -> Memory pipeline]
Execute-in-execute philosophy
Pipeline stages
  Fetch: fetch for dispatched wavefronts - fetch_stage.[hh|cc]
  Scoreboard: Check which wavefronts are ready - scoreboard_check_stage.[hh|cc]
  Schedule: Select a wave from the ready pool - schedule_stage.[hh|cc]
  Execute: Run WF on execution resource - exec_stage.[hh|cc]
  Memory pipeline: Execute LDS/global memory operation
    local memory (LDS): local_memory_pipeline.[hh|cc]
    global memory (TCP): global_memory_pipeline.[hh|cc]
Instructions are decoded when the kernels are loaded

FETCH AND WAVEFRONT CONTEXTS
SQC shared by 4 CUs (GPU_RfO-SQC.sm, GPU_VIPER-SQC.sm)
  # of SQCs and CUs are configurable in simulator
Fetch (fetch_unit.[hh|cc], fetch_stage.[hh|cc])
  Shared and arbitrated between SIMDs in a CU
  Fetch to each SIMD unit
  Don't fetch if instruction buffer (IB) contains branch
WF Contexts (wavefront.[hh|cc])
  10 WFs per SIMD, 40 per CU
  PC and instruction buffers
  Reconvergence stack (for branch divergence)
    Used to determine the reconvergence point for work-items that experience branch divergence, typically the immediate post-dominator

[Figure: CU block diagram below the 4-CU-shared SQC (I-Cache), with instruction fetch and the WF contexts highlighted. Insets: each SIMD holds Wavefront 0 through Wavefront 9, each with its own PC and IB; a control-flow graph whose if and else paths rejoin at the reconvergence point.]
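
A reconvergence stack of the kind mentioned above can be sketched in Python. This is a generic textbook-style scheme, not the simulator's wavefront code; the class name, method names, and the 4-lane masks are illustrative assumptions. Each stack entry holds a next PC, the PC at which it reconverges, and the mask of active work-items.

```python
class ReconvergenceStack:
    """Tracks the active mask of one wavefront across divergent branches."""

    def __init__(self, start_pc, full_mask):
        # Entry layout: [next_pc, reconvergence_pc, active_mask]
        self.stack = [[start_pc, None, full_mask]]

    @property
    def pc(self):
        return self.stack[-1][0]

    @property
    def mask(self):
        return self.stack[-1][2]

    def advance(self, next_pc):
        """Move the current path to next_pc; pop paths that reconverged."""
        self.stack[-1][0] = next_pc
        while len(self.stack) > 1 and self.stack[-1][0] == self.stack[-1][1]:
            self.stack.pop()

    def branch(self, taken_mask, target_pc, fallthrough_pc, reconv_pc):
        """Handle a conditional branch; reconv_pc is its post-dominator."""
        active = self.mask
        taken = taken_mask & active
        not_taken = active & ~taken
        if not taken:
            self.advance(fallthrough_pc)      # uniform: nobody takes it
        elif not not_taken:
            self.advance(target_pc)           # uniform: everybody takes it
        else:
            # Divergence: the current entry resumes at the reconvergence
            # point with its full mask once both paths have finished.
            self.stack[-1][0] = reconv_pc
            self.stack.append([fallthrough_pc, reconv_pc, not_taken])
            self.stack.append([target_pc, reconv_pc, taken])

# 4 work-items; PC 0 is the branch, PC 1 the else block, PC 2 the if
# block, and PC 3 the reconvergence point.
rs = ReconvergenceStack(start_pc=0, full_mask=0b1111)
trace = []
rs.branch(taken_mask=0b0011, target_pc=2, fallthrough_pc=1, reconv_pc=3)
trace.append((rs.pc, rs.mask))    # if path runs first
rs.advance(3)
trace.append((rs.pc, rs.mask))    # then the else path
rs.advance(3)
trace.append((rs.pc, rs.mask))    # all work-items active again
```

The trace shows the two paths executing one after the other under complementary masks before the full mask is restored.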

DECODE AND ISSUE
Decode and issue
  All instructions pre-decoded by loader and cached by PC
  After fetching an instruction, it is retrieved from the decode cache
  Decode and issue N instructions from the SIMD's 10 WFs
  Instruction types supported:
    Branch, vector ALU, vector memory, local data share, export, and special instructions
  Issuing restrictions:
    1. Only one instruction type at a time per SIMD
    2. Each instruction from a different WF

[Figure: CU block diagram with instruction decode highlighted. Inset: each SIMD holds Wavefront 0 through Wavefront 9, each with its own PC and IB.]

VECTOR REGISTER FILE
Vector General Purpose Registers (vGPRs)
  Partitioned into 4 independent slices, 1 per SIMD
    Why? Because each SIMD executes independent WFs
  Configurable size
  32-bit wide
    Combine adjacent vGPRs for 64-bit or 128-bit data
  Each WF also has a set of 32-bit (SReg), 64-bit (DReg), and 1-bit predicate (CReg) registers
Register allocation done by a simple pool manager
  Modular design: more advanced pool managers can be swapped into the VRF seamlessly
  Simple timing model with constant delay

[Figure: CU block diagram with the per-SIMD vector registers (vGPRs) highlighted.]
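
The idea of a swappable register pool manager can be sketched as below. This is a hypothetical bump allocator written for illustration; the class and method names are assumptions and do not reproduce the simulator's pool manager or its API.

```python
class SimplePoolManager:
    """Hands out contiguous vGPR ranges from one SIMD's register slice."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.next_free = 0     # first register not yet handed out
        self.live_regions = 0  # regions currently owned by wavefronts

    def can_allocate(self, num_regs):
        return self.next_free + num_regs <= self.pool_size

    def allocate(self, num_regs):
        """Return the base register index of a new region."""
        if not self.can_allocate(num_regs):
            raise MemoryError("not enough free vGPRs in this slice")
        base = self.next_free
        self.next_free += num_regs
        self.live_regions += 1
        return base

    def free_region(self):
        """Release one region; the pool resets once all are released."""
        self.live_regions -= 1
        if self.live_regions == 0:
            self.next_free = 0

class VectorRegisterFile:
    """One slice per SIMD; the pool manager is passed in, so a more
    advanced manager with the same methods could be swapped in."""

    def __init__(self, manager):
        self.manager = manager

vrf = VectorRegisterFile(SimplePoolManager(pool_size=256))
base_a = vrf.manager.allocate(100)
base_b = vrf.manager.allocate(100)
fits_third = vrf.manager.can_allocate(100)
```

A wavefront that does not fit simply waits, which matches the dispatcher's check for enough VGPRs before placing a workgroup.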


VECTOR ALUs
16-lane vector pipeline per SIMD
  Each lane has a set of functional units
  One work-item per lane
4 cycles to execute a WF for all 64 work-items (in the best case)
  In gem5, 64 work-items are executed in one tick and ticks are multiplied by 4
SIMD execution may take longer if work-items in a WF have dissimilar behaviors
  Example 1: Branch (or spatial) divergence
    Branches executed through predication
    When control flow diverges, all lanes take all paths
    4 cycles for one path and another 4 cycles for the other
  Example 2: Memory (or temporal) divergence
    Longer access latency by one work-item stalls entire WF

[Figure: CU block diagram with the vector ALUs highlighted. Inset: the vector ALU in a SIMD with Lane 0, Lane 1, ..., Lane 15.]
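
The cycle counts quoted above follow from simple arithmetic, shown here as a small Python sketch. The function name and parameters are illustrative; the 64-work-item wavefront and 16 lanes are the values from the slide.

```python
import math

def wf_issue_cycles(wf_size=64, simd_lanes=16, divergent_paths=1):
    """Best-case cycles for one vector instruction per control-flow path."""
    # A 16-lane SIMD needs several passes to cover a whole wavefront.
    cycles_per_path = math.ceil(wf_size / simd_lanes)
    # With predication, every path is executed by the whole wavefront.
    return cycles_per_path * divergent_paths

uniform_cycles = wf_issue_cycles()                      # no divergence
diverged_cycles = wf_issue_cycles(divergent_paths=2)    # if/else both taken
```

Here uniform_cycles is 4 and diverged_cycles is 8, matching the "4 cycles for one path and another 4 cycles for the other" case.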

GPU CORE TIMING
HANDLING MEMORY INSTRUCTIONS
[Figure: GPU dynamic memory instruction -> create packet -> global/LDS operation -> write back]
Memory instructions are handled through memory transactions
  Part of GPU core modules (gpu_dyn_inst.[cc|hh])
Memory instructions are handled in multiple phases
  Appropriate instruction-specific execute() methods per instruction per ISA
  New machine ISAs can use this capability to support their own memory instructions
  Individual stages contribute to the memory instruction timing
  Additionally, timing at the memory end is handled by Ruby and memory technology parameters
  global_memory_pipeline.[hh|cc] and local_memory_pipeline.[hh|cc]

VECTOR MEMORY EXECUTION
[Figure: CU block diagram with the memory path highlighted. Inset: the TCP with address, coalesce, tag, data, and decompression stages and separate read-data and write-data paths, connected to the 4-CU-shared TCC (L2).]
In gem5:
  Address calculation: arch/hsail/insts/mem_impl.hh
  Address coalescing: mem/ruby/system/GPUCoalescer.[hh|cc]
  TCP in mem/protocol/GPU_RfO-TCP.sm, GPU_VIPER-TCP.sm
  TCC in mem/protocol/GPU_RfO-TCC.sm, GPU_VIPER-TCC.sm
Local Data Share (LDS) (lds_state.[hh|cc])
  User-managed address space
  Scratchpad for WFs in workgroup
  Used for data sharing and synchronization within workgroup
  Cleared when workgroup completes
  In gem5, functional model with a pointer per workgroup

HSAIL ISA SUPPORT
Decouple GPU microarchitecture model from ISA specification
  Currently supports HSAIL
  Others could easily be added
Architecture description files
  Instructions are generated similar to gem5 CPU ISAs
  Appropriate class hierarchies (e.g., as in HSAIL)
Interfaces with the GPU core modules
  Execute methods are instruction specific
  Standard set of APIs via instruction wrapper (GPUStaticInst class)
  Wavefront exposes the APIs to instructions

[Figure: GPU core model components on one side and ISA-specific instruction classes and methods on the other, connected by two interfaces: WF-related interfaces (static instruction objects with no dynamic information) and instruction and ISA specific classes (dynamic instruction information and interfaces).]

ISA DESCRIPTION/MICROARCHITECTURE SEPARATION
GPUStaticInst & GPUDynInst (gpu_static_inst.[hh|cc], gpu_dyn_inst.[hh|cc])
  Base instruction classes
  Define API for instruction execution
    E.g., execute(): perform instruction execution
  Implemented by ISA-specific instruction classes, e.g., HsailGPUStaticInst (arch/hsail/insts/gpu_static_inst.[hh|cc])
GPUExecContext (gpu_exec_context.[hh|cc])
  Define API for accessing ISA state
Architecture-specific code: src/gpu/arch/

[Figure: three layers.
 GPU core state [src/gpu]: Shader, Compute Units, Wavefronts
 GPU Core/ISA API definition [src/gpu]: GPUStaticInst, GPUDynInst, GPUExecContext, HSACode
 ISA-specific state [src/gpu/arch]: HSAIL Static Inst, HSAIL Code, HSAIL Decoder, Operands, ISA State
 The GPU core model components and the ISA-specific instruction classes and methods are connected by wavefront-related interfaces (static instruction objects with no dynamic information) and by instruction and ISA specific classes (dynamic instruction information and interfaces).]

PSEUDO-INSTRUCTION
Magic instructions for GPU kernels: researcher-defined functionality

Explore new hardware mechanisms


Profile and trace
Debug simulated kernels
Uses HSAIL Call instruction
Calls function based on signature
Prefix function with __gem5_hsail_op
Call::execute() checks for __gem5_hsail_op in function name, calls appropriate pseudo op

Examples include
HSAIL instructions not exposed in high-level languages (e.g., cross-lane instructions)
Print statements and panic instruction within the GPU kernel
GDB break points

Source files
[gem5] src/gpu/arch/hsail/insts/decl.hh
[gem5] src/gpu/arch/hsail/insts/pseudo_inst.cc
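
The name-based dispatch described above can be sketched in Python. Only the __gem5_hsail_op marker comes from the slide; the registry, the decorator, and the op names print and panic are hypothetical and chosen for illustration.

```python
PSEUDO_OP_MARKER = "__gem5_hsail_op"

# Registry of pseudo-op handlers, keyed by the full function name.
PSEUDO_OPS = {}

def pseudo_op(suffix):
    """Register a handler under the name __gem5_hsail_op_<suffix>."""
    def register(handler):
        PSEUDO_OPS[f"{PSEUDO_OP_MARKER}_{suffix}"] = handler
        return handler
    return register

@pseudo_op("print")    # hypothetical pseudo-op name
def op_print(args, log):
    log.append("kernel print: " + " ".join(str(a) for a in args))

@pseudo_op("panic")    # hypothetical pseudo-op name
def op_panic(args, log):
    raise RuntimeError("kernel panic: " + " ".join(str(a) for a in args))

def execute_call(func_name, args, log):
    """Return True if the call was consumed as a pseudo-op."""
    if PSEUDO_OP_MARKER not in func_name:
        return False           # ordinary call, handled elsewhere
    handler = PSEUDO_OPS.get(func_name)
    if handler is None:
        raise NotImplementedError(f"unknown pseudo-op {func_name}")
    handler(args, log)
    return True

log = []
handled = execute_call("__gem5_hsail_op_print", ["wf", 3], log)
ordinary = execute_call("my_device_function", [], log)
```

Keeping the handlers in a table makes adding a researcher-defined operation a matter of registering one more function.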


RUBY BACKGROUND
Flexible Memory System
Rich configuration
Simulate combination of caches, coherence, interconnect, etc

Rapid prototyping
Domain-Specific Language (SLICC) for coherence protocols
Modular components

Detailed statistics
Latency distributions for requests
Generated state transitions, network utilization, etc.

Detailed component simulation


Network (fixed/flexible Garnet pipelines and simple)
Caches (pluggable replacement policies)
Memory (shared memory controllers between Classic and Ruby)


SYNCHRONIZATION BACKGROUND
Traditional synchronization
Kernel Begin: All stores from CPU and prior kernel completions are visible.
Kernel End: All stores from a kernel are visible to CPU and future kernels.
Barrier: All members of a workgroup are at the same PC and all prior stores in program
order will be visible.

HSA specification includes Load-Acquire and Store-Release


Load-Acquire (LdAcq): A load that occurs before all memory operations later in
program order (like Kernel Begin).
Store-Release (StRel): A store that occurs after all prior memory operations in program
order (like Kernel End or Barrier).


APU CONTRIBUTIONS TO RUBY/SLICC


4 APU SLICC protocols
1 GPU Read-for-Ownership (RFO) protocol
Used by Hechtman et al. [HPCA 2014] and Orr et al. [ISCA 2014]

3 GPU Write-Through protocols called VIPER


Used by Hechtman et al. [HPCA 2014] and Power et al. [MICRO 2013]

All 4 protocols use the same CPU cache controllers


Core-Pair design with write-through L1 caches (separate L1 D, shared L1 I & L2)

Also includes 1 CPU-only protocol: MOESI_AMD_Base


Support single coherent address space and copies between emulated address spaces

Request coalescing
Hierarchical network topology configuration


GPU RFO COHERENCE


READ-FOR-OWNERSHIP COHERENCE, SEQUENTIAL CONSISTENCY

Maintains the single writer / multiple reader


invariant per cache block
Typical for existing CPU protocols
Non-typical for GPUs, but provides good comparison

Wavefront coalescing of writes at the CU


LdAcq, StRel are simply Ld and St operations
Invalidations, dirty-writeback and data responses

Tested using the ruby random tester


Uses an inclusive directory at the TCC


GPU RFO PROTOCOL
[Figure: example APU block diagram annotated with the SLICC file that implements each controller. The directory connects to the memory controller and memory.]

Component                        SLICC file
GPU L1I (SQC)                    GPU_RfO-SQC.sm
GPU L1D (TCP)                    GPU_RfO-TCP.sm
GPU L2 (TCC)                     GPU_RfO-TCC.sm
GPU L2 directory                 GPU_RfO-TCCdir.sm
CPU core pair (I-Cache, L1D, L2) MOESI_AMD_Base-CorePair.sm
Directory                        MOESI_AMD_Base-dir.sm

GPU VIPER COHERENCE PROTOCOLS


WRITE-THROUGH COHERENCE, RELEASE CONSISTENCY

GPU write-through protocols


Writes performed immediately
No stalling for exclusive permissions
Maintains per-byte dirty masks

LdAcq -> invalidate entire L1 cache


StRel -> ensure prior write-throughs complete
Minimal data verification testing
A great opportunity for research

Uses the same CPU core-pair cache controller as GPU_RfO
3 variants with different directory implementations
GPU_VIPER: Stateless Directory
GPU_VIPER_Baseline: Probe Filter Directory
GPU_VIPER_Region: Region Directory
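
The write-through behavior listed above can be illustrated with a toy Python model. It is a sketch under simplifying assumptions, not the VIPER SLICC protocol: the class name, the shared bytearray standing in for the L2 and memory, and the explicit drain() step are all hypothetical.

```python
class WriteThroughL1:
    """Toy write-through L1 with per-byte dirty masks."""

    def __init__(self, backing, block_size=64):
        self.backing = backing      # shared bytearray: stand-in for L2/memory
        self.block_size = block_size
        self.lines = {}             # block address -> local copy
        self.pending = {}           # block address -> (data, dirty mask)

    def _block(self, addr):
        return addr - addr % self.block_size

    def load(self, addr):
        blk, off = self._block(addr), addr % self.block_size
        # A core always sees its own not-yet-completed writes.
        if blk in self.pending and self.pending[blk][1][off]:
            return self.pending[blk][0][off]
        if blk not in self.lines:
            self.lines[blk] = bytearray(
                self.backing[blk:blk + self.block_size])
        return self.lines[blk][off]

    def store(self, addr, value):
        # No exclusive permission is requested; only the byte is marked.
        blk, off = self._block(addr), addr % self.block_size
        if blk in self.lines:
            self.lines[blk][off] = value
        data, mask = self.pending.setdefault(
            blk, (bytearray(self.block_size), [False] * self.block_size))
        data[off] = value
        mask[off] = True

    def drain(self):
        """Complete write-throughs, merging only the dirty bytes."""
        for blk, (data, mask) in self.pending.items():
            for off, dirty in enumerate(mask):
                if dirty:
                    self.backing[blk + off] = data[off]
        self.pending.clear()

    def load_acquire(self, addr):
        self.lines.clear()          # LdAcq: invalidate the entire L1
        return self.load(addr)

    def store_release(self, addr, value):
        self.drain()                # StRel: prior write-throughs complete
        self.store(addr, value)
        self.drain()                # simplification: release visible at once

memory = bytearray(128)
l1_a, l1_b = WriteThroughL1(memory), WriteThroughL1(memory)
l1_b.load(0)                        # B caches block 0 while byte 0 is 0
l1_a.store(0, 7)                    # A and B write different bytes
l1_b.store(1, 9)                    # of the same block
l1_a.drain()
l1_b.drain()
stale = l1_b.load(0)                # plain load may hit the stale copy
fresh = l1_b.load_acquire(0)        # acquire refetches the block
```

Because each write-through carries a byte mask, the two cores update the same block without overwriting each other's bytes.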

GPU VIPER PROTOCOL
[Figure: example APU block diagram annotated with the SLICC file that implements each controller. The stateless directory connects to the memory controller and memory.]

Component                        SLICC file
GPU L1I (SQC)                    GPU_VIPER-SQC.sm
GPU L1D (TCP)                    GPU_VIPER-TCP.sm
GPU L2 (TCC)                     GPU_VIPER-TCC.sm
CPU core pair (L1I, L1D, L2)     MOESI_AMD_Base-CorePair.sm
Stateless Directory              MOESI_AMD_Base-dir.sm

GPU VIPER BASELINE PROTOCOL
[Figure: example APU block diagram annotated with the SLICC file that implements each controller. The probe filter connects to the memory controller and memory.]

Component                        SLICC file
GPU L1I (SQC)                    GPU_VIPER-SQC.sm
GPU L1D (TCP)                    GPU_VIPER-TCP.sm
GPU L2 (TCC)                     GPU_VIPER-TCC.sm
CPU core pair (L1I, L1D, L2)     MOESI_AMD_Base-CorePair.sm
Probe Filter                     MOESI_AMD_Base-probeFilter.sm

GPU VIPER REGION PROTOCOL
[Figure: example APU block diagram annotated with the SLICC file that implements each controller. The GPU and the CPU each have a RegionBuffer next to their L2; a RegionDir sits beside the directory, which connects to the memory controller and memory.]

Component                        SLICC file
GPU L1I (SQC)                    GPU_VIPER-SQC.sm
GPU L1D (TCP)                    GPU_VIPER-TCP.sm
GPU L2 (TCC)                     GPU_VIPER-Region-TCC.sm
CPU core pair (L1I, L1D, L2)     MOESI_AMD_Base-Region-CorePair.sm
RegionBuffer                     MOESI_AMD_Base-RegionBuffer.sm
RegionDir                        MOESI_AMD_Base-RegionDir.sm
Directory                        MOESI_AMD_Base-Region-dir.sm

RUBY / GPU CORE INTERFACE
GPU MEMORY COALESCING CRITICAL FOR PERFORMANCE
[Figure: path of memory requests from a GPU core through the Ruby Port / GPU Coalescer to the L1D.]
GPU core to coalescer: 64 M5 ports
  One pkt per work-item request (byte address)
Ruby Port / GPU Coalescer
  Processed and buffered using higher priority events at the beginning of the cycle
  Coalesced using lower priority events at the end of the cycle
Coalescer to L1D: 1 mandatory queue
  One RubyRequest per cache block (block-aligned address)
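
The effect of coalescing can be shown with a few lines of Python. This is a sketch of the general idea only, not the GPUCoalescer implementation; the function name and the 64-byte block size are assumptions.

```python
def coalesce(byte_addrs, block_size=64):
    """Group per-work-item byte addresses into block-aligned requests.

    Returns {block_address: [lanes that touch that block]}.
    """
    requests = {}
    for lane, addr in enumerate(byte_addrs):
        block = addr - addr % block_size
        requests.setdefault(block, []).append(lane)
    return requests

# 64 work-items reading consecutive 4-byte words: 256 bytes in total.
unit_stride = coalesce([0x1000 + 4 * lane for lane in range(64)])

# 64 work-items each touching a different cache block.
scattered = coalesce([0x1000 + 64 * lane for lane in range(64)])
```

The unit-stride access needs 4 block requests while the scattered access needs 64, which is why coalescing is critical for performance.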


APU SYSTEM: A HIERARCHY OF CLUSTERS


DEFAULT CONFIGURATION

[Diagram: the APU system as three clusters]
GPU Cluster: GPU I-Cache, four GPU cores each with an L1D, and an L2
CPU Cluster: CPU I-Cache, CPU0 and CPU1 each with an L1D, and an L2
Main Cluster: Directory, Memory Controller, and Memory

Mapping to gem5 directories:
GPU cores: src/gpu-compute
CPU cores: src/cpu
Caches, directory, and memory controller: src/mem/ruby and src/mem/protocol
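
As a rough picture of the default configuration, the cluster hierarchy can be written down as nested data. The `Cluster` class below is a hypothetical stand-in written for this example, not gem5 code, and placing the GPU and CPU clusters under the Main cluster is an assumption suggested by the slide title; the component names follow the diagram.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster: a name, components, and sub-clusters."""
    name: str
    components: List[str] = field(default_factory=list)
    children: List["Cluster"] = field(default_factory=list)

    def all_components(self):
        """Return this cluster's components followed by those of its sub-clusters."""
        found = list(self.components)
        for child in self.children:
            found.extend(child.all_components())
        return found


# Component lists taken from the diagram of the default configuration.
gpu = Cluster("GPU Cluster",
              ["GPU I-Cache"] + ["GPU Core", "L1D"] * 4 + ["L2"])
cpu = Cluster("CPU Cluster",
              ["CPU I-Cache", "CPU0", "L1D", "CPU1", "L1D", "L2"])

# Assumption: the Main cluster is the root of the hierarchy.
main = Cluster("Main Cluster",
               ["Directory", "Memory Controller", "Memory"],
               children=[gpu, cpu])

parts = main.all_components()
```

Walking the tree from `main` lists every component once, which is a convenient way to check a modified configuration against the diagram.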

TILED APU SYSTEM

ANOTHER GPU SUBSYSTEM CONFIGURATION

[Diagram: a tile with two GPU cores, each with an L1, plus an L2 and a NOC router]

Requires minor modifications to the configuration files (configs/ruby/GPU*.py)

OUTLINE
Topic                                 Presenter  Time
Background                            Brad       8:45-9:05
Compilation and Simulation flow       Tony       9:05-9:30
GPU Core Model                        Tony       9:30-10:00
Break                                            10:00-10:30
Ruby Memory Contributions             Brad       10:30-11:00
Demo                                  Tony       11:00-11:20
Comparisons/Limitations/Future Work   Brad       11:20-11:45
Questions                             Both       11:45-12:00


COMPARISON TO OTHER GPU SIMULATORS


GPGPU-Sim
Primarily focused on running Nvidia PTX instructions and CUDA applications
Functional CPU model, oriented to model discrete GPU systems
Wisconsin's gem5-gpu added gem5 timing CPU models and a Ruby memory system protocol

Differences from gem5-gpu:


HSAIL instructions and OpenCL apps
Multiple Ruby protocols
Unified under the gem5 source control repo

Multi2Sim
Supports multiple ISAs including AMD Southern Islands Machine ISA
Limited instruction support
No transient states in coherence protocol

This is very different from the gem5 NoMali emulated GPU device

LIMITATIONS
No IOMMU
Primitive TLB model

Not full-system
The driver does not support the different HSA memory segments
There is no support for flat addressing in the emulated cl-runtime

Direct execution of HSAIL


Does not execute finalized machine ISA kernels
Lacks detail such as the scalar core/registers, etc.

Use of backing store for memory data


When running applications, the data from the caches is not used

Public model has not been correlated against real hardware

No graphics functionality: compute only
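
The backing-store limitation listed above can be pictured with a small sketch: data values always come from a functional memory image, while the cache model only decides how long an access takes. All class names, the block size, and the latencies here are hypothetical and chosen for illustration; this is not how the simulator is implemented.

```python
class BackingStore:
    """Functional memory image: the single source of data values."""

    def __init__(self):
        self.data = {}

    def read(self, addr):
        return self.data.get(addr, 0)  # unwritten locations read as 0

    def write(self, addr, value):
        self.data[addr] = value


class TimingOnlyCache:
    """Tracks which blocks are present to decide latency; stores no data."""

    def __init__(self, block_size=64, hit_latency=1, miss_latency=100):
        self.block_size = block_size
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        self.present = set()

    def access(self, addr):
        block = addr // self.block_size
        if block in self.present:
            return self.hit_latency
        self.present.add(block)  # fill the block on a miss
        return self.miss_latency


def load(addr, cache, store):
    """Return (value, latency): value from the backing store, latency from the cache."""
    latency = cache.access(addr)
    return store.read(addr), latency


store = BackingStore()
cache = TimingOnlyCache()
store.write(0x40, 7)
first = load(0x40, cache, store)   # first touch of the block: a miss
second = load(0x44, cache, store)  # same block: a hit
```

In a model like this, a bug that corrupts cached data would go unnoticed because the application never consumes that data, which is why the slide lists it as a limitation.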


OBVIOUS IMPROVEMENTS
Detailed performance correlation
Validation of coherence protocols / memory models
4 new coherence protocols

More programming models


Beyond OpenCL

Other GPU ISAs


APU SIMULATOR CODE ORGANIZATION


cl-runtime: separate tarball available at gem5.org/gpu_models
gem5 top-level directory
--- src
------ gpu-compute: GPU core model
------ mem/protocol: APU memory model
------ mem/ruby: APU memory model
--- configs
------ example: apu_se.py sample script
------ ruby: APU protocol configs

SUMMARY
Covered a very high-level overview of:
Introduction to the gem5 APU simulator
Mapping between APU system and gem5 APU simulator

Topics discussed
HSA and GPU Background
Compilation and Simulation Flow
GPU Core modules
GPU memory system models in Ruby
Comparisons/Limitations/Improvements
Code organization

Much more detail in the gem5 source code


Please contribute back to this community tool!

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES,
ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). OpenCL is a
trademark of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.
