Native shader compilation with LLVM
Mark Leone
Why compile shaders?
RenderMan's SIMD interpreter is hard to beat.
Amortizes interpretive overhead over batches of points.
Shading is dominated by floating point calculations.
SIMD interpreter
For each instruction in shader:
Decode and dispatch instruction.
For each point in batch:
If runflag is on:
Load operands.
Compute.
Store result.
SIMD interpreter: example inner loop
void add(int numPoints, bool* runflags,
float* dest, float* src1, float* src2)
{
for (int i = 0; i < numPoints; ++i)
{
if (runflags[i])
dest[i] = src1[i] + src2[i];
}
}
SIMD interpreter: benefits
Interpretive overhead is amortized (if batch is large).
Uniform operations can be executed once per batch.
Derivatives are easy: neighboring values are always ready.
SIMD interpreter: drawbacks
Low compute density, poor instruction-level parallelism
Load, compute, store, repeat.
Poor locality, high memory traffic
Intermediate results are stored in memory, not registers.
High overhead for small batches
Difficult to vectorize (pointers and conditionals).
Compiled shader execution
For each point in batch:
Load inputs.
For each instruction in shader:
Compute.
Store outputs.
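A minimal sketch of what the compiled form of the earlier "add" example might look like (not Pixar's actual generated code; the shader body and the gain parameter are hypothetical). The whole body is fused into one per-point loop, so intermediates stay in registers:
/* Compiled form of a small shader body "tmp = src1 + src2; dest = tmp * gain". */
void shadeBatch(int numPoints, const bool* runflags,
                float* dest, const float* src1, const float* src2, float gain)
{
    for (int i = 0; i < numPoints; ++i)
    {
        if (!runflags[i])
            continue;
        float tmp = src1[i] + src2[i];   /* intermediate lives in a register */
        dest[i] = tmp * gain;            /* only the final output is stored  */
    }
}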
Benefits of native compilation
Eliminates interpretive overhead. Good for small batches.
Good locality and register utilization.
Intermediate results are stored in registers, not memory.
Good instruction-level parallelism.
Instruction scheduling avoids pipeline stalls.
Vectorizes easily.
Issues: batch shading
Use vectorized shaders on small batches.
Uniform operations: once per grid, not once per point.
Some are very expensive (e.g. plugin calls).
Derivatives: need "previously" computed values from neighboring points.
RSL permits derivatives of arbitrary expressions.
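A hedged sketch of why derivatives force synchronization (the names and grid layout here are hypothetical, not RenderMan's internals): the derivative of an expression at a grid point is a finite difference against the neighboring point in u, so the neighbor's value must already be available.
float DuApprox(const float* value, const float* u, int i, int gridWidth)
{
    /* pick the u-neighbor; fall back to the previous point on the last column */
    int j = (i % gridWidth == gridWidth - 1) ? i - 1 : i + 1;
    return (value[j] - value[i]) / (u[j] - u[i]);
}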
Why vectorize?
Consider batch execution of a compiled shader:
For each point in batch:
Load inputs.
For each instruction in shader:
Compute.
Store outputs.
Why vectorize?
Consider batch execution of a vectorized shader:
For each block of 4 or 8 points in batch:
Load inputs.
For each instruction in shader:
Compute on vector registers (with mask)
Store outputs.
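A sketch of one vectorized "add" over a block of four points using SSE intrinsics. It assumes 16-byte-aligned SOA data and a per-lane float mask (all-ones or all-zero bits) derived from the runflags; it is illustrative, not the renderer's actual code.
#include <xmmintrin.h>

void add4(float* dest, const float* src1, const float* src2, const float* mask)
{
    __m128 a   = _mm_load_ps(src1);
    __m128 b   = _mm_load_ps(src2);
    __m128 m   = _mm_load_ps(mask);        /* lane mask from runflags */
    __m128 sum = _mm_add_ps(a, b);
    __m128 old = _mm_load_ps(dest);
    /* keep the old value where the lane is off */
    __m128 res = _mm_or_ps(_mm_and_ps(m, sum), _mm_andnot_ps(m, old));
    _mm_store_ps(dest, res);
}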
Simple vector code generation
Consider using SSE instructions only for vectors and matrices:
float dot(vector v1, vector v2)
{
vector v0 = v1 * v2;
return v0.x + v0.y + v0.z;
}
load4  r1, [v1]
load4  r2, [v2]
mult4  r3, r1, r2
move   r0, r3.x
add    r0, r3.y
add    r0, r3.z

[Figure: vector utilization]
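The same approach written with SSE intrinsics (a sketch; it assumes v1 and v2 point at xyz plus one float of padding). Only the multiply uses the vector unit, at most three of four lanes do useful work, and the horizontal sum falls back to scalar adds:
#include <xmmintrin.h>

float dot3(const float* v1, const float* v2)
{
    __m128 a = _mm_loadu_ps(v1);
    __m128 b = _mm_loadu_ps(v2);
    __m128 p = _mm_mul_ps(a, b);      /* fourth lane is wasted work */
    float tmp[4];
    _mm_storeu_ps(tmp, p);
    return tmp[0] + tmp[1] + tmp[2];  /* scalar horizontal adds */
}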
Shader vectorization
To vectorize, first scalarize:
float dot(vector v1, vector v2)
{
vector v0 = v1 * v2;
return v0.x + v0.y + v0.z;
}
float dot(vector v1, vector v2)
{
float x = v1.x * v2.x;
float y = v1.y * v2.y;
float z = v1.z * v2.z;
return x + y + z;
}
Scalar code generation
Next, generate ordinary scalar code:
float dot(vector v1, vector v2)
{
float x = v1.x * v2.x;
float y = v1.y * v2.y;
float z = v1.z * v2.z;
return x + y + z;
}
load r1, [v1.x]
load r2, [v2.x]
mult r0, r1, r2
load r1, [v1.y]
load r2, [v2.y]
mult r3, r1, r2
load r1, [v1.z]
load r2, [v2.z]
mult r4, r1, r2
add  r0, r0, r3
add  r0, r0, r4
Vectorize for batch of four
Finally, widen each instruction for a batch size of four:
float dot(vector v1, vector v2)
{
float x = v1.x * v2.x;
float y = v1.y * v2.y;
float z = v1.z * v2.z;
return x + y + z;
}
load4 r1, [v1.x]
load4 r2, [v2.x]
mult4 r0, r1, r2
load4 r1, [v1.y]
load4 r2, [v2.y]
mult4 r3, r1, r2
load4 r1, [v1.z]
load4 r2, [v2.z]
mult4 r4, r1, r2
add4  r0, r0, r3
add4  r0, r0, r4

[Figure: vector utilization]
Struct of arrays (SOA)
Normally a batch of vectors is an array of structs (AOS):
x y z x y z x y z x y z . . .
Vector load instructions (in SSE) require contiguous data.
Store batch of vectors as a struct of arrays (SOA):
x x x x . . . y y y y . . . z z z z . . .
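Illustrative layouts only (BatchAOS and BatchSOA are hypothetical names). AOS interleaves x, y, z, so loading four x values needs a gather or shuffles; SOA keeps each component contiguous, so a single aligned vector load fetches four x values at once.
enum { BATCH_SIZE = 64 };

struct Vec3     { float x, y, z; };
struct BatchAOS { Vec3 p[BATCH_SIZE]; };   /* x y z x y z ... */

struct BatchSOA {                          /* x x x ... y y y ... z z z ... */
    float x[BATCH_SIZE];
    float y[BATCH_SIZE];
    float z[BATCH_SIZE];
};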
Masking / blending
Use a mask to avoid clobbering components of registers used
by the other branch.
No masking in SSE.
Use variable blend in SSE4:
blend(a, b, mask)
{
    return (a & mask) | (b & ~mask);
}
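A sketch of the same blend with the SSE4.1 intrinsic. The argument order is swapped so the result takes 'a' where the mask is on, matching the pseudocode above; _mm_blendv_ps selects its second operand on lanes whose mask sign bit is set.
#include <smmintrin.h>

__m128 blend4(__m128 a, __m128 b, __m128 mask)
{
    return _mm_blendv_ps(b, a, mask);   /* per-lane select on the mask sign bit */
}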
No need to blend each instruction
Blend at basic block boundaries (at phi nodes in SSA).
Vectorization: recent work
ispc: Intel SPMD program compiler (Matt Pharr)
Beyond Programmable Shading course, SIGGRAPH 2011
Open source: ispc.github.com
Whole function vectorization in AnySL (Karrenberg et al.)
Code Generation and Optimization 2011
Film shading on GPUs
Previous work
LightSpeed (Ragan-Kelley et al. SIGGRAPH 2007)
RenderAnts (Zhou et al. SIGGRAPH Asia 2009)
Code generation is easier now (thanks CUDA, OpenCL).
PTX
AMD IL
LLVM and Clang
GPU code generation with LLVM
NVIDIA's LLVM-to-PTX code generator (Grover)
Not to be confused with the PTX-to-LLVM front end (PLANG)
Incomplete PTX support in llvm-trunk (Chiou)
Google summer of code project (Holewinski)
Experimental PTX back end for AnySL (Rhodin)
LLVM to AMD IL (Villmow)
Issues: GPU code generation
Film shaders interoperate with the renderer.
File I/O: textures, pointclouds, etc. (out of core).
Shader plugins (DSOs).
Sampling, ray tracing.
Answer: multi-pass partitioning (Riffel et al. GH 2004)
Partitioning
Nf = faceforward(normalize(N), I);
Ci = Os * Cs * ( Ka*ambient() + Kd*diffuse(Nf) );

[Figure: dataflow graph of this shader, with nodes for normalize, faceforward, ambient, diffuse, the Ka and Kd scales, the Cs and Os multiplies, and the final Ci output]
Multi-pass partitioning for CPU
Synchronize for GPU calls, uniform operations, derivatives.
Does not require hardware threads or locks.
A thread yields by returning (to a scheduler).
Intermediate data is stored in a cactus stack (Cilk) or continuation closures (CPS).
Data management and scheduling is a key problem (Budge et al. Eurographics 2009).
Issues: summary
CPU code generation (perhaps JIT)
Vectorization
GPU code generation
Multi-pass partitioning
Introduction to LLVM
Mid-level intermediate representation (IR)
High-level types: structs, arrays, vectors, functions.
Control-flow graph: basic blocks with branches
Many modular analysis and optimization passes.
Code generation for x86, x64, ARM, ...
Just-in-time (JIT) compiler too.
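A rough sketch of JIT execution with the LLVM C++ API of this era (the legacy JIT in LLVM 2.x/3.x); header locations and the EngineBuilder interface vary across LLVM versions, so treat this as an assumption-laden outline rather than a recipe.
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/JIT.h"
#include "llvm/Support/TargetSelect.h"

float runJIT(llvm::Module* module, llvm::Function* shaderFn, float arg)
{
    llvm::InitializeNativeTarget();
    llvm::ExecutionEngine* engine = llvm::EngineBuilder(module).create();
    typedef float (*ShaderFn)(float);
    ShaderFn fn = (ShaderFn) engine->getPointerToFunction(shaderFn);
    return fn(arg);
}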
Example: from C to LLVM IR
float sqrt(float f) {
return (f > 0.0f) ? sqrtf(f) : 0.0f;
}
define float @sqrt(float %f)
{
entry: %0 = fcmp ogt float %f, 0.0
br i1 %0, label %bb1, label %bb2
bb1:
%1 = call float @sqrtf(float %f)
ret float %1
bb2:
ret float 0.0
}
Example: from C to LLVM IR
void foo(int x, int y)
{
int z = x + y;
...
}
define void @foo(i32 %x, i32 %y)
{
%z = alloca i32
%1 = add i32 %y, %x
store i32 %1, i32* %z
...
}
Writing a simple code generator
class BinaryExp : public Exp
{
    char m_operator;
    Exp* m_left;
    Exp* m_right;

    virtual llvm::Value* Codegen(llvm::IRBuilder* builder);
};
Writing a simple code generator
llvm::Value*
BinaryExp::Codegen(llvm::IRBuilder* builder)
{
    llvm::Value* L = m_left->Codegen(builder);
    llvm::Value* R = m_right->Codegen(builder);
    switch (m_operator) {
        case '+': return builder->CreateFAdd(L, R);
        case '-': return builder->CreateFSub(L, R);
        case '*': return builder->CreateFMul(L, R);
        ...
    }
}
Advantages of LLVM
Well designed intermediate representation (IR).
Wide range of optimizations (configurable).
JIT code generation.
Interoperability.
Interoperability
Shaders can call out to renderer via C ABI.
We can inline library code into compiled shaders.
Compile C++ to LLVM IR with Clang.
This greatly simplifies code generation.
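A sketch of the C-ABI boundary (the function names here are hypothetical). The compiled shader calls an extern "C" entry point in the renderer, and renderer or library code can itself be compiled to IR for inlining, e.g. with clang++ -emit-llvm -c shadelib.cpp -o shadelib.bc.
extern "C" float RendererTexture1f(void* textureHandle, float s, float t);

extern "C" float shadeSurface(void* textureHandle, float s, float t, float gain)
{
    /* generated shader code calls back into the renderer via the C ABI */
    return gain * RendererTexture1f(textureHandle, s, t);
}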
Weaknesses of LLVM
No automatic vectorization.
Poor support for vector-oriented code generation.
No predication.
Few vector instructions, must resort to SSE/AVX intrinsics.
LLVM resources
www.llvm.org/docs
Language Reference Manual
Getting Started Guide
LLVM Tutorial (section 3)
Relevant open source projects
ispc.github.com
github.com/MarkLeone/PostHaste
Questions?
Mark Leone
mleone@wetafx.co.nz