This document discusses using LLVM to compile shaders natively rather than interpreting them. It begins by explaining the benefits of a SIMD interpreter but also its drawbacks like low instruction parallelism. Native compilation avoids interpretation overhead but requires handling issues like derivatives across points. The document explores vectorizing shaders to operate on batches of points, and discusses generating code for CPUs and GPUs. It introduces LLVM as a compiler infrastructure that can be used to generate optimized code from an intermediate representation.

Native shader compilation

with LLVM
Mark Leone

Why compile shaders?


RenderMan's SIMD interpreter is hard to beat.

Amortizes interpretive overhead over batches of points.


Shading is dominated by floating-point calculations.

SIMD interpreter
For each instruction in shader:
    Decode and dispatch instruction.
    For each point in batch:
        If runflag is on:
            Load operands.
            Compute.
            Store result.

SIMD interpreter: example inner loop


void add(int numPoints, bool* runflags,
         float* dest, float* src1, float* src2)
{
    for (int i = 0; i < numPoints; ++i)
    {
        if (runflags[i])
            dest[i] = src1[i] + src2[i];
    }
}

SIMD interpreter: benefits


Interpretive overhead is amortized (if batch is large).
Uniform operations can be executed once per batch.
Derivatives are easy: neighboring values are always ready.

SIMD interpreter: drawbacks


Low compute density, poor instruction-level parallelism.

SIMD interpreter: example inner loop


void add(int numPoints, bool* runflags,
         float* dest, float* src1, float* src2)
{
    for (int i = 0; i < numPoints; ++i)
    {
        if (runflags[i])
            dest[i] = src1[i] + src2[i];
    }
}

SIMD interpreter: drawbacks


Low compute density, poor instruction-level parallelism.
    Load, compute, store, repeat.
Poor locality, high memory traffic.
    Intermediate results are stored in memory, not registers.
High overhead for small batches.
Difficult to vectorize (pointers and conditionals).

Compiled shader execution


For each point in batch:
    Load inputs.
    For each instruction in shader:
        Compute.
    Store outputs.

Benefits of native compilation


Eliminates interpretive overhead. Good for small batches.
Good locality and register utilization.
    Intermediate results are stored in registers, not memory.
Good instruction-level parallelism.
    Instruction scheduling avoids pipeline stalls.
Vectorizes easily.

Issues: batch shading


Use vectorized shaders on small batches.
Uniform operations: once per grid, not once per point.
    Some are very expensive (e.g. plugin calls).
Derivatives: need "previously" computed values from neighboring points.
    RSL permits derivatives of arbitrary expressions.

Why vectorize?
Consider batch execution of a compiled shader:

For each point in batch:
    Load inputs.
    For each instruction in shader:
        Compute.
    Store outputs.

Why vectorize?
Consider batch execution of a vectorized shader:

For each block of 4 or 8 points in batch:
    Load inputs.
    For each instruction in shader:
        Compute on vector registers (with mask).
    Store outputs.

Simple vector code generation


Consider using SSE instructions only for vectors and matrices:
float dot(vector v1, vector v2)
{
    vector v0 = v1 * v2;
    return v0.x + v0.y + v0.z;
}

load4  r1, [v1]
load4  r2, [v2]
mult4  r3, r1, r2
move   r0, r3.x
add    r0, r3.y
add    r0, r3.z

[Figure: vector utilization]

Shader vectorization
To vectorize, first scalarize:
float dot(vector v1, vector v2)
{
    vector v0 = v1 * v2;
    return v0.x + v0.y + v0.z;
}

float dot(vector v1, vector v2)
{
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}

Scalar code generation


Next, generate ordinary scalar code:
float dot(vector v1, vector v2)
{
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}

load  r1, [v1.x]
load  r2, [v2.x]
mult  r0, r1, r2
load  r1, [v1.y]
load  r2, [v2.y]
mult  r3, r1, r2
add   r0, r0, r3
load  r1, [v1.z]
load  r2, [v2.z]
mult  r3, r1, r2
add   r0, r0, r3
Vectorize for batch of four


Finally, widen each instruction for a batch size of four:
float dot(vector v1, vector v2)
{
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}

load4  r1, [v1.x]
load4  r2, [v2.x]
mult4  r0, r1, r2
load4  r1, [v1.y]
load4  r2, [v2.y]
mult4  r3, r1, r2
add4   r0, r0, r3
load4  r1, [v1.z]
load4  r2, [v2.z]
mult4  r3, r1, r2
add4   r0, r0, r3

[Figure: vector utilization]

Struct of arrays (SOA)


Normally a batch of vectors is an array of structs (AOS):
x y z x y z x y z x y z . . .

Vector load instructions (in SSE) require contiguous data.


Store batch of vectors as a struct of arrays (SOA):
x x x x . . . y y y y . . . z z z z . . .
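A minimal sketch of the SOA layout in C, assuming SSE2 intrinsics (the struct and function names are illustrative, not from the slides): the vectorized dot product from the previous slide, where each load pulls four contiguous x's (or y's, or z's) — exactly what the AOS layout cannot provide.

```c
#include <assert.h>
#include <immintrin.h>

typedef struct {      /* SOA batch: x x x x ..., y y y y ..., z z z z ... */
    const float* x;
    const float* y;
    const float* z;
} VecBatchSOA;

/* Dot product for four points at once from SOA data. */
static void dot4_soa(VecBatchSOA v1, VecBatchSOA v2, float* out)
{
    __m128 xx = _mm_mul_ps(_mm_loadu_ps(v1.x), _mm_loadu_ps(v2.x));
    __m128 yy = _mm_mul_ps(_mm_loadu_ps(v1.y), _mm_loadu_ps(v2.y));
    __m128 zz = _mm_mul_ps(_mm_loadu_ps(v1.z), _mm_loadu_ps(v2.z));
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(xx, yy), zz));
}
```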

Masking / blending
Use a mask to avoid clobbering components of registers used by the other branch.

No masking in SSE.

Use variable blend in SSE4:

blend(a, b, mask)
{
    return (a & mask) | (b & ~mask);
}

No need to blend each instruction


Blend at basic block boundaries (at phi nodes in SSA).
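A small sketch of that blend in portable SSE2 bitwise ops (the helper name is illustrative): lane-wise `(a & mask) | (b & ~mask)`. SSE4.1's `_mm_blendv_ps` collapses this into a single instruction, selecting by each mask lane's sign bit.

```c
#include <assert.h>
#include <immintrin.h>

/* Per-lane select: a where the mask is all-ones, b where it is zero. */
static __m128 blend_ps(__m128 a, __m128 b, __m128 mask)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}
```

In SSA form, one such blend per phi node merges the two branches' values, rather than masking every instruction.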

Vectorization: recent work


ispc: Intel SPMD program compiler (Matt Pharr)
    Beyond Programmable Shading course, SIGGRAPH 2011
    Open source: ispc.github.com
Whole-function vectorization in AnySL (Karrenberg et al.)
    Code Generation and Optimization 2011

Film shading on GPUs


Previous work
    LightSpeed (Ragan-Kelley et al. SIGGRAPH 2007)
    RenderAnts (Zhou et al. SIGGRAPH Asia 2009)
Code generation is easier now (thanks to CUDA and OpenCL).
    PTX
    AMD IL
    LLVM and Clang

GPU code generation with LLVM


NVIDIA's LLVM-to-PTX code generator (Grover)
    Not to be confused with the PTX-to-LLVM front end (PLANG)
Incomplete PTX support in llvm-trunk (Chiou)
Google Summer of Code project (Holewinski)
Experimental PTX back end for AnySL (Rhodin)
LLVM to AMD IL (Villmow)

Issues: GPU code generation


Film shaders interoperate with the renderer.
    File I/O: textures, point clouds, etc. (out of core).
    Shader plugins (DSOs).
    Sampling, ray tracing.
Answer: multi-pass partitioning (Riffel et al. GH 2004)

Partitioning

Nf = faceforward(normalize(N), I);
Ci = Os * Cs *
     ( Ka*ambient() + Kd*diffuse(Nf) );

[Figure: dataflow graph of the shader, partitioned into nodes — normalize, faceforward, ambient, diffuse, scale, mult, add — with parameters Ka, Kd, Cs, Os feeding in and Ci as the output.]

Multi-pass partitioning for CPU


Synchronize for GPU calls, uniform operations, derivatives.
Does not require hardware threads or locks.
    A thread yields by returning (to a scheduler).
Intermediate data is stored in a cactus stack (Cilk) or continuation closures (CPS).

Data management and scheduling is a key problem (Budge et al. Eurographics 2009).

Issues: summary
CPU code generation (perhaps JIT)
Vectorization
GPU code generation
Multi-pass partitioning

Introduction to LLVM
Mid-level intermediate representation (IR)
High-level types: structs, arrays, vectors, functions.
Control-flow graph: basic blocks with branches
Many modular analysis and optimization passes.
Code generation for x86, x64, ARM, ...
Just-in-time (JIT) compiler too.

Example: from C to LLVM IR


float sqrt(float f) {
    return (f > 0.0f) ? sqrtf(f) : 0.0f;
}

define float @sqrt(float %f)
{
entry:
    %0 = fcmp ogt float %f, 0.0
    br i1 %0, label %bb1, label %bb2
    ...
}

Example: from C to LLVM IR


float sqrt(float f) {
    return (f > 0.0f) ? sqrtf(f) : 0.0f;
}

define float @sqrt(float %f)
{
entry:
    %0 = fcmp ogt float %f, 0.0
    br i1 %0, label %bb1, label %bb2
bb1:
    %1 = call float @sqrtf(float %f)
    ret float %1
bb2:
    ret float 0.0
}

Example: from C to LLVM IR


void foo(int x, int y)
{
    int z = x + y;
    ...
}

define void @foo(i32 %x, i32 %y)
{
    %z = alloca i32
    %1 = add i32 %y, %x
    store i32 %1, i32* %z
    ...
}

Writing a simple code generator


class BinaryExp : public Exp
{
    char m_operator;
    Exp* m_left;
    Exp* m_right;

    virtual llvm::Value*
    Codegen(llvm::IRBuilder* builder);
};

Writing a simple code generator


llvm::Value*
BinaryExp::Codegen(llvm::IRBuilder* builder)
{
    llvm::Value* L = m_left->Codegen(builder);
    llvm::Value* R = m_right->Codegen(builder);

    switch (m_operator) {
        case '+': return builder->CreateFAdd(L, R);
        case '-': return builder->CreateFSub(L, R);
        case '*': return builder->CreateFMul(L, R);
        ...
    }
}

Advantages of LLVM
Well designed intermediate representation (IR).
Wide range of optimizations (configurable).
JIT code generation.
Interoperability.

Interoperability
Shaders can call out to the renderer via the C ABI.
We can inline library code into compiled shaders.
    Compile C++ to LLVM IR with Clang.
This greatly simplifies code generation.
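A hypothetical sketch of that C-ABI call-out — the function name and signature below are illustrative, not RenderMan's actual interface; a stub stands in for the renderer so the example is self-contained.

```c
#include <assert.h>

/* Normally provided by the renderer and resolved by the JIT at link
 * time; this stub stands in for the real implementation. */
float RendererTexture1D(const char* name, float s)
{
    (void)name;             /* a real renderer would sample the texture */
    return 2.0f * s;
}

/* What a compiled shader's call-out might lower to: a plain C call. */
float shade_point(float s)
{
    return 0.5f * RendererTexture1D("ramp.tx", s);
}
```

Because the boundary is a plain C call, the same mechanism serves plugins, file I/O, and inlined library code compiled to IR with Clang.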

Weaknesses of LLVM
No automatic vectorization.
Poor support for vector-oriented code generation.
No predication.
Few vector instructions, must resort to SSE/AVX intrinsics.

LLVM resources
www.llvm.org/docs
Language Reference Manual
Getting Started Guide
LLVM Tutorial (section 3)
Relevant open source projects
ispc.github.com
github.com/MarkLeone/PostHaste

Questions?
Mark Leone
mleone@wetafx.co.nz
