NVIDIA CUDA Computational Finance Geeks3D

Computational Finance in CUDA
Options Pricing with Black-Scholes and Monte Carlo

Overview
CUDA is ideal for finance computations

Massive data parallelism in finance
Highly independent computations
High computational intensity (ratio of compute to I/O)
All of this results in high scalability
We'll cover European options pricing in CUDA with two methods:
Black-Scholes
Monte Carlo simulation
© NVIDIA Corporation 2008 2

Black-Scholes pricing for European options using CUDA
Overview
This presentation will show how to implement an option pricer for European put and call options using CUDA:
•Generate random input data on host
•Transfer data to GPU
•Compute prices on GPU
•Transfer prices back to host
•Compute prices on CPU
•Check the results
Simple problem to map: each option price is computed independently.

European options
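$$V_{\text{call}} = S \cdot \mathrm{CND}(d_1) - X e^{-rT} \cdot \mathrm{CND}(d_2)$$
$$V_{\text{put}} = X e^{-rT} \cdot \mathrm{CND}(-d_2) - S \cdot \mathrm{CND}(-d_1)$$
$$d_1 = \frac{\log(S/X) + \left(r + \frac{v^2}{2}\right) T}{v \sqrt{T}}$$
$$d_2 = \frac{\log(S/X) + \left(r - \frac{v^2}{2}\right) T}{v \sqrt{T}}$$
$$\mathrm{CND}(-d) = 1 - \mathrm{CND}(d)$$
S is the current stock price, X is the strike price, T is the time to expiration, CND is the Cumulative Normal Distribution function, r is the risk-free interest rate, and v is the volatility.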
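The formulas above can be checked on the host before any GPU code is written. The following is a minimal C++ sketch, not the deck's implementation: the names `cndReference`, `OptionPrices`, and `blackScholes` are hypothetical, and the CND is taken from the standard library's `std::erfc` in double precision rather than from the polynomial approximation the slides use.

```cpp
#include <cmath>

// Cumulative normal distribution via the standard library's erfc.
// (Assumption: used here as a reference; the slides use a polynomial.)
double cndReference(double d) {
    return 0.5 * std::erfc(-d / std::sqrt(2.0));
}

struct OptionPrices {
    double call;
    double put;
};

// S: stock price, X: strike, T: time to expiration,
// r: risk-free rate, v: volatility.
OptionPrices blackScholes(double S, double X, double T, double r, double v) {
    const double sqrtT = std::sqrt(T);
    const double d1 = (std::log(S / X) + (r + 0.5 * v * v) * T) / (v * sqrtT);
    const double d2 = d1 - v * sqrtT;  // same as the d2 formula above
    const double discountedStrike = X * std::exp(-r * T);
    OptionPrices p;
    p.call = S * cndReference(d1) - discountedStrike * cndReference(d2);
    p.put = discountedStrike * cndReference(-d2) - S * cndReference(-d1);
    return p;
}
```

A useful sanity check for any implementation of these formulas is put-call parity: the call price minus the put price should equal S minus the discounted strike X·e^(−rT).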

Cumulative Normal Distribution Function
$$N(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2} \, du$$
Computed with a polynomial approximation (see Hull): six-decimal-place accuracy with a 5th-degree polynomial
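/* Coefficients of the 5th-degree polynomial approximation (see Hull) */
#define A1 0.31938153f
#define A2 (-0.356563782f)
#define A3 1.781477937f
#define A4 (-1.821255978f)
#define A5 1.330274429f
#define RSQRT2PI 0.39894228040143267794f  /* 1 / sqrt(2 * pi) */

__host__ __device__ float CND(float d)
{
    float K = 1.0f / (1.0f + 0.2316419f * fabsf(d));
    /* Polynomial value for -|d|; local renamed so it does not shadow CND() */
    float cnd = RSQRT2PI * expf(-0.5f * d * d) *
        (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));
    if (d > 0)
        cnd = 1.0f - cnd;
    return cnd;
}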
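•Make your code float-safe: use the f suffix on constants and the single-precision functions (fabsf, expf) so that nothing is silently promoted to double.
•The compiler will generate two functions: one for the host, one for the device.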
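Because the routine above is compiled for both host and device, its accuracy can be examined on the host. The sketch below is an illustration under assumptions, not part of the original sample: `cndPolynomial` restates the routine in plain C++ with the textbook coefficients, and `maxCndError` (a hypothetical helper) measures the largest deviation from a double-precision reference built on `std::erfc`.

```cpp
#include <cmath>

// Single-precision polynomial CND, mirroring the routine above.
float cndPolynomial(float d) {
    const float A1 = 0.31938153f, A2 = -0.356563782f, A3 = 1.781477937f,
                A4 = -1.821255978f, A5 = 1.330274429f;
    const float RSQRT2PI = 0.39894228040143267794f;  // 1 / sqrt(2 * pi)
    const float K = 1.0f / (1.0f + 0.2316419f * std::fabs(d));
    const float cnd = RSQRT2PI * std::exp(-0.5f * d * d) *
                      (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));
    return d > 0 ? 1.0f - cnd : cnd;
}

// Largest absolute deviation from an erfc-based double-precision reference
// over a uniform grid on [-range, range].
double maxCndError(double range, int steps) {
    double worst = 0.0;
    for (int i = 0; i <= steps; ++i) {
        const double d = -range + 2.0 * range * i / steps;
        const double reference = 0.5 * std::erfc(-d / std::sqrt(2.0));
        const double err =
            std::fabs(cndPolynomial(static_cast<float>(d)) - reference);
        if (err > worst) worst = err;
    }
    return worst;
}
```

Calling `maxCndError(6.0, 12000)` scans the range where the CND differs noticeably from 0 and 1. The measured error combines the approximation error of the polynomial with single-precision rounding, so it may come out slightly larger than the approximation error alone.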
Implementation steps
The following steps need to be performed:
1. Allocate arrays on host: hOptPrice(N), hOptStrike(N),…
2. Allocate arrays on device: dOptPrice(N), dOptStrike(N),…
3. Initialize input arrays
4. Transfer arrays from host memory to the corresponding arrays in device memory
5. Compute option prices on GPU with fixed configuration
6. Transfer results from the GPU back to the host
7. Compute option prices on CPU
8. Compare results
9. Clean up memory

Code walk-through (steps 1-3)
/* Allocate arrays on the host */
float *hOptPrice, *hOptStrike, *hOptYear;
hOptPrice  = (float *) malloc(sizeof(float) * N);
hOptStrike = (float *) malloc(sizeof(float) * N);
hOptYear   = (float *) malloc(sizeof(float) * N);
/* Allocate arrays on the GPU with cudaMalloc */
float *dOptPrice, *dOptStrike, *dOptYear;
cudaMalloc( (void **) &dOptPrice, sizeof(float)*N);
cudaMalloc( (void **) &dOptStrike, sizeof(float)*N);
cudaMalloc( (void **) &dOptYear , sizeof(float)*N);
………………
/* Initialize hOptPrice, hOptStrike, hOptYear on the host */
……………

Code walk-through (steps 4-5)
/* Transfer data from host to device with
   cudaMemcpy(target, source, size, direction) */
cudaMemcpy(dOptPrice, hOptPrice, sizeof(float)*N,
           cudaMemcpyHostToDevice);
cudaMemcpy(dOptStrike, hOptStrike, sizeof(float)*N,
           cudaMemcpyHostToDevice);
cudaMemcpy(dOptYear, hOptYear, sizeof(float)*N,
           cudaMemcpyHostToDevice);
/* Compute option prices on GPU with fixed configuration
   <<<Nblocks, Nthreads>>> */
BlackScholesGPU<<<128, 256>>>(
    dCallResult, dPutResult,
    dOptStrike, dOptPrice, dOptYear,
    RISKFREE, VOLATILITY, N);

Code walk-through (steps 6-9)
/*Transfer data from device to host with
cudaMemcpy(target, source, size, direction)*/
cudaMemcpy (hCallResult , dCallResult , sizeof(float)*N,
cudaMemcpyDeviceToHost);
cudaMemcpy (hPutResult , dPutResult , sizeof(float)*N,
cudaMemcpyDeviceToHost);
/* Compute option prices on the CPU */

BlackScholesCPU(…..);
/* Compare results */
…….
/* Clean up memory on host and device*/
free( hOptPrice);
………..
cudaFree(dOptPrice);
…….

BlackScholesGPU
How to deal with a generic number of options OptN?
Maximum number of blocks per grid dimension = 65535, maximum number of threads per block = 512
(with one thread per option, this would limit OptN to about 33M)
Solution: each thread processes multiple options.
__global__ void BlackScholesGPU(float *…., int OptN)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int THREAD_N = blockDim.x * gridDim.x;
    /* Each thread handles options tid, tid + THREAD_N, tid + 2*THREAD_N, ... */
    for(int opt = tid; opt < OptN; opt += THREAD_N)
        BlackScholesBody(
            d_CallResult[opt], d_PutResult[opt], d_OptionPrice[opt],
            d_OptionStrike[opt], d_OptionYears[opt], Riskfree, Volatility);
}
[Figure: the array of OptN options; thread tid handles elements tid, tid + THREAD_N, ..., where THREAD_N = blockDim.x * gridDim.x]
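The access pattern of the kernel above can be emulated on the host to confirm that a fixed launch configuration covers any number of options, each exactly once. This is a small C++ sketch; `visitCounts` is a hypothetical helper, and the two outer loops stand in for the blocks and threads that the GPU runs in parallel.

```cpp
#include <vector>

// Emulate the kernel's strided loop on the host: every (block, thread) pair
// starts at its global index and advances by the total thread count.
// Returns how many times each option index was visited.
std::vector<int> visitCounts(int gridDim, int blockDim, int optN) {
    std::vector<int> visits(optN, 0);
    const int threadN = blockDim * gridDim;  // THREAD_N in the kernel
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            const int tid = blockDim * blockIdx + threadIdx;
            for (int opt = tid; opt < optN; opt += threadN) {
                ++visits[opt];
            }
        }
    }
    return visits;
}
```

With the configuration used in the walk-through (128 blocks of 256 threads), `visitCounts(128, 256, 1000000)` reports one visit per option. The same holds when there are fewer options than threads, in which case the surplus threads simply skip the loop body.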
Compile and run
Compile the example BlackScholes.cu:
nvcc -O3 -o BlackScholes BlackScholes.cu \
-I../../common/inc/ -L ../../lib/ -lcutil -lGL -lglut
(the paths to the libraries and include files may be different on your system)
Run the example: BlackScholes

European Options with Black-Scholes formula ( 1000000 options)
Copying input data to GPU mem. Transfer time: 10.496000 msecs.
Executing GPU kernel... GPU time: 0.893000 msecs.
Reading back GPU results... Transfer time: 15.471000 msecs.
Checking the results...

...running CPU calculations. CPU time: 454.683990 msecs.
Comparing the results...

L1 norm: 5.984729E-08 Max absolute error: 1.525879E-05
Improve PCI-e transfer rate
PCI-e transfers from regular (pageable) host memory have a bandwidth of ~1-1.5 GB/s (depending on the host CPU and chipset).
Using page-locked memory allocation, the transfer speed can reach over 3 GB/s (4 GB/s on particular NVIDIA chipsets with “LinkBoost”).
CUDA has special memory allocation functions for this purpose.

How to use pinned memory
Replace malloc with cudaMallocHost for the host arrays
Replace free with cudaFreeHost
/* Memory allocation (instead of regular malloc) */
cudaMallocHost((void **) &h_CallResultGPU, OPT_SZ);
/* Memory clean-up (instead of regular free) */
cudaFreeHost(h_CallResultGPU);

Compile and run
Compile the example BlackScholesPinned.cu:
nvcc -O3 -o BlackScholesPinned BlackScholesPinned.cu \

Run the example: BlackScholesPinned

European Options with Black-Scholes formula ( 1000000 options)
Copying input data to GPU mem. Transfer time: 3.714000 msecs. (was 10.4)
Executing GPU kernel... GPU time: 0.893000 msecs.
Reading back GPU results... Transfer time: 2.607000 msecs. (was 15.4)

...running CPU calculations. CPU time: 454.683990 msecs.
Comparing the results...

L1 norm: 5.984729E-08 Max absolute error: 1.525879E-05
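The reported transfer times can be converted into effective bandwidth to compare them with the figures quoted for pageable and page-locked memory. The sketch below is plain arithmetic in C++ with hypothetical helper names. It assumes, based on the code walk-through, that the upload consists of three float arrays (price, strike, years), and it uses 1 GB = 10^9 bytes.

```cpp
// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a transfer of
// `bytes` bytes that took `milliseconds` ms.
double bandwidthGBs(double bytes, double milliseconds) {
    return bytes / (milliseconds * 1e-3) / 1e9;
}

// Bytes moved when `arrayCount` float arrays of `optionCount` elements
// are copied (assumption: 3 arrays for the upload in this example).
double transferBytes(int optionCount, int arrayCount) {
    return static_cast<double>(optionCount) * arrayCount * sizeof(float);
}
```

Under that assumption, the upload of 1,000,000 options works out to about 1.14 GB/s with regular memory (10.496 ms) and about 3.23 GB/s with pinned memory (3.714 ms), which is consistent with the ranges stated on the earlier slide.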
Monte Carlo simulation for European options using CUDA
Overview
This presentation will show how to implement a Monte Carlo simulation for European call options using CUDA.
Monte Carlo simulations are well suited to parallelization:
•High regularity and locality
•Very high compute to I/O ratio
•Very good scalability
Vanilla Monte Carlo implementation (no variance-reduction techniques such as antithetic variates or control variates)

Monte Carlo approach
The MC simulation for option pricing can be described as:
i. Simulate sample paths for the underlying asset price
ii. Compute the corresponding option payoff for each sample path
iii. Average the simulation payoffs and discount the average value to yield the price of the option
% Monte Carlo valuation for a European call in MATLAB
% An Introduction to Financial Option Valuation: Mathematics, Stochastics and Computation, D. Higham
S = 2; E = 1; r = 0.05; sigma = 0.25; T = 3; M = 1e6;

Svals = S*exp((r-0.5*sigma^2)*T + sigma*sqrt(T)*randn(M,1));
Pvals = exp(-r*T)*max(Svals-E,0);
Pmean = mean(Pvals)
width = 1.96*std(Pvals)/sqrt(M);
conf = [Pmean - width, Pmean + width]
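For readers without MATLAB, the listing above can be transcribed almost line by line into C++. This is a sketch under assumptions rather than the code used in the CUDA sample: `monteCarloCall` and `McEstimate` are hypothetical names, and `std::mt19937` with `std::normal_distribution` stands in for MATLAB's `randn`, so the exact numbers depend on the standard library in use.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

struct McEstimate {
    double mean;   // Pmean: discounted average payoff
    double width;  // half-width of the 95% confidence interval
};

// S: spot, E: strike, r: rate, sigma: volatility, T: expiry, M: paths.
McEstimate monteCarloCall(double S, double E, double r, double sigma,
                          double T, int M, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);
    std::vector<double> pvals(M);  // Pvals in the MATLAB listing
    double sum = 0.0;
    for (double& p : pvals) {
        const double sT = S * std::exp((r - 0.5 * sigma * sigma) * T +
                                       sigma * std::sqrt(T) * randn(gen));
        p = std::exp(-r * T) * std::max(sT - E, 0.0);
        sum += p;
    }
    const double mean = sum / M;
    double squares = 0.0;
    for (double p : pvals) squares += (p - mean) * (p - mean);
    // Sample standard deviation, normalised by M - 1 like MATLAB's std.
    const double stdDev = std::sqrt(squares / (M - 1));
    McEstimate result;
    result.mean = mean;
    result.width = 1.96 * stdDev / std::sqrt(static_cast<double>(M));
    return result;
}
```

Called with the parameters from the listing (S = 2, E = 1, r = 0.05, sigma = 0.25, T = 3), the estimate should land close to the Black-Scholes value for the same option, with the confidence interval shrinking in proportion to 1/sqrt(M).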
Monte Carlo example
i. Generate M random numbers: M=200,000,000
   a. Uniform distribution via Mersenne Twister (MT)
   b. Box-Muller transformation to generate a Gaussian distribution
ii. Compute log-normal distributions for N options: N=128
iii. Compute the sum and the sum of the squares for each option to recover the mean and variance
iv. Average the simulation payoffs and discount the average values to yield the prices of the options.
NB: This example can be easily extended to run on multiple GPUs, using proper initial seeds for MT. See the “MonteCarloMultiGPU” sample in the CUDA SDK v1.1
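Steps iii and iv rely on the fact that the mean and variance of the payoffs can be recovered from just two running totals, so no individual payoff has to be stored. The C++ sketch below illustrates that bookkeeping under assumptions: `estimateFromSums` and `OptionEstimate` are hypothetical names, the sums are taken over undiscounted payoffs, and the 1.96 factor gives a 95% confidence half-width as in the MATLAB example.

```cpp
#include <cmath>

struct OptionEstimate {
    double price;       // discounted mean payoff
    double confidence;  // half-width of the 95% confidence interval
};

// Recover the price and a confidence half-width from the accumulated sum
// and sum of squares of the (undiscounted) payoffs over pathN paths.
OptionEstimate estimateFromSums(double sum, double sum2, int pathN,
                                double r, double T) {
    const double n = static_cast<double>(pathN);
    const double mean = sum / n;
    // Unbiased sample variance: (n * sum2 - sum^2) / (n * (n - 1)).
    const double variance = (n * sum2 - sum * sum) / (n * (n - 1.0));
    const double discount = std::exp(-r * T);
    OptionEstimate e;
    e.price = discount * mean;
    e.confidence = discount * 1.96 * std::sqrt(variance) / std::sqrt(n);
    return e;
}
```

The subtraction `n * sum2 - sum * sum` cancels heavily when the variance is small relative to the mean, which is one reason to accumulate the totals in double precision on the host.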

Generate Uniformly Distributed Random Numbers
The Random Number Generator (RNG) used in this example is a parallel version of the Mersenne Twister by Matsumoto and Nishimura, known as Dynamic Creator (DCMT):
• It is fast
• It has good statistical properties
• It generates many independent Mersenne Twisters
The initial parameters are computed off-line and stored in a file.
RandomGPU<<<32,128>>>( d_Random, N_PER_RNG, seed);
32 blocks * 128 threads: 4096 independent random streams
On Tesla C870 (single GPU):
200 million samples in 80 milliseconds
2.5 billion samples per second

Generating Gaussian Normal Distribution
Use the Box-Muller transformation to generate a Gaussian normal distribution from uniformly distributed random values
BoxMullerGPU<<<32,128>>>( d_Random, N_PER_RNG, seed);
On Tesla C870 (single GPU):
200 million samples in 120 milliseconds
#define PI 3.14159265358979323846264338327950288f
/* u1 must be greater than 0; on return u1 and u2 hold two normal samples */
__device__ void BoxMuller(float& u1, float& u2){
    float r = sqrtf(-2.0f * logf(u1));
    float phi = 2 * PI * u2;
    u1 = r * cosf(phi);
    u2 = r * sinf(phi);
}
•Possible alternative: the Beasley-Springer-Moro algorithm for approximating the inverse of the cumulative normal distribution
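The transformation can be tried on the host to see that pairs of uniform samples do turn into samples with mean near 0 and variance near 1. The following C++ sketch makes several assumptions: `gaussianSamples` is a hypothetical driver, `std::mt19937` replaces the DCMT generator used on the GPU, and the first uniform is shifted into (0, 1] so that the logarithm stays finite.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Host version of the Box-Muller step: maps two uniform samples
// (u1 > 0) to two independent standard normal samples, in place.
void boxMuller(float& u1, float& u2) {
    const float PI = 3.14159265358979f;
    const float r = std::sqrt(-2.0f * std::log(u1));
    const float phi = 2.0f * PI * u2;
    u1 = r * std::cos(phi);
    u2 = r * std::sin(phi);
}

// Draw `pairs` pairs of uniforms from a Mersenne Twister and transform them.
std::vector<float> gaussianSamples(int pairs, unsigned seed) {
    std::mt19937 gen(seed);
    std::vector<float> out;
    out.reserve(2 * static_cast<std::size_t>(pairs));
    for (int i = 0; i < pairs; ++i) {
        // (k + 1) / 2^32 lies in (0, 1], so log(u1) never sees zero.
        float u1 = static_cast<float>((gen() + 1.0) / 4294967296.0);
        float u2 = static_cast<float>(gen() / 4294967296.0);
        boxMuller(u1, u2);
        out.push_back(u1);
        out.push_back(u2);
    }
    return out;
}
```

Each call to `boxMuller` consumes two uniforms and produces two normals, which is why the GPU version can overwrite the uniform buffer in place without extra storage.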
Log-normal distributions and partial sums
void MonteCarloGPU(d_Random,….)
{
    // Break the sums in 64*256 (16384) partial sums
    MonteCarloKernelGPU<<<64, 256, 0>>>(d_Random);
    // Read back the partial sums to the host
    cudaMemcpy(h_Sum, d_Sum, ACCUM_SZ, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_Sum2, d_Sum2, ACCUM_SZ, cudaMemcpyDeviceToHost);
    // Compute sum and sum of squares on host
    double dblSum = 0, dblSum2 = 0;
    for(int i = 0; i < ACCUM_N; i++){
        dblSum += h_Sum[i];
        dblSum2 += h_Sum2[i];
    }
}

Log-normal distributions and partial sums
__global__ void MonteCarloKernelGPU(…)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int threadN = blockDim.x * gridDim.x;
    //...
    for(int iAccum = tid; iAccum < accumN; iAccum += threadN) {
        float sum = 0, sum2 = 0;
        for(int iPath = iAccum; iPath < pathN; iPath += accumN) {
            float r = d_Random[iPath];
            //...
            sum += endOptionPrice;
            sum2 += endOptionPrice * endOptionPrice;
        }
        d_Sum[iAccum] = sum;
        d_Sum2[iAccum] = sum2;
    }
}
Accurate Floating-Point Summation
The standard way of summing a sequence of N numbers, a_i, is the recursive formula:
$$S_0 = 0, \qquad S_i = S_{i-1} + a_i, \qquad S = S_N$$
When using floating-point arithmetic, an error analysis (Wilkinson, 1963) shows that the accumulated round-off error can grow as fast as $N^2$.
By forming more than one intermediate sum, the accumulated round-off error can be significantly reduced.
This is exactly how parallel summation works!
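The effect is easy to reproduce on the host. The C++ sketch below compares a single running sum in single precision with strided partial sums laid out like the accumulators in the kernel shown earlier, combined in double precision as the host code does. The function names are hypothetical, and the input used in the note that follows (many copies of 0.1f) is chosen only because it makes the round-off visible.

```cpp
#include <cmath>
#include <vector>

// Recursive (running) summation in single precision.
float naiveSum(const std::vector<float>& a) {
    float s = 0.0f;
    for (float x : a) s += x;
    return s;
}

// Strided partial sums: accumulator k adds elements k, k + accumN,
// k + 2*accumN, ... in single precision; the partial sums are then
// combined in double precision.
double partialSums(const std::vector<float>& a, int accumN) {
    std::vector<float> partial(accumN, 0.0f);
    for (std::size_t i = 0; i < a.size(); ++i) partial[i % accumN] += a[i];
    double total = 0.0;
    for (float p : partial) total += p;
    return total;
}

// Absolute error of a computed sum against a double-precision reference.
double absError(double computed, const std::vector<float>& a) {
    double reference = 0.0;
    for (float x : a) reference += x;
    return std::fabs(computed - reference);
}
```

For a vector of 2^24 copies of 0.1f, the running sum grows large enough that each further addition is rounded to a coarse multiple, and the error accumulates. The 16384 partial sums each stay small, so their individual round-off remains far below that of the single running sum.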

Compile and run
Compile the example Montecarlo.cu:
nvcc -O3 -o Montecarlo Montecarlo.cu \

Run the example: Montecarlo

MonteCarlo simulation for European call options ( 200000000 paths)
Generate Random Numbers on GPU 80.51499 msecs.
Box Muller on GPU 122.62799 msecs.
Average time for option. 15.471000 msecs.

MC: 6.621926; BS: 6.621852
Abs: 7.343292e-05; Rel: 1.108948e-05;
MC: 8.976083; BS: 8.976007

Abs: 7.534027e-05; Rel: 8.393517e-06;
Optimizing for smaller problems
This Monte Carlo implementation is optimized for gigantic problems
e.g. 16M paths on 256 underlying options
Most real simulations are much smaller
e.g. 256K paths on 64 underlying options
Before optimization, take a detailed look at performance
Comparing performance for different problem sizes can provide a lot of insight

Monte Carlo Samples Per Second
[Figure: samples per second (log scale, 1.0E+07 to 1.0E+11) versus # paths, one curve each for 8, 16, 32, 64, 128, and 256 underlying options. The top of the range is annotated “Excellent!” and the lower range “Poor!”.]
This graph should be a horizontal line!

Monte Carlo Options Per Second
[Figure: options per second (log scale, 100 to 10000) versus # paths, one curve each for 8, 16, 32, 64, 128, and 256 underlying options, with regions annotated “Poor!” and “Excellent!”.]
This graph should be a straight diagonal line!

Inefficiencies
Looking at the code, there were some inefficiencies:
Final sum reduction on CPU rather than GPU
Loop over options on CPU rather than GPU
Multiple thread blocks launched per option
Not evident for large problems because they are completely computation bound
NB: In further comparisons, we choose 64 options as our optimization case

Move the reduction onto the GPU
Final summation on the GPU using parallel reduction is a significant speedup.
Read back a single sum and sum of squares for each thread block
[Figures: Monte Carlo Options Per Second for 64 Options and Monte Carlo Samples Per Second for 64 Options, both versus # paths on log scales, comparing “Original” with “Reduce on GPU”.]
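The slides do not show the reduction kernel itself, so the following is only an illustration of the general idea: a host-side C++ emulation of a common tree-based scheme for reducing one thread block's partial sums to a single value. The name `blockReduce` is hypothetical, the input size is assumed to be a power of two, and the actual SDK kernel may be organised differently.

```cpp
#include <vector>

// Host emulation of a tree-based parallel reduction over one thread
// block's worth of values (size assumed to be a power of two). Each pass
// halves the number of active "threads"; on the GPU the inner loop would
// run in parallel, with a barrier between passes.
float blockReduce(std::vector<float> values) {
    for (std::size_t stride = values.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            values[tid] += values[tid + stride];
        }
    }
    return values.empty() ? 0.0f : values[0];
}
```

With 256 values the loop finishes in 8 passes instead of 255 sequential additions, and only one sum (plus one sum of squares, reduced the same way) per block has to be copied back to the host.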

All options in a single kernel launch
Initial code looped on host, invoking the kernel once per option
1D grid, multiple thread blocks per option
Rather than looping, just launch a 2D grid
One row of thread blocks per option

[Figures: Monte Carlo Options Per Second and Samples Per Second for 64 Options versus # paths on log scales, comparing “Original”, “Reduce on GPU”, and “Combine Options into a Single Kernel Launch”.]

One Thread Block Per Option
Pricing an option using multiple blocks requires multiple kernel launches
First kernel produces partial sums
Second kernel performs sum reduction to get final values
For very small problems, kernel launch overhead dominates cost
And cudaMemcpy dominates if we reduce on CPU
Solution: for small # paths, use a single thread block per option with a new kernel
Summation for entire option is computed in this kernel

How small is small enough?
We still want to use the old, two-kernel method for large # paths
How do we know when to switch?
Empirically, we determined the criterion:
bool multiBlock = ((numPaths / numOptions) >= 8192);
If multiBlock is false, we run the single-block kernel
Otherwise, run the multi-block kernel followed by the reduction kernel
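The switching rule can be wrapped in a small function and tried against the two problem sizes quoted earlier in the deck. This is a minimal C++ sketch: `useMultiBlock` is a hypothetical name, and the threshold of 8192 is the empirical value from the slide, which may need re-tuning on other hardware.

```cpp
// Kernel-selection rule from the slide: take the two-kernel (multi-block)
// path only when numPaths / numOptions reaches the empirical threshold.
bool useMultiBlock(int numPaths, int numOptions) {
    return (numPaths / numOptions) >= 8192;
}
```

For 256K paths on 64 options the ratio is 4096, so the single-block kernel is chosen. For 16M paths on 256 options the ratio is 65536, so the multi-block kernel followed by the reduction kernel is used.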

[Figures: Monte Carlo Options Per Second and Samples Per Second for 64 Options versus # paths on log scales, comparing “Original”, “Reduce on GPU”, “Combine Options into a Single Kernel Launch”, and “One Thread Block Per Option”.]
These lines are much straighter!
Good performance for small, medium, and large problems!
Details
For details of these optimizations see the “MonteCarlo” sample code in the CUDA SDK 1.1
Latest version was optimized based on these experiments
Included white paper provides further discussion
Also see the “MonteCarloMultiGPU” example to see how to distribute the problem across multiple GPUs in a system

Conclusion
CUDA is well-suited to computational finance
Important to tune code for your specific problem
Study relative performance for different problem sizes
Understand the source of bottlenecks for different problem sizes
Optimizations may differ depending on your needs
