0% found this document useful (0 votes)
73 views61 pages

CS 179: GPU Computing: Lecture 16: Simulations and Randomness

This document discusses simulations and randomness, focusing on Monte Carlo methods and parallel random number generation. It introduces Monte Carlo simulations as a way to solve problems that are hard to solve directly, like estimating probabilities. As an example, it shows how Monte Carlo can estimate pi by randomly generating points and calculating the fraction that fall within a circle. It then discusses how the general Monte Carlo method works and how its trials can potentially be parallelized. Finally, it covers challenges with parallel random number generation and introduces an approach using multiple generator sequences.

Uploaded by

Rajul
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
73 views61 pages

CS 179: GPU Computing: Lecture 16: Simulations and Randomness

This document discusses simulations and randomness, focusing on Monte Carlo methods and parallel random number generation. It introduces Monte Carlo simulations as a way to solve problems that are hard to solve directly, like estimating probabilities. As an example, it shows how Monte Carlo can estimate pi by randomly generating points and calculating the fraction that fall within a circle. It then discusses how the general Monte Carlo method works and how its trials can potentially be parallelized. Finally, it covers challenges with parallel random number generation and introduces an approach using multiple generator sequences.

Uploaded by

Rajul
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 61

CS 179: GPU Computing

Lecture 16: Simulations and


Randomness
Simulations

Exa Corporation, http://www.exa.com/images/f16.png


South Bay Simulations,
http://www.panix.com/~brosen/graphics/iacc.400.jpg

Max-Planck Institut, http://www.mpa-


Flysurfer Kiteboarding, http://www.flysurfer.com/wp- garching.mpg.de/gadget/hydrosims/
content/blogs.dir/3/files/gallery/research-and-development/zwischenablage07.jpg
Simulations
• But what if your problem is hard to solve? e.g.
– EM radiation attenuation
– Estimating complex probability distributions
– Complicated ODEs, PDEs
• (e.g. option pricing in last lecture)
– Geometric problems w/o closed-form solutions
• Volume of complicated shapes
Simulations
• Potential solution: Monte Carlo methods
– Run simulation with randomly chosen inputs
• (Possibly according to some distribution)
– Do it again… and again… and again…
– Aggregate results
Monte Carlo example
• Estimating the value of π
Monte Carlo example
• Estimating the value of π
– Quarter-circle of radius r:
• Area = (πr2)/4
– Enclosing square:
• Area = r2
– Fraction of area: π/4

"Pi 30K" by CaitlinJo - Own workThis mathematical image was created with
Mathematica. Licensed under CC BY 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Pi_30K.gif#/media/File:Pi_30K.gif
Monte Carlo example
• Estimating the value of π
– Quarter-circle of radius r:
• Area = (πr2)/4
– Enclosing square:
• Area = r2
– Fraction of area: π/4 ≈ 0.79

• “Solution”: Randomly generate lots of points,


calculate fraction within circle "Pi 30K" by CaitlinJo - Own workThis
mathematical image was created with
Mathematica. Licensed under CC BY 3.0 via

– Answer should be pretty close! Wikimedia Commons -


http://commons.wikimedia.org/wiki/File:Pi_30K
.gif#/media/File:Pi_30K.gif
Monte Carlo example
• Pseudocode:
(simulate on N points)
(assume r = 1)

points_in_circle = 0
for i = 0,…,N-1:
randomly pick point (x,y) from
uniform distribution in [0,1]2
if (x,y) is in circle:
points_in_circle++

return (points_in_circle / N) * 4

"Pi 30K" by CaitlinJo - Own workThis mathematical image was created with
Mathematica. Licensed under CC BY 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Pi_30K.gif#/media/File:Pi_30K.gif
Monte Carlo example
• Pseudocode:
(simulate on N points)
(assume r = 1)

points_in_circle = 0
for i = 0,…,N-1:
randomly pick point (x,y) from
uniform distribution in [0,1]2
if x^2 + y^2 < 1:
points_in_circle++

return (points_in_circle / N) * 4

"Pi 30K" by CaitlinJo - Own workThis mathematical image was created with
Mathematica. Licensed under CC BY 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Pi_30K.gif#/media/File:Pi_30K.gif
Monte Carlo simulations

Planetary Materials Microanalysis Facility, , Northern Arizona University,


http://www4.nau.edu/microanalysis/microprobe-
sem/Images/Monte_Carlo.jpg

Center for Air Pollution Impact & Trend Analysis, Washington University in St.
Louis, http://www4.nau.edu/microanalysis/microprobe-
sem/Images/Monte_Carlo.jpg

http://www.cancernetwork.com/sites/default/files/cn_import/n0011bf1.jpg
General Monte Carlo method
• Pseudocode:
for (number of trials):
randomly pick value from a probability distribution
perform deterministic computation on inputs

(aggregate results)
General Monte Carlo method
• Why it works:
– Law of large numbers!
General Monte Carlo method
• Pseudocode:
for (number of trials):
randomly pick value from a probability distribution
perform deterministic computation on inputs

(aggregate results)

• Can we parallelize this?


General Monte Carlo method
• Pseudocode:
for (number of trials):
randomly pick value from a probability distribution
perform deterministic computation on inputs
Trials are
(aggregate results) independent

• Can we parallelize this?


General Monte Carlo method
• Pseudocode:
for (number of trials):
randomly pick value from a probability distribution
perform deterministic computation on inputs
Trials are
(aggregate results) Usually so independent
(e.g. with reduction)

• Can we parallelize this?


General Monte Carlo method
• Pseudocode:
for (number of trials):
randomly pick value from a probability distribution What about this?
perform deterministic computation on inputs
Trials are
(aggregate results) Usually so independent
(e.g. with reduction)

• Can we parallelize this?


Parallelized Random Number
Generation
Early Credits
• Algorithm and presentation based on:

– “Parallel Random Numbers: As Easy as 1, 2, 3”


• (Salmon, Moraes, Dror, Shaw) at D. E. Shaw Research
• Developed for biomolecular simulations on Anton
(massively parallel ASIC-based supercomputer)
• Also applicable to CPUs, GPUs
Random Number Generation
• Generating random data computationally
is hard
– Computers are deterministic!

https://cdn.tutsplus.com/vector/uploads/legacy/tuts/165_Shiny_Dice/27.jpg
Random Number Generation
• Two methods:
– Hardware random number generator
• aka TRNG (“True” RNG)
• Uses data collected from environment (thermal,
optical, etc)
• Very slow!
– Pseudorandom number generator (PRNG)
• Algorithm that produces “random-looking”
numbers
• Faster – limited by computational power
Demonstration
Random Number Generation
• PRNG algorithm should be:
– High-quality
• Produce “good” random data
– Fast
• (In its own right)
– Parallelizable!

• Can we do it?
– (Assume selection from uniform distribution)
A Very Basic PRNG
• “Linear congruential generator” (LCG)
– e.g. C’s rand() //from glibc

int32_t val = state[0];


val = ((state[0] * 1103515245) + 12345)
& 0x7fffffff;
state[0] = val;
*result = val;

– General formula:
𝑋𝑛+1 = 𝑎𝑋𝑛 + 𝑐 mod 𝑚

• X0 is the “seed” (e.g. system time)


A Very Basic PRNG
• “Linear congruential generator” (LCG)
– e.g. C’s rand() //from glibc

int32_t val = state[0];


val = ((state[0] * 1103515245) + 12345)
& 0x7fffffff;
state[0] = val;
*result = val;

– General formula:
𝑋𝑛+1 = 𝑎𝑋𝑛 + 𝑐 mod 𝑚
Non-parallelizable
recurrence relation!
Linear congruential generators
𝑋𝑛+1 = 𝑎𝑋𝑛 + 𝑐 mod 𝑚

• Not high quality!


– Clearly non-uniform

• Fast to compute
• Not parallelizable!

"Lcg 3d". Licensed under CC BY-SA 3.0 via Wikimedia Commons -


http://commons.wikimedia.org/wiki/File:Lcg_3d.gif#/media/File:Lcg_3d.gif
Measures of RNG quality
• Impossible to prove a sequence is “random”

• Possible tests:
– Frequency
– Periodicity - do the values repeat too early?
– Linear dependence
–…
PRNG Parallelizability
• Many PRNGs (like the LCG) have a
non-parallelizable appearance:

𝑋𝑛+1 = 𝑓(𝑋𝑛 )

– (Better chance of good data when):


• All 𝑋𝑖 in some large state space
• Complicated function f
PRNG Parallelizability
• Possible “approach” to GPU parallelization:
– Assign a PRNG to each thread!
• Initialize with e.g. different X0

• Thread 0 produces sequence 𝑋𝑛+1,0 = 𝑓(𝑋𝑛,0 )


• Thread 1 produces sequence 𝑋𝑛+1,1 = 𝑓(𝑋𝑛,1 )
•…
PRNG Parallelizability
• Possible “approach” to GPU parallelization:
– Assign a PRNG to each thread!
• Initialize with e.g. different X0

• Thread 0 produces sequence 𝑋𝑛+1,0 = 𝑓(𝑋𝑛,0 )


• Thread 1 produces sequence 𝑋𝑛+1,1 = 𝑓(𝑋𝑛,1 )
•…

– In practice, often cannot get high quality


• Repeated values, lack of good, enumerable parameters
PRNG Parallelizability
• Instead of:
𝑋𝑛+1 = 𝑓(𝑋𝑛 )

• Suppose we had:
𝑋𝑛+1 = 𝑏 𝑛

– This is parallelizable! (Without our previous “trick”)

• Is this possible?
More General PRNG
• “Keyed” PRNG given by:
– Transition function: 𝑓: 𝑆 → 𝑆
– Output function: 𝑔: 𝐾 × 𝑆 → 𝑈

• S: Internal (hidden) state space


• U: Output space
• K: “Key space”
– Can “seed” output behavior without relying on X0 alone –
useful for scientific reproducibility!
More General PRNG
• “Keyed” PRNG given by:
– Transition function: 𝑓: 𝑆 → 𝑆
– Output function: 𝑔: 𝐾 × 𝑆 → 𝑈
If S has J times more bits than
• S: Internal (hidden) state space U, can produce J outputs per
transition.
• U: Output space
Assume J = 1 in this lecture
• K: “Key space”
– Can “seed” output behavior without relying on X0 alone –
useful for scientific reproducibility!
More General PRNG
• “Keyed” PRNG given by:
– Transition function: 𝑓: 𝑆 → 𝑆
– Output function: 𝑔: 𝐾 × 𝑆 → 𝑈

– “Trivial” example: LCG 𝑋𝑛+1 = 𝑎𝑋𝑛 + 𝑐 mod 𝑚


• 𝑓 𝑋𝑛 = 𝑎𝑋𝑛 + 𝑐
• 𝑔 𝑋𝑛 = 𝑋𝑛

• S is (for example) the space of 32-bit integers


• U=S
• K is “trivial” (no keys used)
More General PRNG
• “Keyed” PRNG given by:
– Transition function: 𝑓: 𝑆 → 𝑆
– Output function: 𝑔: 𝐾 × 𝑆 → 𝑈

– “Trivial” example: LCG 𝑋𝑛+1 = 𝑎𝑋𝑛 + 𝑐 mod 𝑚


• 𝑓 𝑋𝑛 = 𝑎𝑋𝑛 + 𝑐
• 𝑔 𝑋𝑛 = 𝑋𝑛

• f is more complicated than g!


More General PRNG
• “Keyed” PRNG given by:
– Transition function: 𝑓: 𝑆 → 𝑆
– Output function: 𝑔: 𝐾 × 𝑆 → 𝑈

– General theme: f is complicated, g is simple


• What if we flipped that?
More General PRNG
• “Keyed” PRNG given by:
– Transition function: 𝑓: 𝑆 → 𝑆
– Output function: 𝑔: 𝐾 × 𝑆 → 𝑈

– General theme: f is complicated, g is simple


• What if we flipped that?

• What if f were so simple that it could be evaluated


explicity?
More General PRNG
• i.e. what if we had:
– Simple transition function (p-bit integer state space):
𝑓 𝑠 = (𝑠 + 1) mod 2𝑝

• This is just a counter! Can expand into explicit formula


𝑓 𝑛 = (𝑛 + 𝑛0 ) mod 2𝑝

• These form counter-based PRNGs


– Complicated output function g

• Would this work?


More General PRNG
• i.e. what if we had:
– Simple transition function f
– Complicated output function g(k, n)
• Should be bijective w/r/to n
– Guarantees period of 2p
• Shouldn’t be too difficult to compute
Bijective Functions
• Cryptographic block ciphers!
– AES (Advanced Encryption Standard), Threefish, …

– Must be bijective!
• (Otherwise messages can’t be encrypted/decrypted)
AES-128 Algorithm
• 1) Key Expansion
– Determine all keys k from initial cipher key kB
• Used to strengthen weak keys

Sohaib Majzoub and Hassan Diab, Reconfigurable


Systems for Cryptography and Multimedia
Applications,
http://www.intechopen.com/source/html/38442/m
edia/image19_w.jpg
AES-128 Algorithm
• 2) Add round key
– Bitwise XOR state s with key k0

By User:Matt Crypto - Own work. Licensed under Public Domain via


Wikimedia Commons - http://commons.wikimedia.org/wiki/File:AES-
AddRoundKey.svg#/media/File:AES-AddRoundKey.svg
AES-128 Algorithm
• 3) For each round… (10 rounds total)
– a) Substitute bytes
• Use lookup table to switch positions

By User:Matt Crypto - Own work. Licensed under Public Domain via


Wikimedia Commons - http://commons.wikimedia.org/wiki/File:AES-
AddRoundKey.svg#/media/File:AES-AddRoundKey.svg
AES-128 Algorithm
• 3) For each round…
– b) Shift rows

By User:Matt Crypto - Own work. Licensed under Public Domain via


Wikimedia Commons - http://commons.wikimedia.org/wiki/File:AES-
AddRoundKey.svg#/media/File:AES-AddRoundKey.svg
AES-128 Algorithm
• 3) For each round…
– c) Mix columns
• Multiply by constant matrix

By User:Matt Crypto - Own work. Licensed under Public Domain via


Wikimedia Commons - http://commons.wikimedia.org/wiki/File:AES-
AddRoundKey.svg#/media/File:AES-AddRoundKey.svg
AES-128 Algorithm
• 3) For each round…
– d) Add round key (as before)

By User:Matt Crypto - Own work. Licensed under Public Domain via


Wikimedia Commons - http://commons.wikimedia.org/wiki/File:AES-
AddRoundKey.svg#/media/File:AES-AddRoundKey.svg
AES-128 Algorithm
• 4) Final round
– Do everything in normal round except mix
columns
AES-128 Algorithm
• Summary:
– 1) Expand keys
– 2) Add round key
– 3) For each round (10 rounds total)
• Substitute bytes
• Shift rows
• Mix columns
• Add round key
– 4) Final round:
• (do everything except mix columns)
Algorithmic Improvements
• We have a good PRNG!
– Simple transition function f
• Counter
– Complicated output function g(k, n)
• AES-128
Algorithmic Improvements
• We have a good PRNG!
– Simple transition function f
• Counter
– Complicated output function g(k, n)
• AES-128

– High quality!
• Passes Crush test suite (more on that later)
– Parallelizable!
• f and g only depend on k, n !
– Sort of slow to compute
• AES is sort of slow without special instructions (which GPUs
don’t have)
Algorithmic Improvements
• Can we “make AES go faster”?
– AES is a cryptographic algorithm, but we’re using it
for PRNG
– Can we change the algorithm for our purposes?
AES-128 Algorithm
• Summary:
– 1) Expand keys
– 2) Add round key
– 3) For each round (10 rounds total)
• Substitute bytes
• Shift rows
• Mix columns
• Add round key
– 4) Final round:
• (do everything except mix columns)
AES-128 Algorithm
Purpose of this step is to
hide key from attacker
• Summary: using chosen plaintext.
Not relevant here.
– 1) Expand keys
– 2) Add round key
– 3) For each round (10 rounds total)
• Substitute bytes
• Shift rows
• Mix columns
• Add round key
– 4) Final round:
• (do everything except mix columns)
AES-128 Algorithm
Purpose of this step is to
hide key from attacker
• Summary: using chosen plaintext.
Not relevant here.
– 1) Expand keys
– 2) Add round key
– 3) For each round (10 rounds total)
• Substitute bytes Do we really need
this many rounds?
• Shift rows
• Mix columns
• Add round key
– 4) Final round: Other changes?
• (do everything except mix columns)
Key Schedule Change
• Old key schedule: • New key schedule:
– The first n bytes of the expanded key are simply the


encryption key.
The rcon iteration value i is set to 1 – k0 = kB
– Until we have b bytes of expanded key, we do the following

– ki+1 = ki + constant
to generate n more bytes of expanded key:
• We do the following to create 4 bytes of expanded key:
– We create a 4-byte temporary variable, t
– We assign the value of the previous four bytes in the


expanded key to t
We perform the key schedule core (see above) on t, with i as
the rcon iteration value
• e.g. golden ratio
– We increment i by 1
– We exclusive-OR t with the four-byte block n bytes before the
new expanded key. This becomes the next 4 bytes in the
expanded key
• We then do the following three times to create the next twelve
bytes of expanded key:
– We assign the value of the previous 4 bytes in the expanded
key to t
– We exclusive-OR t with the four-byte block n bytes before the
new expanded key. This becomes the next 4 bytes in the
expanded key
• If we are processing a 256-bit key, we do the following to
generate the next 4 bytes of expanded key:
– We assign the value of the previous 4 bytes in the expanded
key to t
– We run each of the 4 bytes in t through Rijndael's S-box
– We exclusive-OR t with the 4-byte block n bytes before the
new expanded key. This becomes the next 4 bytes in the
expanded key.

Copied from Wikipedia (Rijndael Key Schedule)


AES-128 Algorithm
• Summary:
– 1) Expand keys using simplified algorithm
– 2) Add round key
– 3) For each round (10 5 rounds total)
• Substitute bytes
• Shift rows
• Mix columns
• Add round key
Other simplifications
– 4) Final round: possible!
• (do everything except mix columns)
Algorithmic Improvements
• We have a good PRNG!
– Simple transition function f
• Counter
– Complicated output function g(k, n)
• Modified AES-128 (known as ARS-5)

– High quality!
• Passes Crush test suite (more on that later)
– Parallelizable!
• f and g only depend on k, n !
– Moderately faster to compute
Even faster parallel PRNGs
• Use a different g, e.g.
– Threefish cipher
• Optimized for PRNG – known as “Threefry”
– “Philox”
• (see paper for details)
• 202 GB/s on GTX580!
– Fastest known PRNG in existence
General Monte Carlo method
• Pseudocode:
for (number of trials):
randomly pick value from a probability distribution What about this?
perform deterministic computation on inputs
Trials are
(aggregate results) Usually so independent
(e.g. with reduction)

• Can we parallelize this?


General Monte Carlo method
• Pseudocode:
for (number of trials):
randomly pick value from a probability distribution Yes!
perform deterministic computation on inputs
Trials are
(aggregate results) Usually so independent
(e.g. with reduction)

• Can we parallelize this?


– Yes!
– Part of cuRAND
Summary
• Monte Carlo methods
– Very useful in scientific simulations
– Parallelizable because of…

• Parallelized random number generation


– Another story of “parallel algorithm analysis”
Credits (again)
• Parallel RNG algorithm and presentation
based on:

– “Parallel Random Numbers: As Easy as 1, 2, 3”


• (Salmon, Moraes, Dror, Shaw) at D. E. Shaw Research

You might also like