Scientific GPU Computing With Go

This document discusses using the Go programming language for scientific GPU computing. It presents a real-world example of using Go and CUDA to simulate micromagnetism. Go allows calling CUDA kernels from C code and has tools for profiling performance. While pure Go number crunching is slower than C, Go is suitable for GPU computing when combined with CUDA via C/C++ libraries. It provides memory safety and garbage collection while interfacing with low-level GPU code.


Scientific GPU computing with Go

A novel approach to highly reliable CUDA HPC


1 February 2014
Arne Vansteenkiste
Ghent University

Real-world example (micromagnetism)


DyNaMat LAB @ UGent: Microscale Magnetic Modeling:

Hard Disks
Magnetic RAM
Microwave components
...
Real-world example (micromagnetism)

[figure: micromagnetic simulation snapshot, 2 nm scale bar]
Real-world example (micromagnetism)


MuMax3 (GPU, script + GUI): ~ 11,000 lines CUDA, Go
(http://mumax.github.io)

Compare to:

OOMMF (script + GUI): ~100,000 lines C++, tcl


Magnum (GPU, script only): ~ 30,000 lines CUDA, C++, Python
How suitable is Go for HPC?
Pure Go number crunching
Go plus {C, C++, CUDA} number crunching
Concurrency

Go is
compiled
statically typed

but also

garbage collected
memory safe
dynamic
Hello, math!
func main() {
	fmt.Println("(1+1e-100)-1 =", (1+1e-100)-1)
	fmt.Println("√-1 =", cmplx.Sqrt(-1))
	fmt.Println("J1(0.3) =", math.J1(0.3))
	fmt.Println("Bi(666, 333) =", big.NewInt(0).Binomial(666, 333))
}

Go math features:

precise compile-time constants: (1+1e-100)-1 = 1e-100
complex numbers: √-1 = (0+1i)
special functions: J1(0.3) = 0.148318816273104
big numbers: Bi(666, 333) = 946274279373497391369043379702061302514484178751053564...

But missing:

matrices
matrix libraries (BLAS, FFT, ...)

Performance
Example: dot product

func Dot(A, B []float64) float64 {
	dot := 0.0
	for i := range A {
		dot += A[i] * B[i]
	}
	return dot
}
Performance
func Dot(A, B []float64) float64{
dot := 0.0
for i := range A{
dot += A[i] * B[i]
}
return dot
}

func BenchmarkDot(b *testing.B) {
	A, B := make([]float64, 1024), make([]float64, 1024)
	sum := 0.0
	for i := 0; i < b.N; i++ {
		sum += Dot(A, B)
	}
	fmt.Fprintln(DevNull, sum) // use result
}

go test -bench .

times all BenchmarkXXX functions. Output:

PASS
BenchmarkDot	 1000000	      1997 ns/op

Profiling
Go has built-in profiling

go tool pprof

outputs your program's call graph with time spent per function
[call graph excerpt from MuMax3: engine.(*_setter).Set → engine.SetTorque (10.2%) → engine.SetLLTorque (9.7%); engine.SetEffectiveField (9.2%) → engine.SetDemagField (7.3%) → engine.demagConv, mag.DemagKernel; plus engine.(*_adder).AddTo, engine.AddExchangeField, engine.AddAnisotropyField]
Performance
Dot product example

Go (gc)                    1 980 ns/op
Go (gccgo -O3)             1 570 ns/op
C (gcc -O3)                1 460 ns/op
C (gcc -march=native)        760 ns/op
Java                       2 030 ns/op
Python                   200 180 ns/op

Typically, Go is ~10% slower than optimized, portable C,
but can be 2x - 3x slower than machine-tuned C.

Pure Go number crunching


On the up side

Good standard math library


Built-in testing, benchmarking & profiling
Managed memory

On the down side

Still slower than machine-tuned C


No matrix libraries etc.
How suitable is Go for HPC?
Pure Go number crunching
Go plus {C, C++, CUDA} number crunching
Concurrency

Hello, GPU!
Go can call C/C++ libs

//#include <cuda.h>
//#cgo LDFLAGS: -lcuda
import "C"
import "fmt"

func main() {
	buf := C.CString(string(make([]byte, 256)))
	C.cuDeviceGetName(buf, 256, C.CUdevice(0))
	fmt.Println("Hello, your GPU is:", C.GoString(buf))
}

Output:

Hello, your GPU is: GeForce GT 650M

Building:

go build

All build information is in the source.


Hello, GPU! (wrappers)
import(
"github.com/barnex/cuda5/cu"
"fmt"
)

func main(){
fmt.Println("Hello, your GPU is:", cu.Device(0).Name())
}

Output:

Hello, your GPU is: GeForce GT 650M

Installing 3rd party code:

go get github.com/user/repo

(dependencies are compiled in)

Calling CUDA kernels (the C way)


GPU (code for one element)

__global__ void add(float *a, float *b, float *c, int N) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < N)
		c[i] = a[i] + b[i];
}

CPU wrapper (divide and launch)

void gpu_add(float *a, float *b, float *c, int N) {
	dim3 block = ...
	add<<<N/BLOCK, BLOCK>>>(a, b, c, N);
}

Go wrapper wrapper

func Add(a, b, c []float32) {
	C.gpu_add(unsafe.Pointer(&a[0]), unsafe.Pointer(&b[0]),
		unsafe.Pointer(&c[0]), C.int(len(a)))
}
Calling CUDA kernels (cuda2go)
cuda2go converts a CUDA kernel to a Go wrapper (calling nvcc once).
Further deployment needs no nvcc or CUDA libs.

Others can then fetch your CUDA project the usual way:

go get github.com/user/my-go-cuda-project

// THIS FILE IS GENERATED BY CUDA2GO, EDITING IS FUTILE


func Add(a, b, c unsafe.Pointer, N int, cfg *config) {
	args := add_args_t{a, b, c, N}
	cu.LaunchKernel(add_code, cfg.Grid.X, cfg.Grid.Y, cfg.Grid.Z,
		cfg.Block.X, cfg.Block.Y, cfg.Block.Z, 0, stream0, ...)
}

// PTX assembly
const add_ptx_20 = `
.version 3.1
.target sm_20
.address_size 64

.visible .entry add(

A note on memory (CPU)


Go is memory-safe, garbage collected.
Your typical C library is not.

Fortunately:

Go is aware of C memory (no accidental garbage collection).


Go properly aligns memory (needed by some HPC libraries)

Allocate in Go, pass to C, let Go garbage collect


A note on memory (GPU)
GPU memory still needs to be managed manually.
But a GPU memory pool is trivial to implement in Go.

var pool = make(chan cu.DevicePtr, 16)

func initPool(){
for i:=0; i<16; i++{
pool <- cu.MemAlloc(BUFSIZE)
}
}

func recycle(buf cu.DevicePtr) {
	pool <- buf
}

func main() {
	initPool()

	GPU_data := <-pool
	defer recycle(GPU_data)
	// ...
}

Vector add example


Adding two vectors on GPU (example from nvidia)

#include "../common/book.h"
#define N 10

int main( void ) {
	int a[N], b[N], c[N];
	int *dev_a, *dev_b, *dev_c;

	// allocate the memory on the GPU
	HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
	HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
	HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

	// fill the arrays 'a' and 'b' on the CPU
	for (int i=0; i<N; i++) {
		a[i] = -i;
		b[i] = i * i;
	}

	// copy the arrays 'a' and 'b' to the GPU
	HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ) );
	HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice ) );

	add<<<N,1>>>( dev_a, dev_b, dev_c );

	// copy the array 'c' back from the GPU to the CPU
	HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) );

	// display the results
	for (int i=0; i<N; i++) {
		printf( "%d + %d = %d\n", a[i], b[i], c[i] );
	}
	...
}
Vector add example
Adding two vectors on GPU (Go)

package main

import (
	"fmt"

	"github.com/mumax/3/cuda"
)

func main() {
	N := 3
	a := cuda.NewSlice(N)
	b := cuda.NewSlice(N)
	c := cuda.NewSlice(N)
	defer a.Free()
	defer b.Free()
	defer c.Free()

	a.CopyHtoD([]float32{0, -1, -2})
	b.CopyHtoD([]float32{0, 1, 4})

	cfg := Make1DConfig(N)
	add_kernel(a.Ptr(), b.Ptr(), c.Ptr(), cfg)

	fmt.Println("result:", c.HostCopy())
}

Go plus {C, C++, CUDA} number crunching


On the downside

Have to write C wrappers

On the upside

You can call C


Have Go manage your C memory
How suitable is Go for HPC?
Pure Go number crunching
Go plus {C, C++, CUDA} number crunching
Concurrency

Real-world concurrency (MuMax3)


There's more to HPC than number crunching and memory management:

I/O
Interactive supercomputing
...
Real-world concurrency (MuMax3)
Output: GPU does not wait for hard disk

[diagram: the user script drives the main loop (1 thread) on the GPU; output passes through a chan to async I/O workers (16 threads), so the GPU never waits for the disk]
Real-world concurrency (MuMax3)


Go channels are like type-safe UNIX pipes between threads.

var pipe = make(chan []float64, BUFSIZE)

func runIO(){
for{
data := <- pipe // receive data from main
save(data)
}
}

func main() {
	go runIO() // start I/O worker
	pipe <- data // send data to worker
}

Real example: 60 lines Go, ~2x I/O speed-up


Real-world concurrency (MuMax3)
You can send function closures over channels.

var pipe = make(chan func()) // channel of functions

func main() {
for {
select{
case f := <- pipe: // execute function if in pipe
f()
default: doCalculation() // nothing in pipe, crunch on
}
}
}

func serveHttp() {
	pipe <- func(){ value = 2 } // send function to main loop
	...
}

Concurrency without mutex locking/unlocking.

Real-world concurrency (MuMax3)


GUI: change parameters while running,
without race conditions

[diagram: the GUI http server (1 thread per request) sends closures over a chan into the main loop (1 thread) driving the GPU, alongside the user script and async I/O (16 threads)]
And we can prove it's thread-safe
Go has built-in testing for race conditions

go build -race

enables the race detector. Output if things go wrong:

==================
WARNING: DATA RACE
Write by goroutine 3:
  main.func001()
      /home/billgates/buggycode/race.go:10 +0x38

Previous read by main goroutine:
  main.main()
      /home/billgates/buggycode/race.go:21 +0x9c

Goroutine 3 (running) created at:
  main.main()
      /home/billgates/buggycode/race.go:12 +0x33
==================

Go concurrency
On the up side

Easy, safe, built-in concurrency

On the down side

There is no downside
Demonstration
Input script

setgridsize(512, 256, 1)
setcellsize(5e-9, 5e-9, 5e-9)
ext_makegrains(40e-9, 256, 0)

Aex = 10e-12 // J/m
Msat = 600e3 // A/m
alpha = 0.1
m = uniform(0, 0, 1)

// set random parameters per grain
for i:=0; i<256; i++{
	AnisU.SetRegion(i, vector(0.1*(rand()-0.5), 0.1*(rand()-0.5), 1))

	for j:=i+1; j<256; j++{
		ext_scaleExchange(i, j, rand())
	}
}

// Write field
f := 0.5e9 // Hz
B_ext = sin(2*pi*f*t)

// spin HD and write


Demonstration
Hard disk magnetization (white = up = 1, black = down = 0)
Thank you
Arne Vansteenkiste
Ghent University
Arne.Vansteenkiste@Ugent.be
http://mumax.github.io
