Algorithmic Skeletons
Chris Cummins
Acknowledgements
I would like to sincerely thank supervisors Hugh Leather and Pavlos Petoumenos
for their excellent guidance, for suffering through early revisions of this work, and
for making the beginning of my metamorphosis from fledgling undergraduate to
cynical postgraduate an utter joy. My thanks to Michel Steuwer for his patient
and enthusiastic expositions of the SkelCL internals, and to Murray Cole and the
CDT staff for creating this great environment for learning. Finally, thank you to
my fellow PPar colleagues who share my love of strong coffee and weak humour.
Contents
1 Introduction 1
1.1 Sacrificing Performance for Ease of Use . . . . . . . . . . . . . . . 2
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Algorithmic Skeletons . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Abstracting Task and Data Parallelism . . . . . . . . . . . 9
2.2.2 Algorithmic Skeleton Frameworks . . . . . . . . . . . . . . 10
2.3 GPGPU Programming . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 The OpenCL Programming Model . . . . . . . . . . . . . 11
2.4 High-Level GPU Programming with SkelCL . . . . . . . . . . . . 13
2.4.1 Pattern definitions . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . 18
2.5 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Related Work 25
3.1 Automating Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Iterative Compilation & Machine Learning . . . . . . . . . . . . . 27
3.2.1 Training with Synthetic Benchmarks . . . . . . . . . . . . 29
3.3 Performance Tuning for Heterogeneous Parallelism . . . . . . . . . 30
3.4 Autotuning Algorithmic Skeletons . . . . . . . . . . . . . . . . . . 31
3.5 Code Generation and Autotuning for Stencils . . . . . . . . . . . 31
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7 Evaluation 59
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.2 Statistical Soundness . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.3 Workgroup Size Optimisation Space . . . . . . . . . . . . . . . . . 62
7.3.1 Oracle Workgroup Sizes . . . . . . . . . . . . . . . . . . . 62
7.3.2 Workgroup Size Legality . . . . . . . . . . . . . . . . . . . 63
7.3.3 Baseline Parameter . . . . . . . . . . . . . . . . . . . . . . 67
7.3.4 Speedup Upper Bounds . . . . . . . . . . . . . . . . . . . 68
7.3.5 Human Expert . . . . . . . . . . . . . . . . . . . . . . . . 70
7.3.6 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.4 Autotuning Workgroup Sizes . . . . . . . . . . . . . . . . . . . . . 73
7.4.1 Evaluating Classifiers . . . . . . . . . . . . . . . . . . . . . 76
7.4.2 Evaluating Regressors . . . . . . . . . . . . . . . . . . . . 77
7.4.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . 78
7.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8 Conclusions 87
8.1 Critical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
List of Figures
7.16 Classification error heatmaps . . . . . . . . . . . . . . . . . . . . . 85
List of Tables
Chapter 1
Introduction
This does not require perfect predictions, but simply sufficiently performant results so that, when combined with the vast accessibility improvements offered by algorithmic skeletons, they provide a convincing argument for algorithmic skeletons as the solution to the parallel programmability crisis.
1.2 Contributions
The key contributions of this thesis are:
Figure 1.1: Workgroup size optimisation space of a stencil benchmark across devices:
(a) Intel CPU, (b) NVIDIA GPU. This shows that stencil performance is dependent
on properties of the underlying architecture, with different optimal workgroup sizes
(56 × 20 vs. 64 × 4) for the two devices shown.
1.3 Motivation
In this section I present the case for autotuning the workgroup size of SkelCL
stencil skeletons. The stencil workgroup size presents a two-dimensional parameter space, consisting of a number of rows and columns. It is constrained by properties of both the stencil code and the underlying architecture. For a detailed discussion of the parameter space and experimental methodology, see Chapters 2 and 6.
By comparing the mean runtime of a stencil program using different workgroup sizes while keeping all other conditions constant, we can assess the relative performance of different points in the optimisation space. Plotting this two-dimensional space as a three-dimensional bar chart provides a quick visual overview: the two horizontal axes represent the number of rows and columns in a workgroup, while the height of each bar shows the performance of the program at that point in the space (higher is better).
If the performance of workgroup sizes were not dependent on the execution
device, we would expect the relative performance of points in the optimisation
space to be consistent across devices. As shown in Figure 1.1, this is not the case,
with the optimisation space of the same benchmark on different devices being radically different. Not only does the optimal workgroup size change between devices; as Figure 1.2 shows, it also changes between different stencil programs running on the same device.
Figure 1.2: Workgroup size optimisation space of two stencils on the same device.
Despite the underlying hardware being the same, the relative performance of work-
group sizes varies greatly between the two programs. The optimal workgroup sizes
are 128 × 2 and 256 × 4 respectively.
1.4 Structure
The remainder of the document is structured as follows:
1.5 Summary
This introductory chapter has outlined the need for higher levels of abstraction for parallel programming and the difficulty this poses for performance tuning. It advocates the use of adaptive tuning for algorithmic skeletons, and describes the contributions of this thesis towards this goal. In the next chapter, I provide an overview of the techniques and methodology used in this thesis.
Chapter 2
Background
2.1 Introduction
This chapter provides a detailed though non-exhaustive description of the theory
and techniques used in this thesis. First is an overview of algorithmic skeletons,
GPGPU programming, and the combination of the two in SkelCL. Second is
an overview of the machine learning techniques used in this thesis, followed by a description of the statistical tools and methodologies used in the evaluation.
Figure 2.1: Typical usage of a library based Algorithmic Skeleton Framework. Other
approaches to algorithmic skeletons involve embedding the API into the core language
itself, or using template and macro substitution to remove the need for linking with
a library.
data parallel operations provided by SkelCL are described in detail in Section 2.4.
Task parallel skeletons treat the data as a singular object and instead parallelise the execution of multiple tasks. Tasks are assigned to threads, which can communicate by passing data between one another. Examples of task parallel algorithmic skeletons include pipe, task farm, and for loops.
Figure 2.2: The OpenCL memory model. The host communicates with each device
through transfers between global memory spaces. The capacity of each type of mem-
ory is dependent on the device hardware. In general, private memory is the fastest
and smallest, and global memory is the largest and slowest.
in the presence of flow control, as workitems must be stalled across diverging flow paths.
Memory Model
Unlike the flat model of CPUs, OpenCL uses a hierarchical memory model. The host and each OpenCL device have a single global memory address space. Each workgroup has a local memory space, and each workitem has a region of private memory.
Workgroups cannot access the memory of neighbouring workgroups, nor can
workitems access the private memory of other workitems. OpenCL provides syn-
chronisation barriers to allow for communication between workitems within a
single workgroup via the local memory, but no global barriers. Memory transfers between the host and devices occur between global memory regions. When programming heterogeneous devices, these transfers must occur over the connection bus between the CPU and the device (e.g. PCIe for discrete GPUs), which typically creates a performance bottleneck: data must be transferred to the device for processing, then back to the host afterwards. Direct transfers of data between devices are not supported, requiring an intermediate transfer through host memory.
Performance Optimisations
Map
The Map operation is a basic building block of data parallel algorithms. Given a customising function f and a vector x of n elements, the Map operation applies the function f to each element x1, x2, . . . , xn, returning a vector of the same length:

\text{Map}\left(f, [x_1, x_2, \ldots, x_n]\right) \rightarrow [f(x_1), f(x_2), \ldots, f(x_n)]
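To make the pattern concrete, here is a minimal sequential sketch of the Map semantics in Python. It is illustrative only: SkelCL expresses this as a C++ skeleton whose customising function is compiled into an OpenCL kernel.

```python
def map_skeleton(f, xs):
    """Data-parallel Map: apply the customising function f to every element
    independently. A sequential sketch of the semantics; the parallel
    implementation assigns one element to each OpenCL workitem."""
    return [f(x) for x in xs]

# Example: doubling every element of a vector.
assert map_skeleton(lambda x: 2 * x, [1, 2, 3]) == [2, 4, 6]
```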
Zip
Reduce
The Reduce operator combines all elements of an input vector and returns a
scalar. Given a binary operator ⊕, its identity i and a vector x of n elements:

\text{Reduce}\left(\oplus, i, [x_1, x_2, \ldots, x_n]\right) \rightarrow x_1 \oplus x_2 \oplus \ldots \oplus x_n

For an n × m matrix:

\text{Reduce}\left(\oplus, i, \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix}\right) \rightarrow x_{11} \oplus x_{12} \oplus \ldots \oplus x_{nm} \qquad (2.6)
Reduction operates by recursively combining elements, storing intermediate re-
sults in fast local memory. No guarantee is made on execution ordering, so the
binary operator must be associative.
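A sequential sketch of the Reduce semantics (illustrative only). Note that because the parallel implementation combines elements in an unspecified order, a non-associative operator could give different results to this left-to-right version.

```python
from functools import reduce as foldl

def reduce_skeleton(op, identity, xs):
    """Data-parallel Reduce: combine all elements into a scalar using the
    binary operator op and its identity. op must be associative, since the
    parallel implementation combines elements recursively (e.g. as a tree)
    rather than strictly left to right."""
    return foldl(op, xs, identity)

# Example: summation, for which the identity of + is 0.
assert reduce_skeleton(lambda a, b: a + b, 0, [1, 2, 3, 4]) == 10
```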
Scan
The Scan operator applies a binary function ⊕ to each member of an input array x of n elements, such that the element at the ith position is calculated by applying the binary function to all elements x1, x2, . . . xi−1. For the first element x1, the value is the identity i of the ⊕ operator:

\text{Scan}\left(\oplus, i, [x_1, x_2, \ldots, x_n]\right) \rightarrow [i,\; x_1,\; x_1 \oplus x_2,\; \ldots,\; x_1 \oplus x_2 \oplus \ldots \oplus x_{n-1}]
The SkelCL scan implementation is heavily optimised to make use of local mem-
ory, based on the design from [HSO07].
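The same idea as a sequential sketch of the Scan definition above; the SkelCL implementation instead uses the tree-based, local-memory-optimised approach cited in the text.

```python
def scan_skeleton(op, identity, xs):
    """Exclusive Scan: output[i] is the combination of all elements before
    position i, so output[0] is the identity of op."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

# Example: exclusive prefix sum.
assert scan_skeleton(lambda a, b: a + b, 0, [1, 2, 3]) == [0, 1, 3]
```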
AllPairs
The AllPairs skeleton is a specialised pattern for matrix-matrix operations, introduced in [Ste+14]. Given a binary operator ⊕ and two matrices of size n × d and m × d:

\text{AllPairs}\left(\oplus, \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix}, \begin{bmatrix} y_{11} & \cdots & y_{1d} \\ \vdots & \ddots & \vdots \\ y_{m1} & \cdots & y_{md} \end{bmatrix}\right) \rightarrow \begin{bmatrix} z_{11} & \cdots & z_{1m} \\ \vdots & \ddots & \vdots \\ z_{n1} & \cdots & z_{nm} \end{bmatrix} \qquad (2.8)

where:

z_{ij} = [x_{i1}, x_{i2}, \ldots, x_{id}] \oplus [y_{j1}, y_{j2}, \ldots, y_{jd}] \qquad (2.9)

An additional implementation is provided for when the ⊕ operator is known to match that of a zip pattern:

z_{ij} = [x_{i1} \oplus y_{j1},\; x_{i2} \oplus y_{j2},\; \ldots,\; x_{id} \oplus y_{jd}] \qquad (2.10)
Stencil
Figure 2.3: The border region of a stencil is defined using a stencil shape S, consisting
of the four independent components describing the number of cells north Sn , east
Se , west Sw , and south Ss .
where:

z_{ij} = f\left(\begin{bmatrix} x_{i-S_n,\, j-S_w} & \cdots & x_{i-S_n,\, j+S_e} \\ \vdots & \ddots & \vdots \\ x_{i+S_s,\, j-S_w} & \cdots & x_{i+S_s,\, j+S_e} \end{bmatrix}\right) \qquad (2.12)
Note that a stencil operation in which the size of the stencil shape S is zero in
every direction is functionally equivalent to a Map operation. Where the border
region includes elements outside of the matrix, values are substituted from either
a predefined padding value, or the value of the nearest cell within the matrix,
depending on user preference.
A popular usage of Stencil codes is for solving problems iteratively, whereby a
stencil operation is repeated over a range of discrete time steps 0 ≤ t ≤ tmax , and
t ∈ Z. An iterative stencil operation g accepts a customising function f , a Stencil
shape S, and a matrix M with initial values Minit . The value of an iterative
stencil can be defined recursively as:
g(f, S, M, t) = \begin{cases} \text{Stencil}\left(f, S, g(f, S, M, t-1)\right), & \text{if } t \geq 1 \\ M_{init}, & \text{otherwise} \end{cases} \qquad (2.13)
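A sequential Python sketch of the single-step Stencil semantics, including the two boundary handling options described above (a fixed padding value, or the value of the nearest cell within the matrix). The function and parameter names are illustrative, not the SkelCL API.

```python
def stencil_skeleton(f, shape, grid, pad=None):
    """Apply the customising function f over a 2D grid.

    shape is the border size (north, south, east, west). Out-of-range
    neighbours are replaced by `pad` if given, otherwise by the value of
    the nearest cell within the grid."""
    north, south, east, west = shape
    rows, cols = len(grid), len(grid[0])

    def cell(i, j):
        if 0 <= i < rows and 0 <= j < cols:
            return grid[i][j]
        if pad is not None:
            return pad
        return grid[min(max(i, 0), rows - 1)][min(max(j, 0), cols - 1)]

    return [[f([[cell(i + di, j + dj) for dj in range(-west, east + 1)]
                for di in range(-north, south + 1)])
             for j in range(cols)]
            for i in range(rows)]

# Example: a mean filter with a one-cell border in every direction.
mean9 = lambda tile: sum(sum(row) for row in tile) / 9.0
smoothed = stencil_skeleton(mean9, (1, 1, 1, 1), [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```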
Stencil Implementation
For the stencil skeleton, each cell maps to a single workitem, and this collection of workitems is then divided into workgroups for execution on the target hardware. In a stencil code, each workitem reads the value of the corresponding grid element and the surrounding elements defined by the border region. Since the border regions of neighbouring elements overlap, the values of all elements within a workgroup are stored in a tile, which is a region of local memory. As local memory access times are much smaller than those of global memory, this greatly reduces the latency of the border region reads performed by each workitem. Changing the workgroup size thus affects the amount of local memory required for each workgroup, which in turn affects the number of workgroups which may be simultaneously active. So while the user defines the size, type, and border region of the grid being operated upon, it is the responsibility of the SkelCL stencil implementation to select an appropriate workgroup size to use.
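To make the local memory trade-off concrete, a rough back-of-the-envelope sketch (a hypothetical helper; it assumes one tile element per grid cell and, by default, single precision data):

```python
def tile_local_memory_bytes(wg_cols, wg_rows, shape, element_size=4):
    """Estimate the local memory needed for one workgroup's tile.

    The tile covers the workgroup plus the outer border region, so larger
    workgroups (or wider borders) need more local memory, which in turn
    limits how many workgroups can be simultaneously active."""
    north, south, east, west = shape
    tile_cols = wg_cols + east + west
    tile_rows = wg_rows + north + south
    return tile_cols * tile_rows * element_size

# Example: a 64x4 workgroup with a one-cell border on every side.
assert tile_local_memory_bytes(64, 4, (1, 1, 1, 1)) == 66 * 6 * 4
```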
Figure 2.4: The components of a stencil: a grid of cells is decomposed into work-
groups, consisting of multiple workitems. Each work item operates on a border region,
and the size of the workgroup and outer border region defines a tile, which is a region
of local memory allocated to each workgroup.
ZeroR
ZeroR is the simplest possible classifier: it ignores all features and always predicts the most frequently occurring class in the training data. It has no predictive power of its own, but provides a useful baseline against which other classifiers can be compared.
Naive Bayes
Naive Bayes classifiers are probabilistic classifiers which assume strong independence between features. That is, the value of one feature is considered independently of the others, which is what lends the "naive" portion of the name. The goal of Naive Bayes is to predict the probability of a class y, given a set of d independent variables x = x1, x2, . . . xd. Naive Bayes applies Bayes' theorem, which states that given a prior (i.e. unconditional) probability for each class P(y), a class-conditional model P(x|y), and a normaliser across all classes P(x) = \sum_{y'} P(x|y') P(y'), the probability of a class y given the independent variables x is:

P(y|x) = \frac{P(x|y)\, P(y)}{P(x)}
The class which has the highest probability from the set of possible classes Y is
the prediction. Note that the normaliser P (x) does not affect the class which is
most likely. The class conditional model uses “counts” for each observation based
on training data:
P(x|y) = \prod_{i=1}^{d} P(x_i|y) \qquad (2.15)
The simplicity of Naive Bayes makes it attractive for various purposes such as
text classification (using word frequency as features), but the assumption of in-
dependence makes it unsuitable for certain use cases.
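A compact sketch of Naive Bayes prediction over categorical features. It is illustrative only (not the implementation used in this thesis); it omits the normaliser P(x), which does not affect the argmax, and assumes add-one smoothing to avoid zero counts.

```python
import math
from collections import Counter, defaultdict

def naive_bayes_predict(x, training):
    """Predict the most probable class for the feature vector x, given
    training data as a list of (features, label) pairs."""
    class_counts = Counter(label for _, label in training)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> observed values
    for feats, label in training:
        for i, v in enumerate(feats):
            value_counts[(i, label)][v] += 1
            vocab[i].add(v)

    def log_posterior(y):
        score = math.log(class_counts[y] / len(training))  # log P(y)
        for i, v in enumerate(x):                           # + sum of log P(x_i|y)
            num = value_counts[(i, y)][v] + 1               # add-one smoothing
            den = class_counts[y] + len(vocab[i])
            score += math.log(num / den)
        return score

    return max(class_counts, key=log_posterior)

# Example: predict whether a kernel is "simple" or "complex" from two features.
data = [(("int", "low"), "simple"), (("float", "high"), "complex"),
        (("float", "low"), "simple"), (("double", "high"), "complex")]
assert naive_bayes_predict(("float", "high"), data) == "complex"
```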
Decision Trees
\text{Gain}(D, x) = H(D) - \sum_{V \in \text{Values}(x)} \frac{|D_V|}{|D|} H(D_V) \qquad (2.17)
Decision trees are a very popular and low-overhead form of classification, as they can be implemented using a nested set of if/else statements. However, they can often overfit to the training data.
Random Forest
Random Forests are an ensemble technique which attempts to overcome the tendency of decision trees to overfit to training data by training multiple decision trees and taking the mode of the individual trees' predictions [Bre99]. Alternatively they can be trained using regression trees, in which case the predicted value is the mean output of all trees.
2.6 Statistics
This section describes a number of the statistical tools used throughout this thesis.
Interquartile Range
For a given set of ordered data values, the quartiles are the three points that divide the data set into four groups of equal size. The first quartile Q1 is the point which divides the data between the lowest value and the median, Q2. The third quartile Q3 is the point which divides the data between the median and the highest value. The interquartile range (IQR) measures the dispersion of a data set, and is the difference between the third and first quartiles, Q3 − Q1.
Confidence Intervals
Confidence intervals estimate the range within which the true value of a population parameter lies. Given a sample of a population and an estimate of the parameter computed from it, the confidence interval (c1, c2) is the range which, over repeated sampling, contains the true parameter value with a chosen frequency. Given a sample x with sample size n, the confidence interval for a confidence α is found using:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (2.18)

\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad (2.19)

c_1 = \bar{x} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \qquad (2.20)

c_2 = \bar{x} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \qquad (2.21)
where the value z_{1-\alpha/2} assumes a Gaussian distribution of the underlying data, and values for common choices of α are typically found using pre-computed "Z-tables". To calculate a confidence interval for the difference of two means, x̄1 and x̄2, with sample sizes n1 and n2 and respective standard deviations σ1 and σ2:
\sigma_x = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \qquad (2.22)

c_1 = \bar{x}_1 - \bar{x}_2 - z_{1-\alpha/2} \sigma_x \qquad (2.23)

c_2 = \bar{x}_1 - \bar{x}_2 + z_{1-\alpha/2} \sigma_x \qquad (2.24)
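A small Python sketch of equations (2.18)-(2.21) for a 95% confidence level, using the usual z ≈ 1.96 rather than a Z-table lookup:

```python
import math

def confidence_interval(samples, z=1.96):
    """Return (c1, c2), the confidence interval of the sample mean,
    following equations (2.18)-(2.21). z = 1.96 corresponds to a 95%
    confidence level under the Gaussian assumption."""
    n = len(samples)
    mean = sum(samples) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Example: runtimes in milliseconds.
c1, c2 = confidence_interval([102.1, 99.8, 101.3, 100.4, 103.0])
```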
Histogram
Histograms are a widely used statistical visualisation which shows the distribution of numerical data. The data is first divided into a set of discrete, equally sized sub-intervals, or bins. The number of data points in each bin is used to visualise the density distribution. The shortcoming of histograms is that their appearance is heavily influenced by three user-selected parameters: the number of bins, the width of bins (binwidth), and the endpoints chosen. As such, they may provide a misleading representation of the data if inappropriate values for any of these parameters are chosen. Kernel Density Estimation is a technique for showing the distribution of data which circumvents some of these issues.
\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \qquad (2.25)
Using a smooth kernel such as a Gaussian distribution produces a smooth density estimate, unlike histograms. However, like histograms, the appearance of a Kernel Density Estimate plot is dependent on the value of the bandwidth parameter h (equivalent to binwidth in histograms), so care must be taken to select a value which minimises over- or under-smoothing. Grouped data can be shown by plotting multiple KDEs on the same axes, although if the number of groups is large, a box plot or violin plot may be more appropriate.
Box plot
Box plots are used to show the distribution of quartile ranges for grouped data. They contain the following features:
• Vertical lines above and below the upper and lower quartiles to the most
extreme data point within 1.5 IQR of the upper/lower quartile, with hori-
zontal whiskers at the end of the vertical lines.
A variation of box plots used in this thesis is the violin plot, which extends the box
plot with a fitted Kernel Density Estimate plot to show the probability density
of data at different values. To construct a violin plot, KDEs for each group are
rotated and mirrored to generate a smoothed visualisation of the distribution.
2.7 Summary
This chapter gave a brief introduction to general-purpose computing using graph-
ics hardware, the OpenCL framework, and SkelCL, followed by a description of the machine learning techniques used throughout this thesis, and the statistical tools used for the evaluation.
of the art in the field of autotuning, and provide context for the contributions of
this thesis.
Chapter 3
Related Work
This chapter begins with a brief survey of the broad field of literature that is
relevant to algorithmic skeletons. This is followed by a review of the current
state of the art in autotuning research, focusing on heterogeneous parallelism,
algorithmic skeletons, and stencil codes. It presents the context and rationale for
the research undertaken for this thesis.
Both of these challenges are extremely hard to tackle. For the former, the diffi-
culties lie in performing accurate analysis of code behaviour. Obtaining accurate
dynamic dependency analysis at compile time is an unsolved problem, as is re-
solving pointers and points-to analysis [Atk13; GLS01; Hin01].
The result of these challenges is that reliably performant, automatic parallelisation of arbitrary programs remains a distant goal; however, there are many noteworthy variations on the theme which have achieved some measure of success.
One such example is speculative parallelism, which circumvents the issue of
having incomplete dependency information by speculatively executing code re-
gions in parallel while performing dependency tests at runtime, with the possi-
bility to fall back to “safe” sequential execution if correctness guarantees are not
met [Pra10; TG10]. In [Jim+14], Jimborean et al. present a system which com-
bines polyhedral transformations of user code with binary algorithmic skeleton
implementations for speculative parallelisation, reporting speedups over sequen-
tial code of up to 15.62× on a 24 core processor.
Another example is PetaBricks, which is a language and compiler enabling
parallelism through “algorithmic choice” [AC10; Ans09]. With PetaBricks, users
provide multiple implementations of algorithms, optimised for different parame-
ters or use cases. This creates a search space of possible execution paths for a
given program. This has been combined with autotuning techniques for enabling
optimised multigrid programs [Cha+09], with the wider ambition that these au-
totuning techniques may be applied to all algorithmic choice programs [Ans14].
While this helps produce efficient parallel programs, it places a great burden on
the developer, requiring them to provide enough contrasting implementations to
make a search of the optimisation space fruitful.
Annotation-driven parallelism takes a similar approach. The user annotates
their code to provide “hints” to the compiler, which can then perform parallelis-
ing transformations. A popular example of this is OpenMP, which uses com-
piler pragmas to mark code sections for parallel or vectorised execution [DE98].
Previous work has demonstrated code generators for translating OpenMP to
OpenCL [GWO13] and CUDA [LME09]. Again, annotation-driven parallelism
suffers from placing a burden on the developer to identify the potential areas for
parallelism, and lacks the structure that algorithmic skeletons provide.
and software stacks; and the difficulty of validating techniques due to a lack of sharing in publications. They propose a system for addressing these concerns,
the Collective Mind knowledge system, which, while in early stages of ongoing
development, is intended to provide a modular infrastructure for sharing auto-
tuning performance data and related artifacts across the internet. In addition
to sharing performance data, the approach taken in this thesis emphasises the
collective exploitation of such performance data, so that data gathered from one
device may be used to inform the autotuning decisions of another. This requires
each device to maintain local caches of shared data to remove the network over-
head that would be present from querying a single centralised data store during
execution of a hot path. The current implementation of Collective Mind uses a
NoSQL JSON format for storing performance data. The relational schema used
in this thesis offers greater scaling performance and lower storage overhead as the
amount of performance data grows.
Whereas iterative compilation requires an expensive offline training phase to
search an optimisation space, dynamic optimisers perform this optimisation space
exploration at runtime, allowing programs to respond to dynamic features “on-
line”. This is a challenging task, as a random search of an optimisation space may
result in configurations with vastly suboptimal performance. In a real world sys-
tem, evaluating many suboptimal configurations will cause a significant slowdown
of the program. Thus a requirement of dynamic optimisers is that convergence
time towards optimal parameters is minimised.
Existing dynamic optimisation research has typically taken a low level ap-
proach to performing optimisations. Dynamo is a dynamic optimiser which per-
forms binary level transformations of programs using information gathered from
runtime profiling and tracing [BDB00]. While this provides the ability to respond
to dynamic features, it restricts the range of optimisations that can be applied
to binary transformations. These low level transformations cannot match the
performance gains that higher level parameter tuning produces.
An interesting related tangent to iterative compilation is the development of
so-called “superoptimisers”. In [Mas87], the smallest possible program which per-
forms a specific function is found through a brute force enumeration of the entire
instruction set. Starting with a program of a single instruction, the superopti-
miser tests to see if any possible instruction passes a set of conformity tests. If
not, the program length is increased by a single instruction and the process re-
peats. The exponential growth in the size of the search space is far too expensive
for all but the smallest of hot paths, typically less than 13 instructions. The
optimiser is limited to register-to-register transfers, with no support for pointers, a limitation which is addressed in [JNR02]. Denali is a superoptimiser which uses constraint satisfaction and rewrite rules to generate programs which are provably optimal, instead of searching for the optimal configuration through empirical testing. Denali first translates low level machine code into guarded multi-assignment form, then uses a matching algorithm to build a graph of the ways in which a set of logical axioms match parts of the program, before using boolean satisfiability to disprove the conjecture that the program cannot be written in n instructions. If the conjecture cannot be disproved, n is increased and the process repeats.
3.6 Summary
There is already a wealth of research literature on the topic of autotuning, which begs the question: why isn't the majority of software autotuned? In this chapter I attempted to answer that question by reviewing the state of the art in the autotuning literature, with specific reference to autotuning for GPUs and stencil codes. The bulk of this research falls prey to one of two shortcomings. Either they
identify and develop a methodology for tuning a particular optimisation space
but then fail to develop a system which can properly exploit this (for example,
by using machine learning to predict optimal values across programs), or they
present an autotuner which targets too narrow a class of optimisations to be widely applicable. This project attempts to address both of those shortcomings
by expending great effort to deliver a working implementation which users can
download and use without any setup costs, and by providing a modular and
extensible framework which allows rapid targeting of new autotuning platforms,
enabled by a shared autotuning logic and distributed training data. The following
chapter outlines the design of this system.
Chapter 4
OmniTune - an Extensible, Dynamic Autotuner
4.1 Introduction
In previous chapters I have advocated the use of machine learning for autotuning.
In this chapter, I present OmniTune, a framework for extensible, distributed
autotuning using machine learning. OmniTune provides a replacement for the
kinds of ad-hoc tuning typically performed by expert programmers by providing
runtime prediction of tunable parameters using collaborative, online learning of
optimisation spaces. First I give a high-level overview of the approach to autotuning, then I describe the system architecture and the set of interfaces exposed by OmniTune.
Figure 4.1: The process of selecting optimisation parameter values for a given user
program with OmniTune.
Figure: The OmniTune system architecture. Applications on each local machine communicate with a local OmniTune server through a DBUS client interface; each server maintains local caches and communicates with a shared remote server over TCP, which stores the training data.
overhead for the target applications; the second advantage is that this enables
collective tuning, in which training data gathered from a range of devices can be
accessed and added to by any OmniTune server.
Clients interact with the server using three operations:

• Request(x) : p
• RequestTraining(x) : p
• Submit(x, p, y)
This set of operations enables the core functionality of an autotuner, while pro-
viding flexibility for the client to control how and when training data is collected.
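As an illustration of the client side of this interface, here is a hypothetical Python client issuing the Request(x) operation over DBUS. The bus name, object path, interface, and method signature are assumptions for the sake of the example, not the actual OmniTune API.

```python
import dbus

# Hypothetical names; the real values are defined by the OmniTune server.
BUS_NAME = "org.omnitune"
OBJECT_PATH = "/org/omnitune/skelcl"
INTERFACE = "org.omnitune.skelcl"

def request_workgroup_size(features):
    """Ask the local OmniTune server for a workgroup size, i.e. the
    Request(x) : p operation, given a list of extracted feature values."""
    bus = dbus.SessionBus()
    proxy = bus.get_object(BUS_NAME, OBJECT_PATH)
    server = dbus.Interface(proxy, dbus_interface=INTERFACE)
    wg_c, wg_r = server.Request(features)
    return int(wg_c), int(wg_r)
```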
These local caches are periodically and asynchronously synced with the remote, to maintain a global store of lookup tables for expensive operations.
On launch, the server requests the latest training data from the remote, which
it uses to build the relevant models for performing prediction of optimisation
parameter values. Servers communicate with remotes by submitting or requesting
training data in batches, using two operations:

• Push(x, p, y)
• Pull() : x, p, y
4.4 Summary
This chapter has described the architecture of OmniTune, a distributed autotuner which is capable of performing runtime prediction of optimal workgroup
sizes using a variety of machine learning approaches. OmniTune uses a client-
server model to decouple the autotuning logic from target programs and to main-
tain separation of concerns. It uses lightweight inter-process communication to
achieve low latency autotuning, and uses caches and lookup tables to minimise
the one-off costs of feature extraction.
Chapter 5
Autotuning SkelCL Stencils
5.1 Introduction
In this chapter I apply the OmniTune framework to SkelCL. The publicly available implementation¹ predicts workgroup sizes for OpenCL stencil skeleton kernels in order to minimise their runtime on CPUs and multi-GPU systems. The optimisation space presented by the workgroup size of OpenCL kernels is large, complex, and non-linear. Successfully applying machine learning to such a space requires plentiful training data, careful selection of features, and an appropriate classification approach. The following sections address these challenges.
5.2 Training
One challenge of performing empirical performance evaluations is gathering enough applications to ensure meaningful comparisons. Synthetic benchmarks are one technique for circumventing this problem. The automatic generation of such benchmarks has clear benefits for reducing evaluation costs; however, creating meaningful benchmark programs is difficult if we are to avoid redundant computation and produce benchmarks which provably halt.
In practice, stencil codes exhibit many common traits: they have a tightly constrained interface, predictable memory access patterns, and well defined numerical input and output data types. This can be used to create a confined space of possible stencil codes by enforcing upper and lower bounds on properties of the codes which cannot normally be guaranteed for general-purpose programs,
1 https://github.com/ChrisCummins/omnitune
e.g. the number of floating point operations. In doing so, it is possible to programmatically generate stencil workloads which share similar properties with those which we intend to target.
Based on observations of real world stencil codes from the fields of cellular
automata, image processing, and PDE solvers, I implemented a stencil generator
which uses parameterised kernel templates to produce source codes for collecting
training data. The stencil codes are parameterised by stencil shape (one parame-
ter for each of the four directions), input and output data types (either integers,
or single or double precision floating points), and complexity — a simple boolean
metric for indicating the desired number of memory accesses and instructions per
iteration, reflecting the relatively bi-modal nature of the reference stencil codes,
either compute intensive (e.g. FDTD simulations), or lightweight (e.g. Game of
Life).
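A simplified sketch of the idea behind such a generator. The template shown here is hypothetical (including the neighbour accessor get(y, x)); the real generator produces complete SkelCL stencil customising functions.

```python
import random

def make_synthetic_stencil(shape, data_type="float", complex_kernel=False, seed=0):
    """Generate OpenCL-like source text for a synthetic stencil customising
    function, parameterised by border shape (north, south, east, west),
    element data type, and a boolean complexity flag that switches between
    lightweight and compute-intensive kernels."""
    random.seed(seed)
    north, south, east, west = shape
    n_terms = 32 if complex_kernel else 4
    lines = [f"{data_type} acc = 0;"]
    for _ in range(n_terms):
        dy = random.randint(-north, south)
        dx = random.randint(-west, east)
        lines.append(f"acc += {random.random():.3f}f * get({dy}, {dx});")
    lines.append("return acc;")
    return "\n".join(lines)

# Example: a lightweight kernel with a one-cell border in every direction.
print(make_synthetic_stencil((1, 1, 1, 1)))
```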
Using a large number of synthetic benchmarks helps address the “small n, large
P ” problem, which describes the difficulty of statistical inference in spaces for
which the set of possible hypotheses P is significantly larger than the number of
observations n. By creating parameterised, synthetic benchmarks, it is possible to
explore a much larger set of the space of possible stencil codes than if relying solely
on reference applications, reducing the risk of overfitting to particular program
features.
Table 5.1: OmniTune SkelCL Stencil features for dataset, kernel, and device.
• Kernel — The user code for a stencil is passed to the OmniTune server,
which compiles the OpenCL kernel to LLVM IR bitcode. The opt InstCount
statistics pass is used to obtain static counts for each type of instruction
present in the kernel, as well as the total number of instructions. The
instruction counts for each type are divided by the total number of instructions to produce a density of instructions of that type. Examples include the total static instruction count, the ratio of instructions per type, the ratio of basic blocks per instruction, etc. (a sketch of this normalisation follows the list below).
• Dataset — The SkelCL container type is used to extract the input and
output data types, and the 2D grid size.
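For example, the instruction-density features described in the Kernel bullet above could be derived along these lines. This is a sketch of the normalisation step only; the actual extractor parses the output of the opt InstCount pass.

```python
def instruction_densities(instcounts):
    """Convert raw InstCount totals into per-type density features.

    `instcounts` maps an instruction type (e.g. "load", "store", "br") to
    its static count. Not the actual OmniTune feature extractor."""
    total = sum(instcounts.values())
    features = {"instruction_count": total}
    for name, count in instcounts.items():
        features["density_" + name] = count / total if total else 0.0
    return features

# Example: a toy kernel with 10 loads, 4 stores, and 2 branches.
features = instruction_densities({"load": 10, "store": 4, "br": 2})
assert abs(features["density_load"] - 10 / 16) < 1e-9
```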
5.4.1 Constraints
Unlike in many autotuning applications, the space of optimisation parameter
values is subject to hard constraints, and these constraints cannot conveniently be statically determined. Contributing factors are architectural limitations, kernel
constraints, and refused parameters.
Architectural constraints
Each OpenCL device imposes a maximum workgroup size which can be stati-
cally checked by querying the clGetDeviceInfo() API for that device. These are defined by architectural limitations of how code is mapped to the underlying
execution hardware. Typical values are powers of two, e.g. 1024, 4096, 8192.
Kernel constraints
Refused parameters
In addition to satisfying the constraints of the device and kernel, not all points
in the workgroup size optimisation space are guaranteed to provide working pro-
grams. A refused parameter is a workgroup size which satisfies the kernel and
architectural constraints, yet causes a CL_OUT_OF_RESOURCES error to be thrown
when the kernel is enqueued. Note that in many OpenCL implementations, this
error type acts as a generic placeholder and may not necessarily indicate that the
underlying cause of the error was due to finite resource constraints.
Legality
We define a legal workgroup size as one which, for a given scenario (a combination
of program, device, and dataset), satisfies the architectural and kernel constraints,
and is not refused. The subset of all possible workgroup sizes W_{legal}(s) ⊂ W that are legal for a given scenario s is then:

W_{legal}(s) = \left\{ w \mid w \in W,\; w < W_{max}(s) \right\} - W_{refused}(s) \qquad (5.1)

where W_{max}(s) can be determined at runtime prior to the kernel's execution, but the set W_{refused}(s) can only be determined experimentally.
The oracle workgroup size Ω(s) ∈ W_{legal}(s) of a scenario s is the w value which provides the lowest mean runtime. This allows the performance p(s, w) of a particular workgroup size to be compared against the maximum available performance for that scenario, within the range 0 ≤ p(s, w) ≤ 1:
Establishing a Baseline
The baseline workgroup size w̄ is the value which provides the best average case performance across all scenarios. Such a baseline value represents the best possible performance which can be achieved using a single, statically chosen workgroup size. By defining W_{safe} ⊂ W as the intersection of the legal workgroup sizes of all scenarios, the baseline can be found using:

W_{safe} = \bigcap \left\{ W_{legal}(s) \mid s \in S \right\} \qquad (5.6)

\bar{w} = \underset{w \in W_{safe}}{\arg\max}\; \bar{p}(w) \qquad (5.7)
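A sketch of how the baseline could be computed from observed data, assuming (as in the evaluation chapter) that p̄(w) is the geometric mean of p(s, w) over all scenarios. The data structures here are hypothetical.

```python
import math
from functools import reduce

def baseline_workgroup(legal, performance):
    """Find the baseline workgroup size (equations 5.6 and 5.7).

    `legal` maps each scenario to its set of legal workgroup sizes;
    `performance` maps (scenario, wgsize) to relative performance p(s, w).
    W_safe is the intersection of legal sizes across all scenarios, and
    the baseline maximises the geometric mean of p(s, w) over scenarios."""
    scenarios = list(legal)
    w_safe = reduce(set.intersection, (set(legal[s]) for s in scenarios))

    def geomean_performance(w):
        logs = [math.log(performance[(s, w)]) for s in scenarios]
        return math.exp(sum(logs) / len(logs))

    return max(w_safe, key=geomean_performance)

# Example: two scenarios, two candidate workgroup sizes.
legal = {"s1": {(4, 4), (32, 4)}, "s2": {(4, 4)}}
perf = {("s1", (4, 4)): 0.2, ("s2", (4, 4)): 0.3, ("s1", (32, 4)): 0.9}
assert baseline_workgroup(legal, perf) == (4, 4)
```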
During testing, the classifier predicts workgroup sizes from the set of oracle workgroup sizes observed in the training set. This approach presents the problem that, after training, there is no guarantee that the set of workgroup sizes which may be predicted is within the set of legal workgroup sizes for future scenarios.
This may result in a classifier predicting a workgroup size which is not legal for a scenario, w ∉ W_{legal}(s), either because it exceeds W_{max}(s), or because the parameter is refused. For these cases, I evaluate the effectiveness of three fallback strategies to select a legal workgroup size:
1. Baseline — select the workgroup size which is known to be safe, w ∈ W_{safe}, and provides the highest average case performance on training data.
3. Nearest Neighbour — select the workgroup size which, from prior observations, is known to be legal and has the lowest Euclidean distance to the prediction.
Note that since we cannot know in advance which workgroup sizes will be refused, that is, W_{refused}(s) cannot be statically determined, this process must be iterated until a workgroup size which is not refused is selected. Algorithm 2 shows this process.
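A sketch of that iterative selection process (not a transcription of Algorithm 2): the legality check and the kernel launch are passed in as functions, since only the runtime can tell us whether a size will be refused.

```python
class KernelRefusedError(Exception):
    """Raised when a workgroup size satisfies the static constraints but is
    still refused by the runtime (e.g. CL_OUT_OF_RESOURCES)."""

def select_workgroup_size(predicted, is_legal, enqueue, fallback):
    """Iteratively select a workgroup size that is legal and not refused.

    `predicted` is the machine-learned suggestion; `is_legal` checks the
    statically-known constraints (e.g. W_max(s)); `enqueue` attempts to run
    the kernel, raising KernelRefusedError if the size is refused; `fallback`
    maps a rejected size to the next candidate (e.g. the baseline or the
    nearest known-legal neighbour)."""
    wgsize = predicted
    while True:
        if not is_legal(wgsize):
            wgsize = fallback(wgsize)
            continue
        try:
            enqueue(wgsize)
            return wgsize
        except KernelRefusedError:
            wgsize = fallback(wgsize)
```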
5.6 Implementation
The OmniTune framework is implemented as a set of Python classes and interfaces, which are inherited from or implemented to target specific autotuning cases. The entire framework consists of 8987 lines of Python code, of which 976 are dedicated to the SkelCL frontend. By design, the client-server model minimises the number of modifications required to enable autotuning in client applications. The only modification required is to replace the hardcoded values for workgroup size with a subroutine which requests a workgroup size from the OmniTune server over a DBUS connection. The server is implemented as a standalone Python program, and uses SQLite to maintain local data caches. The OmniTune remote is an Amazon Web Services virtual machine instance, using
MySQL as the relational data store. Figure 5.1 shows the relational database
schema used to store stencil runtime information. Additional tables store data in
coalesced forms for use as machine learning training data.
For classification, five classifiers are supported, chosen for their contrasting
properties: Naive Bayes, SMO, Logistic Regression, J48 Decision tree, and Ran-
dom Forest. For regression, a Random Forest with regression trees is used, chosen
because of its efficient handling of large feature sets, compared to linear models.
5.7 Summary
This section has described the application of OmniTune for predicting workgroup sizes of SkelCL stencil programs, using three different machine learning approaches. The first approach is to predict the optimal workgroup size for a given program based on training data which includes the optimal workgroup sizes for other stencils. The second approach is to select a workgroup size by sweeping the space of possible workgroup sizes, predicting the runtime of the program with each. The third approach is to select a workgroup size by sweeping the space of possible workgroup sizes, predicting the relative gain of each compared to a known baseline. The next chapter describes the process for collecting empirical performance data.
Figure 5.1: Database schema for storing SkelCL stencil runtimes. Feature lookup
tables and normalisation are used to provide extremely compact storage, requiring
only 56 bytes for each additional runtime of a known stencil program.
Chapter 6
Exploring the Workgroup Size Space
6.1 Introduction
This chapter describes an exhaustive enumeration of the workgroup size opti-
misation space for 429 combinations of architecture, program, and dataset. It
contains the methodology used to collect empirical performance data on which
to base performance comparisons of different workgroup sizes, and the steps nec-
essary to obtain repeatable results.
to the task scheduler. Datasets and programs were stored in an in-memory file
system.
6.2.1 Devices
Table 6.2 describes the OpenCL devices used for testing, as available on the
experimental platforms.
Table 6.3: Stencil kernels, border sizes (north, south, east, and west), and static
instruction counts.
Table 6.3 shows details of the stencil kernels for these reference applications, and
the synthetic training benchmarks used.
6.2.3 Datasets
For each benchmark, multiple dataset sizes were used, as shown in table 6.4.
Table 6.5: Execution phases of a SkelCL stencil skeleton. “Fixed” costs are those
which occur up to once per stencil invocation. “Iterative” costs are those which scale
with the number of iterations of a stencil.
• Kernel execution times: the time elapsed executing the stencil kernel, representative of the "work done".
For each of the six distinct phases of execution, accurate runtime information
can be gathered either through timers embedded in the host code, or using the
OpenCL clGetEventProfilingInfo() API for operations on the execution de-
vices. For single-device stencils, the total time t of a SkelCL stencil application
is simply the sum of all times recorded for each distinct phase:
Note that there are no synchronisation costs s. For applications with n execution devices, the runtime can be approximated as the sum of the sequential host-side phases, plus the sum of the device-side phases divided by the number of devices:
t \approx \sum_{i=1}^{n} \overline{c_i} + \overline{p} + \overline{s} + \frac{1}{n} \sum_{i=1}^{n} \left( \overline{u_i} + \overline{k_i} + \overline{d_i} \right) \qquad (6.2)
Gold standard output was recorded by executing each of the real-world benchmark programs using the baseline workgroup size. The output of the real-world
benchmarks with other workgroup sizes was compared to this gold standard out-
put to guarantee correct program execution.
6.3 Summary
This chapter has described the methodology for collecting relative performance data for SkelCL stencil benchmarks under different combinations of architecture, program, dataset, and workgroup size. The next chapter evaluates these performance results, and analyses the performance of OmniTune at predicting workgroup sizes.
Chapter 7
Evaluation
7.1 Introduction
This chapter evaluates the performance of OmniTune when tasked with selecting
workgroup sizes for SkelCL stencil codes. First I discuss the measurement noise present in the experimental results, and the methods used to account for it.
Then I examine the observed effect that workgroup size has on the performance of
SkelCL stencils. The effectiveness of each of the autotuning techniques described
in the previous chapters is evaluated using multiple different machine learning
algorithms. The prediction quality of OmniTune is scrutinised for portability
across programs, devices, and datasets.
The experimental results consist of measured runtimes for a set of test cases,
collected using the methodology explained in the previous chapter. Each test case
τi consists of a scenario, workgroup size pair τi = (si , wi ), and is associated with a
sample of observed runtimes from multiple runs of the program. A total of 269813
test cases were evaluated, which represents an exhaustive enumeration of the
workgroup size optimisation space for 429 scenarios. For each scenario, runtimes
for an average of 629 (max 7260) unique workgroup sizes were measured. The
average sample size of runtimes for each test case is 83 (min 33, total 16917118).
Figure 7.1: Distribution of runtime samples for test cases from three devices. Each
plot contains a 35-bin histogram of 1000 samples, and a fitted kernel density estimate
with bandwidth 0.3. The sample mean is shown as a vertical dashed line. The top row is from the Intel i5-4570, the second row from the Nvidia GTX 590, and the third
row from the AMD Tahiti 7970. In some of the plots, the distribution of runtimes is
bimodal, and skewed to the lower end of the runtimes range.
The complex interaction between processes competing for the finite resources of
a system introduces many sources for noise in program runtime measurements.
Before making any judgements about the relative performance of optimisation
configurations, we must establish the level of noise present in these measurements.
To do this, we evaluate the distribution of runtimes for a randomly selected 1000
test cases, recording 1000 runtime observations for each. We can then produce
fine-grained histograms of runtimes for individual test cases. Figure 7.1 shows nine examples, for test cases from three devices. The plots show that the
distribution of runtimes is not always Gaussian; rather, it is sometimes bimodal,
and generally skewed to the lower end of the runtime range, with a long “tail” to
the right. This fits our intuition that programs have a hard minimum runtime
enforced by the time taken to execute the instructions of a program, and that
noise introduced to the system extends this runtime. For example, preempting
an OpenCL process on a CPU so that another process may run may cause the
very long tail visible in Figure 7.1a.
The central limit theorem allows the assumption of an underlying Gaussian
distribution for samples of size ≥ 30 [GBE07]. Given our minimum sample size
of 33, we can use 95% confidence intervals to bound the arithmetic mean of observed runtimes with respect to the true mean. As the number of samples increases, we should expect the size of the confidence
interval to shrink. This is illustrated in Figure 7.2, which plots the average
size of 95% confidence intervals across the 1000 test cases, normalised to their
respective means, as a function of sample size. It shows the diminishing returns
that increasing sample size provides. For example, increasing the sample count
from 10 to 30 results in an approximate 50% reduction in confidence interval size.
Increasing the sample size from 30 to 50 results in only a 25% reduction.
By comparing the average confidence interval at different sample counts against
the full experiment results of 269813 test cases, we can assert with 95% confi-
dence that the true mean for each test case is within 2.5% of the sample mean
(given the average number of samples per test case), or 3.7% in the worst case
(at the minimum number of samples). Since the differences between baseline and
optimal workgroup sizes is often well in excess of 100%, there is no overlap of
confidence intervals between competing workgroup sizes.
Figure 7.2: Average size of the 95% confidence interval, normalised to the mean, as a function of the number of samples.
Figure 7.3: Accuracy compared to the oracle as a function of the number of workgroup
sizes used. The best accuracy that can be achieved using a single statically chosen
workgroup size is 15%. Achieving 50% oracle accuracy requires a minimum of 14
distinct workgroup sizes.
group size. This demonstrates the difficulty of attempting to tune for optimal parameter values, since 14 distinct workgroup sizes are needed to achieve just 50% of the oracle accuracy (Figure 7.3), although it is important to make the distinction that oracle accuracy and performance are not equivalent.
Figure 7.4a shows the distribution of oracle workgroup sizes, demonstrating
that there is clearly no “silver bullet” workgroup size which is optimal for all
scenarios, and that the space of oracle workgroup sizes is non-linear and complex.
The workgroup size which is most frequently optimal is w(64×4) , which is optimal
for 15% of scenarios. Note that this is not adequate to use as a baseline for static
tuning, as it does not respect legality constraints, that is w(64×4) ∉ Wsafe.
Refused Parameters
Figure 7.4: Log frequency counts for: (a) optimality, and (b) legality for a subset of
the aggregated workgroup size optimisation space, wc ≤ 100, wr ≤ 100. The space of
oracle workgroup size frequencies is highly irregular and uneven, with a peak frequency
of w(64×4) . Legality frequencies are highest for smaller row and column counts (where
w < Wmax (s)∀s ∈ S), and wc and wr values which are multiples of 8.
Figure: Coverage (ratio) across the workgroup size space, for wc ≤ 100, wr ≤ 100.
Table 7.1: The thirty most refused parameters, ranked in descending order. There is
little correlation between the size of workgroup and the likelihood that it is refused,
suggesting that the cause of refused parameters is not a resource constraint, but a
behavioural issue.
Figure 7.6: The ratio of test cases with refused workgroup sizes, grouped by: (a)
OpenCL device ID; (b) device vendor ID. Parameters were refused most frequently
by Intel i5 CPUs, then by previous-generation NVIDIA GPUs. No parameters were
refused by AMD devices.
by the OpenCL runtime and do not provide a functioning program. Of the 8504
unique workgroup sizes tested, 11.4% were refused in one or more test cases. An
average of 5.5% of all test cases led to refused parameters. For a workgroup size
to be refused, it must satisfy the architectural and program-specific constraints
which are exposed by OpenCL, but still lead to a CL_OUT_OF_RESOURCES error
when the kernel is enqueued. Table 7.1 lists the most frequently refused parame-
ters, and the percentage of test cases for which they were refused. While uncom-
mon, a refused parameter is an obvious inconvenience to the user, as one would
expect that any workgroup size within the specified maximum should behave
correctly, if not efficiently. Figure 7.4b visualises the space of legal workgroup sizes by showing how frequently each workgroup size is legal. Smaller
workgroup sizes are legal most frequently due to the Wmax (s) constraints. Be-
yond that, workgroup sizes which contain wc and wr values which are multiples
of eight are more frequently legal, which is a common width of SIMD vector
operations [Int12].
Experimental results suggest that the problem is vendor — or at least device
— specific. By grouping the refused test cases by device and vendor, we see a
much greater quantity of refused parameters for test cases on Intel CPU devices
than any other type, while no workgroup sizes were refused by the AMD GPU.
Figure 7.6 shows these groupings. The exact underlying cause for these refused
parameters is unknown, but can likely be explained by inconsistencies or errors
in specific OpenCL driver implementations.
As these OpenCL implementations are still in active development, it is antici-
pated that errors caused by unexpected behaviour will become more infrequent as
the technology matures. Figure 7.6a shows that the ratio of refused parameters
decreases across the three generations of Nvidia GPUs: GTX 590 (2011), GTX
690 (2012), and GTX TITAN (2013). The same trend is apparent for the two In-
tel i5s: i5-2430M (2011), and i5-4570 (2013), although not for the i7-3820 (2012).
For now, it is imperative that any autotuning system is capable of adapting to
these refused parameters by suggesting alternatives when they occur.
7.3.3 Baseline Parameter
The baseline parameter w̄ is the workgroup size which provides the best overall
performance while being legal for all scenarios. It is the workgroup size w ∈ Wsaf e
which maximises the output of the performance function p̄(w). As shown in Ta-
ble 7.2, only a single workgroup size w(4×4) from the set of experimental results is
found to have a legality of 100%, suggesting that an adaptive approach to setting
workgroup size is necessary not just for the sake of maximising performance, but
also for guaranteeing program correctness.
The utility of the baseline parameter is that it represents the best performance
that can be achieved through static tuning of the workgroup size parameter. We
can evaluate the performance of suboptimal workgroup sizes by calculating the
geometric mean of their performance for a particular scenario p(s, w) across all
scenarios, p̄(w). The baseline parameter p̄(w̄) achieves only 24% of the available
performance. Figure 7.7 plots workgroup size legality and performance, showing
that there is no clear correlation between the two. In fact, the workgroup sizes
with the highest mean performance are valid only for scenarios with the largest
Wmax (s) value, which account for less than 1% of all scenarios, further reinforcing
the case for adaptive tuning. The workgroup sizes with the highest legality are
listed in Table 7.2, and the workgroup sizes with the highest performance are
listed in Table 7.3.
Figure 7.7: Average legality and performance relative to the oracle of all workgroup sizes (Pearson's r = 0.15, p = 1.1e-46). Clearly, the relationship between legality and performance is not linear. Distinct vertical "bands" appear between regions of legality caused by the different Wmax(s) values of devices. The irregular jitter between these vertical bands is caused by refused parameters.
Figure 7.8 shows the speedup of the oracle workgroup size over the baseline
parameter w(4×4) for all scenarios. If we assume that a sufficiently pragmatic developer with enough time would eventually find this static optimum, then this provides a reasonable comparison for calculating speedups of an autotuner for
workgroup size. Comparing the runtime of workgroup sizes relative to the oracle
allows us to calculate upper bounds on the possible performance which can be
expected from autotuning.
For a given scenario s, the ratio of the workgroup sizes from Wlegal(s) which provide the longest and shortest mean runtimes is used to calculate an upper bound on the speedup attainable for that scenario.

Table 7.2: The 25 workgroup sizes with the greatest legality.

Table 7.3: The 25 workgroup sizes with the greatest mean performance.
Figure 7.8: Speedup of oracle workgroup size over: the worst performing workgroup
size for each scenario (Max ), the statically chosen workgroup size that provides
the best overall performance (w(4×4) ), and the human expert selected parameter
(w(32×4) ). Note that the human expert parameter is not legal for all scenarios.
7.3.6 Heuristics
In this section we consider the effectiveness of instead selecting workgroup sizes using two types of heuristic. The first uses the maximum workgroup size returned by the OpenCL device and kernel APIs to select the workgroup size adaptively. The second uses per-device heuristics, in which the workgroup size is selected based on the specific architecture that a stencil is executing on.
A common approach taken by OpenCL developers is to set the workgroup size for a kernel based on the maximum legal workgroup size queried from the OpenCL APIs. For example, to set the size of a 2D workgroup, a developer takes the square root of the (scalar) maximum workgroup size to set the number of columns and rows (since wc · wr must be < Wmax(s)). To consider the effectiveness of this approach, we group
the workgroup size performances based on the ratio of the maximum allowed for
each scenario. We can also perform this for each of the two dimensions — rows
and columns — of the stencil workgroup size.
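A minimal sketch of this heuristic is given below, using pyopencl for illustration (an assumption; SkelCL itself uses the C++ OpenCL API):

import math
import pyopencl as cl

def square_workgroup_size(kernel, device):
    # Query the maximum workgroup size for this kernel on this device.
    wmax = kernel.get_work_group_info(
        cl.kernel_work_group_info.WORK_GROUP_SIZE, device)
    # Take the square root so that columns * rows does not exceed the maximum.
    side = int(math.floor(math.sqrt(wmax)))
    return (side, side)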
Figure 7.9 shows the distribution of runtimes when grouped this way, demon-
strating that the performance of (legal) workgroup sizes is not correlated with
the maximum workgroup sizes Wmax (s). However, when considering individual
components, we observe that the best median workgroup size performances are
achieved with a number of columns that is between 10% and 20% of the maxi-
mum, and a number of rows that is between 0% and 10% of the maximum.
Table 7.4: Selecting workgroup size using a per-device heuristic. The mode optimal
workgroup size for each device type w̄ is evaluated based on legality, and relative
performance to the oracle (minimum and average) when legal.
lowest median performance, indicating that it is the most profitable to tune, while
the threshold kernel is the least profitable. Similarly, the Intel i7-3820 is far less
amenable to tuning than the other devices, while the Intel i5-4570 is the most
sensitive to the workgroup size parameter. Such variances in the distributions of
workgroup sizes suggest that properties underlying the architecture, kernel, and
dataset all contribute towards the proper selection of workgroup size.
To test the performance of a per-device heuristic for selecting workgroup size,
we group the scenarios by device, and compare the relative performance of all
workgroup sizes for each group of scenarios. The most frequently optimal work-
group size w̄ for each device is selected, and the legality and performance of each
scenario using that workgroup size is evaluated. Table 7.4 shows the results of
this evaluation. The GTX 690 and GTX TITAN share the same w̄(64×4) value,
while every other device has a unique optimum. The general case performance of
these per-device parameters is very good, although legality is only assured for the
GTX TITAN and AMD 7970 (which did not refuse any parameters). However,
the worst case performance of per-device workgroup sizes is poor for all except
the i7-3820 (which is least sensitive to tuning), suggesting that device alone is
not capable of reliably informing the choice of workgroup size.
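A sketch of how such a per-device heuristic can be derived from the experimental data, assuming a simple list of (device, oracle workgroup size) pairs (the data layout is illustrative, not OmniTune's):

from collections import Counter, defaultdict

def per_device_heuristics(observations):
    """observations: iterable of (device, oracle_workgroup_size) pairs."""
    counts = defaultdict(Counter)
    for device, oracle in observations:
        counts[device][oracle] += 1
    # The mode oracle workgroup size for each device.
    return {device: c.most_common(1)[0][0] for device, c in counts.items()}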
7.3.7 Summary
In this section we have explored the performance impact of the workgroup size
optimisation space. By comparing the relative performance of an average of 629
workgroup sizes for each of 429 scenarios, the following conclusions can be drawn:
2. The performance gap between the best and worst workgroup sizes for a particular
combination of hardware, software, and dataset is up to 207.72×.
3. Not all workgroup sizes are legal, and the space of legal workgroup sizes can-
not statically be determined. Adaptive tuning of workgroup size is required
to ensure reliable performance.
4. Differing scenarios have wildly different optimal workgroup sizes, and the
workgroup size which achieves the best performance under static tuning is
optimal for only 15% of scenarios.
In the following section, we will evaluate the performance of OmniTune for se-
lecting workgroup sizes.
Autotuning performance is evaluated using five types of validation sets:
• 10-fold — shuffle the set of all data and divide into 10 validation sets, each
containing 10% of the data. This process is repeated for 10 rounds, resulting
in 100 validations of 10 permutations of the data.
[Figure 7.9 box plots, panels (a)–(c): performance relative to the oracle (y-axes) against workgroup size, workgroup columns, and workgroup rows, each as a percentage of Wmax (s) (x-axes).]
Figure 7.9: Comparing workgroup performance relative to the oracle as a function of:
(a) maximum legal size, (b) number of columns, and (c) number of rows. Each
workgroup size is normalised to the maximum allowed for that scenario, Wmax (s).
[Figure 7.10 box plots: performance relative to the oracle, grouped by kernel (gaussian, gol, he, nms, synthetic, threshold), by device (Intel i5-2430M, Intel i5-4570, Intel i7-3820, Nvidia GTX 690), and by dataset (512.512.float.float, 1024.1024.float.float, 2048.2048.float.float, 4096.4096.float.float, 4096.4096.int.int).]
Figure 7.10: Performance relative to the oracle of workgroup sizes across all test cases, grouped by: (a) kernels, (b) devices, and (c) datasets.
• Synthetic — divide the training data such that it consists solely of data
collected from synthetic benchmarks, and use data collected from real-world
benchmarks to test.
• Device — partition the training data into n sets, one for each device. Use
n − 1 sets for training, repeating until every partition has been used for
testing once.
• Kernel — partition the training data into n sets, one for each kernel. Use
n − 1 sets for training, repeating until every partition has been used for
testing once.
• Dataset — partition the training data into n sets, one for each type of
dataset. Use n − 1 sets for training, repeating until every partition has
been used for testing once. A sketch of this grouped partitioning follows the list.
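The device, kernel, and dataset partitioning described above is a leave-one-group-out scheme. A rough sketch using scikit-learn, purely for illustration (the evaluation itself is built on Weka, so this is an assumption, not the actual implementation):

from sklearn.model_selection import LeaveOneGroupOut

def grouped_splits(features, labels, groups):
    """Yield (train, test) index pairs, holding out one group per round,
    where a group is e.g. a device, a kernel, or a dataset."""
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(features, labels, groups=groups):
        yield train_idx, test_idx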
For each autotuning technique, the results of testing using the different validation
sets are reported separately. The autotuning techniques evaluated are: using clas-
sifiers to predict the optimal workgroup size of a stencil, with fallback strategies
to handle illegal predictions; using regressors to predict the runtime of a stencil
using different workgroup sizes, and selecting the legal workgroup size which has
the lowest predicted runtime; and using regressors to predict the relative perfor-
mance of workgroup sizes over a baseline, and selecting the workgroup size which
has the highest predicted relative performance. We first describe the evaluation
strategies for each technique, before presenting experimental results and analysis.
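The regression-based techniques reduce to scoring every legal workgroup size and taking the best. A minimal sketch, with the regressor interface and feature encoding assumed for illustration:

def select_workgroup_size(regressor, scenario_features, legal_wgsizes):
    # Predict a value (runtime or relative performance) for every legal size.
    predictions = {
        w: regressor.predict([list(scenario_features) + list(w)])[0]
        for w in legal_wgsizes
    }
    # For runtime regression take the minimum; for speedup regression, the maximum.
    return min(predictions, key=predictions.get)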
7.4.1 Evaluating Classifiers
Classifiers are evaluated in combination with the fallback strategies by which they handle illegal predictions. Illegal predictions occur either because the
classifier has suggested a parameter which does not satisfy the maximum work-
group size constraints w < Wmax (s), or because the workgroup size is refused by
OpenCL w ∈ Wref used (s). Workgroup sizes are predicted for each scenario in the
testing set, and the quality of the predicted workgroup size is evaluated using the
following metrics:
• accuracy (binary) — whether the predicted workgroup size is the true ora-
cle, p(f (s)) = Ω(s).
• time (real) — the round trip time required to make the prediction, as mea-
sured by the OmniTune client. This includes classification time and inter-
process communication overheads between the client and server.
The validity and refused metrics measure how often fallback strategies are required
to select a legal workgroup size w ∈ Wlegal (s).
7.4.2 Evaluating Regressors
For predicting speedups, the feature vectors are labelled with the observed speedup
over the baseline parameter w̄ (see Section 7.3.3) for all legal workgroup sizes:
D_{\text{training}} = \bigcup \left\{ (f(s), r(s, w, \bar{w})) \mid w \in W_{\text{legal}}(s) \right\} \quad \forall s \in S_{\text{training}} \tag{7.5}
The quality of predicted workgroup sizes is evaluated using the following metrics:
• accuracy (binary) — whether the predicted workgroup size is the true ora-
cle, p(f (s)) = Ω(s).
• time (real) — the round trip time required to make the prediction, as mea-
sured by the OmniTune client. This includes classification time and inter-
process communication overheads between the client and server.
Unlike with classifiers, the process of selecting workgroup sizes using regressors
is resistant to refused parameters, so no fallback strategies are required, and the
validity and refused metrics are not used.
7.4.3 Results and Analysis
With the exception of ZeroR, which predicts only the baseline workgroup
size w(4×4), the classifiers achieve good speedups over the baseline. Average
classification speedups across all validation sets range between 4.61× and 5.05×.
Figures 7.11 and 7.12 show a summary of results using 10-fold cross-validation
and cross-device validation, respectively. The highest average speedup is achieved
by SMO, and the lowest by Naive Bayes. The difference between average speedups
is not significant between the types of classifier, with the exception of SimpleLo-
gistic, which performs poorly when trained with synthetic benchmarks and tested
against real-world programs. This suggests the model is over-fitting to features of
the synthetic benchmarks which are not shared by the real-world tests.
By isolating the test cases where classifiers predicted an illegal or refused
parameter, we can directly compare the relative effectiveness of each fallback
handler. The fallback handler with the best average case performance is Near-
estNeighbour, with an average speedup across all classifiers and validation
sets of 5.26×. The speedup of the Random fallback handler is 3.69×, and 1.0×
for Baseline. Figure 7.14 plots the speedups of each fallback handler for all of
these isolated test cases. Interestingly, both the lowest and highest speedups are
achieved by the Random fallback handler, since it essentially performs a random
exploration of the optimisation space. However, the NearestNeighbour fall-
back handler provides consistently greater speedups for the majority of test cases,
indicating that it successfully exploits the structure of the optimisation spaces.
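A minimal sketch of a NearestNeighbour fallback of this kind, using Euclidean distance in the (columns, rows) space (the exact distance measure is an assumption):

import math

def nearest_neighbour_fallback(predicted, legal_wgsizes):
    """predicted and each legal workgroup size are (columns, rows) pairs."""
    return min(legal_wgsizes,
               key=lambda w: math.hypot(w[0] - predicted[0], w[1] - predicted[1]))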
Figures 7.13a and 7.13b show a summary of results for classification using
regressors to predict program runtimes and speedups, respectively. Of the two
regression techniques, predicting the speedup of workgroup sizes is much more suc-
cessful than predicting the runtime. This is most likely caused by the inherent
difficulty in predicting the runtime of arbitrary programs, where dynamic factors
such as flow control and loop bounds are not captured by the kernel features
used in OmniTune, which are instead simple static instruction counts and
densities. The average speedup achieved by predicting runtimes is 4.14×. For
predicting speedups, the average is 5.57×. Tables 7.5, 7.6, and 7.7 show mean per-
formances and speedups for: the J48 classifier using the NearestNeighbour fallback
strategy, classification using runtime regression, and classification using speedup
regression, respectively.
If we eliminate the 2.6% of test cases for which the workgroup size of w(32×4)
is illegal, we can compare the performance of OmniTune directly against the
human expert chosen workgroup size.
[Figure 7.11 bar charts: per-classifier (ZeroR, NaiveBayes, SMO, SimpleLogistic, J48, RandomForest) classification time (ms); ratio of illegal, refused, and accurate predictions; and speedup and performance under the Baseline, Random, and NearestNeighbour fallback handlers.]
Figure 7.11: Classification results for synthetic benchmarks. Each classifier is trained
on data from synthetic stencils, and tested for prediction quality using data from 6
real world benchmarks.
[Figure 7.12 bar charts: per-classifier classification time (ms); ratio of illegal, refused, and accurate predictions; and speedup and performance under the Baseline, Random, and NearestNeighbour fallback handlers.]
[Figure 7.13, panels (a) and (b): per-device results.]
Figure 7.13: Evaluating the effectiveness of classification using regressors, by predicting: (a) the workgroup size with the minimal runtime, and (b) the workgroup size with the highest predicted relative performance.
[Figure 7.14 plot: speedup over the baseline parameter (log scale) for the Baseline, Random, and NearestNeighbour fallback handlers, across all test instances where p(f(s)) ∉ Wlegal (s).]
Figure 7.14: Comparison of fallback handlers, showing the speedup over baseline
parameter for all test cases where a classifier predicted an illegal workgroup size.
Figure 7.15 compares the speedups of all
such validation instances over the human expert parameter, for each autotuning
technique. The speedup distributions show consistent classification results for
the five classification techniques, with the speedup at the lower quartile for all
classifiers being ≥ 1.0×. The IQR for all classifiers is < 0.5, but there are out-
liers with speedups both well below 1.0× and well above 2.0×. In contrast, the
speedups achieved using runtime regression have a lower range, but also a lower
median and a larger IQR. Clearly, runtime regression is the least effective of the
evaluated autotuning techniques. Speedup regression is more successful, with the
highest median speedup of all the techniques. However, it also has a large IQR
and the lower quartile has a speedup value well below 1, meaning that for more
than 25% of test instances, the workgroup size selected did not perform as well
as the human expert selected workgroup size.
The prediction costs using regression are significantly greater than using clas-
sifiers. This is because, while a classifier makes a single prediction, the number of
predictions required of a regressor grows with the size of Wmax (s), since classifi-
cation with regression requires making predictions for all w ∈ {w | w < Wmax (s)}.
[Figure 7.15 box plots: speedup over human expert (y-axis) for J48, SMO, NaiveBayes, RandomForest, SimpleLogistic, Runtime Regression, and Speedup Regression.]
Figure 7.15: Distributions of speedups over human expert, ignoring cases where hu-
man expert prediction is invalid. Classifiers are using NearestNeighbour fallback
handlers. The speedup axis is fixed to the range 0–2.5 to highlight the IQRs, which
results in some outliers > 2.5 being clipped.
[Figure 7.16, panels (a)–(f): predicted workgroup sizes plotted as rows against columns for each autotuning technique, including J48, NaiveBayes, RandomForest, and SMO.]
The fastest classifier is J48, due to its simplicity (it can be implemented as
a sequence of nested if/else statements).
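To illustrate why evaluation is so cheap, a trained decision tree amounts to a short chain of nested conditionals; the features and thresholds below are invented for illustration and are not taken from the trained models:

def predict_workgroup_size(features):
    # A toy decision tree in the shape of a trained J48 model.
    if features["instruction_count"] <= 1200:
        if features["max_wgsize"] <= 4096:
            return (4, 4)
        return (32, 4)
    if features["border_north"] <= 5:
        return (64, 4)
    return (64, 16)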
Figure 7.16 visualises the classification errors of each of the autotuning tech-
niques. It shows that while the performance of all of the classifiers is comparable,
the distribution of predictions is not. Only the NaiveBayes and RandomForest
classifiers predicted the human expert selected workgroup size of w(32×4) as
frequently as, or more frequently than, it was optimal. The two regression techniques
were the least accurate of all of the autotuning techniques.
7.4.4 Summary
From an evaluation of 17 different autotuning techniques using 5 different types
of validation sets, the following conclusions about autotuning performance can
be drawn:
• In the case of classifiers predicting illegal workgroup sizes, the best fallback
strategy is to select the closest legal workgroup size.
Conclusions
As the trend towards higher core counts and increasing parallelism continues, the
need for high level, accessible abstractions to manage such parallelism will
continue to grow. Autotuning proves to be a valuable aid for achieving these goals, providing
the benefits of low level performance tuning while maintaining ease of use, with-
out burdening developers with optimisation concerns. As the need for autotuned
parallelism rises, the desire for collaborative techniques for sharing performance
data must be met with systems capable of supporting this cross-platform learning.
In this thesis, I have presented my attempt at providing such a system, by
designing a novel framework which has the benefits of fast, “always-on” autotun-
ing, while being able to synchronise data with global repositories of knowledge
which others may contribute to. The framework provides an interface for auto-
tuning which is sufficiently generic to be easily re-purposed to target a range of
optimisation parameters.
To demonstrate the utility of this framework, I implemented a frontend for
predicting the workgroup size of OpenCL kernels for SkelCL stencil codes. This
optimisation space is complex, non-linear, and critical for the performance of
stencil kernels, with up to a 207.72× slowdown if an improper value is picked.
Selecting the correct workgroup size is difficult — requiring a knowledge of the
kernel, dataset, and underlying architecture. The challenge is increased further
by inconsistencies in the underlying system which cause some workgroup
sizes to fail completely. Of the 269813 combinations of workgroup size, device,
program, and dataset tested, only a single workgroup size was valid for all test
cases, and achieved only 24% of the available performance. The value selected
by human experts was invalid for 2.6% of test cases. Autotuning in this space
requires a system which is resilient to these challenges, and several techniques were
implemented to address them.
Runtime performance of autotuned stencil kernels is very promising, achieving
an average 90% of the available performance with only a 3ms autotuning over-
head. Even ignoring the cases for which the human expert selected workgroup
size is invalid, this provides a 1.33× speedup, or a 5.57× speedup over the best
performance that can be achieved using static tuning. Classification performance
is comparable when predicting workgroup sizes for both unseen programs and
unseen devices. I believe that the combination of performance improvements and
the collaborative nature of OmniTune makes a compelling case for the use
of autotuning as a key component for enabling performant, high level parallel
programming.
OmniTune Framework
The purpose of the OmniTune framework is to provide a generic interface for run-
time autotuning. This is demonstrated through the implementation of a SkelCL
frontend; however, to truly evaluate the ease of use of this framework, it would
have been preferable to implement one or more additional autotuning frontends,
to target different optimisation spaces. This could expose any leakages in the
abstractions between the SkelCL-specific and generic autotuning components.
Synthetic Benchmarks
The OmniTune SkelCL frontend provides a template substitution engine for gen-
erating synthetic stencil benchmarks. The implementation of this generator is
rigidly tied to the SkelCL stencil format. It would be preferable if this template
engine were made more flexible, to support the generation of arbitrary test programs.
Additionally, due to time constraints, I did not have the opportunity to explore
how the number of synthetic benchmarks in machine learning test data sets affects
classification performance.
One possible use of the synthetic stencil benchmark generator could be for
creating minimal test cases of refused OpenCL parameters, so that bug reports
could be filed with the relevant implementation vendor. However, this would have
added a great deal of complexity to the generator, as it would have to isolate
and remove the dependency on SkelCL to generate minimal programs, requiring
significant implementation work.
The evaluation of OmniTune in this thesis uses multiple classifiers and regressors
to predict workgroup sizes. The behaviour of these classifiers and regressors is
provided by the Weka data mining suite. Many of these classifiers have parame-
ters which affect their prediction behaviour. The quality of the evaluation could
have been improved by exploring the effects that changing the values of these
parameters has on the OmniTune classification performance. It would also have
been informative to dedicate a portion of the evaluation to feature engineering,
evaluating the information gain of each feature and exploring the effects of feature
transformations on classification performance.
Evaluation Methodology
The evaluation compares autotuning performance against the best possible per-
formance that can be achieved using static tuning, a simple heuristic to tune
workgroup size on a per-device basis, and against the workgroup size chosen by
human experts. It would have been beneficial to also include a comparison of
the performance of these autotuned stencils against hand-crafted equivalent pro-
grams in pure OpenCL, without using the SkelCL framework. This would allow
a direct comparison between the performance of stencil kernels using high level
and low level abstractions, but could not be completed due to time constraints
and difficulties in acquiring suitable comparison benchmarks and datasets.
reduce the number of runtime samples required to distinguish good from bad
optimisation parameter values.
Algorithm 4 proposes the behaviour of a hybrid approach to selecting the
workgroup size of iterative SkelCL stencils. This approach attempts to exploit
the advantages of all of the techniques presented in this thesis. First, runtime
regression is used to predict the minimum runtime and a candidate workgroup
size. If, after evaluating this workgroup size, the predicted runtime turned out
to be inaccurate, then a prediction is made using speedup regression. Such a
hybrid approach would enable online tuning through the continued acquisition
of runtime and speedup performance, which would complement the collaborative
aspirations of OmniTune, and the existing server-remote infrastructure.
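A rough sketch of this hybrid behaviour, with the model interfaces, the error threshold, and the evaluation step all assumed for illustration:

def hybrid_select(runtime_model, speedup_model, scenario, tolerance=0.5):
    # Runtime regression proposes a candidate and its predicted runtime.
    wgsize, predicted_runtime = runtime_model.best(scenario)
    observed_runtime = scenario.run(wgsize)
    error = abs(observed_runtime - predicted_runtime) / predicted_runtime
    if error > tolerance:
        # The runtime prediction was unreliable; fall back to speedup regression.
        wgsize = speedup_model.best(scenario)
    return wgsize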
Other skeleton optimisation parameters could be autotuned by SkelCL, in-
cluding higher level optimisations such as the selection of border region loading
strategy, or selecting the optimal execution device(s) for multi-device systems.
Optimisation parameters of additional skeletons could be autotuned, or the in-
teraction of multiple related optimisation parameters could be explored. Power
consumption could be used as an additional optimisation co-target.