OpenMP for Java
BACHELOR THESIS
Petr Bělohlávek
Prague 2015
I declare that I carried out this bachelor thesis independently, and only with the
cited sources, literature and other professional sources.
I understand that my work relates to the rights and obligations under the Act
No. 121/2000 Sb., the Copyright Act, as amended, in particular the fact that the
Charles University in Prague has the right to conclude a license agreement on
the use of this work as a school work pursuant to Section 60 subsection 1 of the
Copyright Act.
I would like to express my gratitude to JUDr. Antonín Steinhauser, the supervisor of my thesis, for his guidance and invaluable advice. Furthermore, it would have been impossible for me to finish the thesis without the support, patience and understanding of my family.
Title: OpenMP for Java
Author: Petr Bělohlávek
Department: Department of Distributed and Dependable Systems
Supervisor: JUDr. Antonín Steinhauser, Department of Distributed and Dependable Systems
Contents
1 Introduction 5
1.1 Aim of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 JOMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 JaMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Pyjama . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.4 JPPF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.5 AKKA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Preprocessor Architecture 19
3.1 Level of Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Processing Work-flow . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Code Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Bytecode Analysis . . . . . . . . . . . . . . . . . . . . . . 22
3.3.2 Syntax Analysis . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.3 Class Hierarchy Model . . . . . . . . . . . . . . . . . . . . 23
3.4 Directive Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.1 Directive Grammar . . . . . . . . . . . . . . . . . . . . . . 24
3.4.2 Directive Hierarchy Model . . . . . . . . . . . . . . . . . . 26
3.5 Directive Translation . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5.1 Phase 1: validate . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.2 Phase 2: preTranslate . . . . . . . . . . . . . . . . . . . . . 28
3.5.3 Phase 3: postTranslate . . . . . . . . . . . . . . . . . . . . . 28
3.5.4 Rewriting the Source Code . . . . . . . . . . . . . . . . . . 29
3.6 Runtime Library . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6.1 Executor Interface . . . . . . . . . . . . . . . . . . . . . . 30
3.6.2 Dynamic Executor . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Supported Directives . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7.1 omp parallel . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7.2 omp [parallel] for . . . . . . . . . . . . . . . . . . . . . . . 32
3.7.3 omp section(s) . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7.4 omp master . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.7.5 omp single . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.6 omp critical . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7.7 omp atomic . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7.8 omp barrier . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7.9 public, private, firstprivate . . . . . . . . . . . . . . . . . . 38
3.7.10 threadNum . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7.11 Thread-ID Macros . . . . . . . . . . . . . . . . . . . . . . 39
3.7.12 schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8 Language and Coding Style . . . . . . . . . . . . . . . . . . . . . 41
4 Implementation 42
4.1 Version Control System . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 License . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Behavioral-Driven Development . . . . . . . . . . . . . . . . . . . 43
4.4 Continuous Integration . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Performance Evaluation 45
5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.1 Benchmark Framework . . . . . . . . . . . . . . . . . . . . 45
5.1.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.3 Used Hardware . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1.4 Thread Limitation Techniques . . . . . . . . . . . . . . . . 47
5.1.5 Statistical Notations . . . . . . . . . . . . . . . . . . . . . 47
5.2 Fibonacci Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.1 CPUs Dependence . . . . . . . . . . . . . . . . . . . . . . 49
5.2.2 Workload Dependence . . . . . . . . . . . . . . . . . . . . 50
5.2.3 Speedup Distribution . . . . . . . . . . . . . . . . . . . . . 51
5.2.4 Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Function Maximum . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5 Single & Master Overhead . . . . . . . . . . . . . . . . . . . . . . 57
5.6 Related Projects Comparison . . . . . . . . . . . . . . . . . . . . 58
5.6.1 GCC Comparison . . . . . . . . . . . . . . . . . . . . . . . 59
5.6.2 JOMP Comparison . . . . . . . . . . . . . . . . . . . . . . 60
6 User Documentation 62
6.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Preprocessor Installation . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.1 Compiled Preprocessor . . . . . . . . . . . . . . . . . . . . 63
6.2.2 Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 Installation and Invocation . . . . . . . . . . . . . . . . . . . . . . 65
6.4 Example Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7 Project Website 67
7.1 Website Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.1.1 Front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.1.2 Back-end . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.2 Responsive Appearance . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3 Hosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Conclusion 71
Bibliography 76
List of Figures 77
List of Tables 78
Attachments 79
Appendix C IOMPExecutor 84
Chapter 1.
Introduction
Parallel programming has increased in importance during the last ten years
more than ever before. Apart from common multicore desktop computers and
servers, mobile devices such as smartphones and tablets feature multicore CPUs
as well [1]. Much attention has been paid to multicore solutions since they seem
to be a reasonable way of boosting performance while using as little energy as
possible.
Distributed systems and cloud computing are becoming more popular
and widely used as well. This situation encourages developers to use
concurrent programming techniques in order to increase system performance.
The aim of this thesis is to implement a Java source code preprocessor, which
we denote omp4j (OpenMP for Java). It allows developers to use OpenMP-like
directives for simple parallelization and enhanced scalability of their algorithms.
The original thesis requirements are good portability of the preprocessor and
no runtime dependencies of the translated code. The more detailed requirements
are listed below.
The preprocessor may be used as a Java compiler instead of the regular javac.
This enables easy integration into various IDEs such as IntelliJ IDEA, Eclipse
and NetBeans.
Apart from being used as a standalone program, the preprocessor can be
used as a library that can be employed by third-party software.
The translated code should be easily modifiable, thus it should differ as little
as possible from the original code.
In the following section 1.3, an overview of similar Java projects is provided. Both
OpenMP-like and library-based solutions that allow simple parallelization are
covered.
Chapter 6 contains the user documentation. The ways of obtaining the source
code are listed together with the compilation instructions. All command line
options and their effects are described. Finally, an example of an input code is
attached.
The thesis is accompanied by a digital attachment which includes the whole
project, benchmarks and website. Refer to the Attachments chapter for detailed
information about the structure of the provided data.
Since 1995, when Java was introduced by Sun Microsystems [3], various
parallel frameworks and preprocessors have been developed. Because backward
compatibility has been maintained throughout the decades, the language itself
has become rather old-fashioned and not flexible enough for modern applications.
Hence, the number of libraries and other parallelizing tools that provide an easy
approach to concurrency has been increasing.
1.3.1 JOMP
JOMP¹ is an academic research project that implements the OpenMP API for
Java [4]. It implements a wide range of OpenMP directives including reduction.
The directive comments must start with //omp, with no whitespace allowed
¹ https://www2.epcc.ed.ac.uk/computing/research_activities/jomp
before omp, which is inconvenient for the user. The project is divided into two
separate parts: the preprocessor itself and the runtime library.
The preprocessor takes a Java source file that must have a .jomp
extension [5]. The output is a regular Java source file which may be compiled
with any Java compiler. The main disadvantage of JOMP is that the processed
code depends on the runtime library, which must be installed in the classpath.
The library itself defines the most important class, jomp.runtime.BusyTask [6].
The source processing works as follows. Initially, a new context class that extends
the previously mentioned BusyTask is created in the file scope. The class name is
prefixed with an underscore in order to prevent name conflicts; however, the code
readability decreases.
The class contains public fields for all variables that may be used in the parallel
code. These fields are initialized immediately after the class instance is created.
JOMP does not support Java SE 8. Thus, modern language constructs
such as lambdas cannot be used. In addition, the JOMP parser often fails
when parsing Java source code decorated with annotations such as @Override,
as our experiments showed.
1.3.2 JaMP
JaMP is an OpenMP implementation that is integrated into the Jackal DSM, which
distributes Java parallel processes onto a cluster [7]. It is implemented in a similar
manner to JOMP (described in chapter 1.3.1).
The main advantage of JaMP is its integration into the Jackal DSM, thus
potentially enhanced scalability is provided. However, the benchmarks have not been
executed on more than eight nodes [7].
Apart from the runtime library requirement, the main disadvantage is the
unsupported parallel for directive, which leads to an inelegant work-sharing directive
notation.
1.3.3 Pyjama
Pyjama intends to implement OpenMP for Java with direct graphical user
interface (GUI) support, i.e. "in the main, the responsiveness maintenance and
allowing correct execution of GUI-related code within those regions" [8].
The Pyjama preprocessor requires .javamp source files which are processed into
regular .java output files [8].
1.3.4 JPPF
pattern supported by directives is not suitable.
1.3.5 AKKA
The project AKKA³ is a state-of-the-art library for concurrent and distributed
JVM computing. It supports both the Java and Scala languages and is used for
general purposes. A C# implementation of AKKA, AKKA.NET, was also
developed for the .NET framework [10].
³ http://akka.io
Chapter 2.
Theory of Parallel Computation
This chapter presents two important laws which are highly relevant to the
motivation for writing parallel programs and to their performance evaluation. A
basic overview of multiprocessing architectures is provided and the one relevant
for this thesis is highlighted. Finally, the OpenMP API standard is briefly
introduced together with the Fork-Join model, its history and its usage in modern
development.
Throughout the last fifty years, a fundamental observation called Moore's law has
held. It states that the number of transistors in integrated circuits doubles every
18 to 24 months [11]. Therefore, it implies exponential growth of the transistor
count as well as an exponential decrease in the physical size of a
transistor.
However, due to physical limitations such as the fact that a transistor cannot
be manufactured smaller than an atom, it is obvious that Moore's law is not going
to hold forever. Moore himself stated in 2010 that:

"... In terms of size [of transistor] you can see that we're approaching
the size of atoms which is a fundamental barrier ..." [12]
Since this limitation obviously exists, CPU manufacturers are motivated
to seek different approaches in order to increase performance. Frequency
boosting increases both power consumption and heat production during the
computation. For this reason, CPU producers almost stopped increasing the
frequency of CPUs in 2004 [14].
Amdahl's law models the expected speedup of a parallelized algorithm
of fixed size [15]. The following notations, definitions and equations are
taken from Yaghob [16].
Definition. T (1) is the total time of the execution of the sequential algorithm
implementation.
Definition. S(P ) is the speedup using P processing elements. The formal defi-
nition is as follows:
S(P) = T(1) / T(P)
In every parallel implementation there is a fraction of the code that cannot run
concurrently. Usually it contains I/O operations such as reading a file, scanning
the network or writing the output.
[Figure: speedup predicted by Amdahl's law as a function of the number of CPUs, plotted for serial fractions γ ∈ {0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.025, 0.01}; x-axis: CPUs, y-axis: Speedup.]
Finally, Amdahl's law may be formulated using the prior definitions. The
theorem asserts that for any positive serial fraction γ, the maximal speedup is bounded.
Thus the scalability of algorithms which have at least a single instruction
executed serially is bounded.
Theorem 1.

lim_{P→∞} S(P) = 1/γ
almost impossible to determine precisely. Thus, Amdahl's law may be taken
as an upper bound on the possible speedup.
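As a quick illustration of this bound, the closed form S(P) = 1/(γ + (1 − γ)/P) can be evaluated numerically. The following sketch (class and method names are ours, not part of omp4j) shows how the speedup saturates at 1/γ:

```java
// Numerical sketch of Amdahl's law in the form S(P) = 1 / (gamma + (1 - gamma) / P),
// where gamma is the fraction of the algorithm that must run serially.
public class AmdahlDemo {

    // Expected speedup on p processing elements for a serial fraction gamma.
    static double speedup(double gamma, int p) {
        return 1.0 / (gamma + (1.0 - gamma) / p);
    }

    public static void main(String[] args) {
        // With gamma = 0.1, no number of CPUs can push the speedup above 1/gamma = 10.
        System.out.println(speedup(0.1, 4));     // modest speedup on 4 CPUs
        System.out.println(speedup(0.1, 1000));  // already close to the bound of 10
    }
}
```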
In this thesis, only the MIMD architecture is used, as the preprocessor provides
exclusively thread-level parallelization. Thus, different threads may execute
different instructions concurrently. Even though the combination of SIMD and
MIMD is often used in low-level programming, this thesis is limited to the
development of a high-level preprocessor compatible with all JVMs. Therefore it cannot
deal with SIMD instructions, because Java provides no interface for instruction-level
programming [17].
2.4 OpenMP
OpenMP has been a standard for shared-memory multiprocessing since 1997. The
parallelization itself is achieved via compiler directives. Therefore, programming
with OpenMP directives is rather straightforward for the programmer, as
it is possible to write serial code decorated with simple directives. Nevertheless,
the developer ought to be aware of the future parallelization in order to maximize the
final speedup. The example in listing 2.1 demonstrates a simple usage of directive
programming.
#pragma omp parallel for
for (int x = 0; x < 10000; x++) {
    runTask(x);
}
Listing 2.1: OpenMP (C++) parallel for
2.4.1 History
Throughout the years, the OpenMP project has been developed by the OpenMP
Architecture Review Board (OpenMP ARB). It consists of both hardware- and
software-oriented companies such as IBM, AMD, ARM, HP, Intel, NVIDIA, Oracle,
Red Hat and many more [18]. The ARB is responsible for all released OpenMP
specifications.
Initially implemented only for Fortran, it became popular across many fields,
especially mathematical computing and physical simulations [19]. In 1998,
C/C++ support was added.
Even though OpenMP 4.0 was released in 2013 [20], only a few C/C++ compilers
support all the new directives. However, the latest versions of the majority of
commonly used C/C++ compilers such as gcc (GNU), icc (Intel) and clang (LLVM)
support OpenMP 3.0 with some directives introduced in 3.1 [21].
gramming. However, for some problems it scales better and it is generally easier
to implement.
Some overhead is naturally connected with the fork operation. For example, new
thread creation causes a delay. Figure 2.2b demonstrates tasks invoked in parallel
(black bold horizontal lines) and the overhead created by parallelization (blue oblique
lines).
Figure 2.2: (a) In Fork-Join model scheme multiple forks and joins are employed.
(b) The fork/join (blue) overhead caused by thread creations and joining.
As modern programming languages such as Scala, Ruby, Groovy and Rust were
developed, programming concurrent applications became easier. However,
for the majority of more traditional languages, numerous libraries and tools
have been developed (see chapter 1.3 for more information about Java tools).
Clearly, the Scala approach is much easier to use than the C pthread approach. In
contrast to that design, OpenMP provides a very comfortable design pattern for
multi-threading, as demonstrated in the prior example (see listing 2.1). For
that reason, OpenMP is still frequently used as a good compromise between
performance and readability [19].
#include <iostream>
#include <cstdlib>
#include <pthread.h>
using namespace std;

#define NUM_THREADS 5

int main() {
    pthread_t threads[NUM_THREADS];
    int rc;
    int i;
    for (i = 0; i < NUM_THREADS; i++) {
        rc = pthread_create(&threads[i], NULL, runTask, (void *) i);
        if (rc) {
            cout << "Error: unable to create thread, " << rc << endl;
            exit(-1);
        }
    }
    pthread_exit(NULL);
}
Listing 2.3: C++ parallel execution
Chapter 3.
Preprocessor Architecture
This chapter describes the architecture of omp4j, the preprocessor introduced
by this thesis. The language used, the coding style and the variety of employed
technologies are explained. Furthermore, the application design is clarified. The
set of supported directives is presented and the directive implementations are
discussed. The following definitions introduce the notation that is employed.
Since Java does not support direct compiler directives such as #pragma in
C++ [8], the OpenMP directives must be supported via a different information
channel. One-line comments were employed so that the source code remains valid
Java even when it is not processed by omp4j.
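For illustration, the snippet below (a made-up example; the individual directive forms are described in chapter 3.7) compiles with plain javac, which treats the directive as an ordinary comment, whereas omp4j would recognize it and parallelize the loop:

```java
// The one-line comment channel: `// omp parallel for` is ignored by a regular
// Java compiler, so the file stays valid Java even without omp4j.
public class CommentDirective {

    static long[] computeSquares(int n) {
        long[] squares = new long[n];
        // omp parallel for
        for (int i = 0; i < n; i++) {
            squares[i] = (long) i * i;
        }
        return squares;
    }

    public static void main(String[] args) {
        System.out.println(computeSquares(1000)[999]); // prints 998001
    }
}
```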
Definition. A Task is a block (of code) that may run independently of other
tasks. It is implemented as an instance of a class which implements the java.lang.Runnable
interface. The overridden method run is not expected to throw any runtime
exception. For tasks throwing exceptions, the behavior is unspecified.
There are four basic types of input-to-output processing, classified
according to the types of input and output. For the purposes of the preprocessor
classification, the input and output types are distinguished as either source
code or JVM bytecode. For future reference, these options are denoted in table
3.1.
                          Parallel output
                          Source    Bytecode
  Serial input  Source    SISO      SIBO
                Bytecode  BISO      BIBO
Neither BISO nor BIBO meets the original requirements (described in chapter
1.1), since they accept only bytecode as input. It would not be reasonable to expect
programmers to pass serial bytecode from their IDEs in order to parallelize it.
On the contrary, both SISO and SIBO are valid and reasonable alternatives;
both were implemented in order to provide the user with freedom of choice. The two
alternatives are illustrated in figure 3.1. The user may use either the SISO approach,
in order to obtain the parallel source code for advanced optimization and version
control purposes, or the SIBO approach, which replaces the javac compiler. In the latter
case, the user provides source code(s) with OpenMP-like directives and the
preprocessor returns compiled parallelized output in the form of Java bytecode
(.class files).
Figure 3.1: Supported translation types. omp4j either behaves like standard Java
compiler or only preprocesses the source code without compilation.
The code transformation is divided into six logical phases. Each phase employs
one or several classes which communicate with each other. The phases are
listed below.
Figure 3.2: Major preprocessor actions divided into phases.
1. Configuration creation
2. Code analysis
3. Directive recognition
4. Top level directives translation
5. Output writing
6. If any directive exists, goto 2.
Two approaches to source code analysis are used by the omp4j preprocessor: bytecode
analysis and syntax analysis. The former is applied to classes whose source code
is not accessible during processing, while the latter is employed when the
Java reflection API is unable to reflect some object, such as an anonymous class.
3.3.1 Bytecode Analysis
The bytecode analysis is essential since the user may provide input source code
which depends on classes that are installed in the classpath. These
classes may already be compiled into bytecode, and the preprocessor might not be
able to access their source code in order to perform syntax analysis instead.
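A minimal sketch of this idea using the standard reflection API (the class and helper-method names below are illustrative, not taken from omp4j):

```java
import java.lang.reflect.Method;

// Sketch of the idea behind bytecode analysis: a class that exists only as compiled
// bytecode on the classpath can still be inspected via the standard reflection API.
public class ReflectDemo {

    static boolean hasPublicMethod(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            for (Method m : cls.getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
        } catch (ClassNotFoundException e) {
            // unknown class: treated as "no such method" in this sketch
        }
        return false;
    }

    public static void main(String[] args) {
        // java.util.ArrayList ships as bytecode only, yet its API is fully visible.
        System.out.println(hasPublicMethod("java.util.ArrayList", "add")); // prints true
    }
}
```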
In contrast to the bytecode analysis, the syntax analysis is essential for two different
reasons. Given an input source code, the preprocessor assumes that anonymous
classes may be present, which is a fundamental problem since those classes
cannot be reflected via the Java reflection API. Furthermore, the preprocessor
needs local variable information that cannot be obtained via the bytecode
analysis.
Since the preprocessor ultimately modifies the source code in order to provide its
concurrent version, simple navigation through the source code is required. For
that purpose an abstract syntax tree (AST) is employed.
For syntax analysis, we decided to use the commonly used ANother Tool for
Language Recognition, version 4 (ANTLRv4), for its speed, great documentation and
the possibility of writing additional grammars. However, other Java parser
generators are available, such as Bison¹ or JavaParser². The former generator is
¹ https://www.gnu.org/software/bison
² http://javaparser.github.io/javaparser
unusable for this project, as the generated parser cannot be employed from Java,
even though it may recognize the Java grammar. The latter recognizer is Java
compatible but has very poor unit tests and documentation.
ANTLR allows the user to provide .g4 grammars in order to generate a lexer
and a parser for the corresponding language. omp4j benefits from this approach, as an
OpenMP-like grammar was implemented as a part of the project. The grammar is
employed during the directive recognition described in chapter 3.4.1.
For Java recognition, the original ANTLR grammar for Java 8 was used. Since
the grammar is maintained by the community, it changes frequently. Even though it has
been tested, there is no proof of correct recognition. Despite this fact, we decided to
use it anyway in order to provide the highest possible level of Java 8 support.
The parser generated by ANTLR provides the AST of the given source code.
The AST might be traversed via the visitor pattern [24]. We implemented
the extractors located in org.omp4j.extractor. The visitors follow the ANTLR
visitor pattern design. They are employed for two main purposes: directive
validation (e.g. break detection in the omp parallel for directive) and the creation of
the class hierarchy model (see chapter 3.3.3).
In addition, the model itself forms a tree, i.e. it preserves relations between
classes, such as one class being declared in another and vice versa. This model is
employed during the directive translation described later in chapter 3.5.
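The visitor-based extraction above can be sketched on a toy AST. The following self-contained example mimics the break detection, although the real extractors operate on ANTLR parse trees and all names here are ours:

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the visitor pattern used by the extractors: a miniature AST
// in which a visitor counts `break` statements, mimicking directive validation.
public class BreakExtractorDemo {

    interface Node { <T> T accept(Visitor<T> v); }
    interface Visitor<T> { T visitBlock(Block b); T visitBreak(Break b); T visitStmt(Stmt s); }

    static class Block implements Node {
        final List<Node> children;
        Block(Node... children) { this.children = Arrays.asList(children); }
        public <T> T accept(Visitor<T> v) { return v.visitBlock(this); }
    }
    static class Break implements Node {
        public <T> T accept(Visitor<T> v) { return v.visitBreak(this); }
    }
    static class Stmt implements Node {
        public <T> T accept(Visitor<T> v) { return v.visitStmt(this); }
    }

    // A validator would reject a work-sharing directive if this count is non-zero.
    static class BreakCounter implements Visitor<Integer> {
        public Integer visitBlock(Block b) {
            int sum = 0;
            for (Node n : b.children) sum += n.accept(this);
            return sum;
        }
        public Integer visitBreak(Break b) { return 1; }
        public Integer visitStmt(Stmt s) { return 0; }
    }

    static int countBreaks(Node root) { return root.accept(new BreakCounter()); }

    public static void main(String[] args) {
        Node loopBody = new Block(new Stmt(), new Block(new Break()), new Stmt());
        System.out.println(countBreaks(loopBody)); // prints 1
    }
}
```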
3.4 Directive Recognition
After the syntax analysis is finished (see chapter 3.3.2), the AST is visited one
more time via org.omp4j.preprocessor.DirectiveVisitor, which seeks all one-line comments
beginning with // omp. As Parr suggests [24], a hidden channel must be defined
for proper comment passing (see listing 3.1 for the hidden channel definition in the Java 8
grammar).
//
// Whitespace and comments
//

WS
    : [ \t\r\n\u000C]+ -> channel(OTHER)
    ;

COMMENT
    : '/*' .*? '*/' -> channel(OTHER)
    ;

LINE_COMMENT
    : '//' ~[\r\n]* -> channel(COMMENTS)
    ;
Listing 3.1: Java8.g4; Comments passed via hidden channel
After the proper comments are fetched, ANTLR is used one more time in
order to parse the line into an AST according to the OMP grammar, which is described
in chapter 3.4.1.
A grammar was employed in order to obtain information from the comment line.
In comparison to regular expressions, grammars provide higher flexibility. In
addition, they are more readable and extensible. The whole grammar may be
found in appendix B.
The grammar that was developed as a part of this project has a rather
straightforward structure. The start rule [24], ompUnit, branches into several directive
rules such as ompParallel and ompBarrier, according to the directive type.
Each rule consists of a lexer rule (e.g. omp or parallel) and optionally some
other rules. For example, the critical section omp critical (lock) is parsed into an AST
as ompUnit, ompCritical and ompVar, which is directly mapped to a lexer rule VAR
that matches the lock variable. Therefore, the preprocessor may employ the visitor
pattern [24] in order to obtain information from the AST, such as "Was the lock
variable specified?".
This behavior can be achieved via two different approaches. First, each
permutation of attributes may be represented as a single rule in the grammar. Since the
grammar may contain at most a finite number of rules [25], this approach
limits the total number of attributes used and makes the grammar unreadable and
exponentially large.
The second approach, which was actually employed, implements a parser that
uses random access memory in the form of a HashMap in order to store parser
attributes. The comparative advantages of this method are the linear number of
rules and better grammar readability and maintainability. Nevertheless, the grammar
itself can no longer be classified as context-free because of the additional memory
use; the matched language nevertheless remains recursively enumerable.
We conclude that the adopted approach is much more beneficial for both the
users and the developers. Compared with Pyjama (see chapter 1.3.3), which
requires the thread-limiting attribute to precede the other attributes [9], omp4j
provides a more comfortable API. Furthermore, the grammar may be easily
extended with additional directives (e.g. reduction), since adding a single new rule
is sufficient.
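The map-based attribute handling can be sketched independently of ANTLR. In the toy parser below, a regular expression stands in for the real grammar; the point is that attributes such as threadNum(4) and private(x,y) end up in a HashMap regardless of their order:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy sketch of the second approach: attributes are collected into a map in any
// order, so one rule per attribute suffices instead of one rule per permutation.
public class AttributeMapDemo {

    static Map<String, String> parseAttributes(String directive) {
        Map<String, String> attrs = new HashMap<>();
        Matcher m = Pattern.compile("(\\w+)\\(([^)]*)\\)").matcher(directive);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2)); // attribute name -> argument list
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Both orderings yield the same attribute map.
        System.out.println(parseAttributes("omp parallel threadNum(4) private(x,y)"));
        System.out.println(parseAttributes("omp parallel private(x,y) threadNum(4)"));
    }
}
```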
3.4.2 Directive Hierarchy Model
After ANTLR parses the comment, a Directive is constructed using the comment
information. Similarly to the class hierarchy model (see chapter 3.3.3), all
directives are stored in the form of a tree, preserving mutual relations.
All directives and directive-related classes are stored inside the org.omp4j.directive
package, which forms an object-oriented class hierarchy.
In comparison with related projects (see chapter 1.3), omp4j employs a
fundamentally different approach to code transformation. For example, JOMP creates a
translation class that has public fields initialized with real references, and
transfers the directive body into the class. Since the class is defined at the end of
the source file, which may be long, the developer loses track of the translated
code. On the contrary, omp4j does not transfer code anywhere. Instead, the
code is slightly modified and wrapped with a clause that manages the parallel
invocation.
Even though this approach is much more difficult to implement correctly and
depends on the ANTLR Java 8 grammar, it leads to code that an ordinary
programmer would produce. Additionally, the processed code remains more readable
and maintainable. Therefore, omp4j may be used for educational purposes, as it
easily demonstrates the straightforwardness of parallel solutions.
The following definitions are employed later in this chapter. The definitions
from the beginning of chapter 3 are often employed.
Definition. The Capturing is a process of AST analysis that obtains all non-local
variables used in the AST.
Definition. A Context class for the given directive is a class extending java.lang.Object
that defines appropriate fields according to the captured variables of the directive
body.
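As a hypothetical illustration of the two definitions above (all class, field and method names here are ours, not generated by omp4j), a directive body reading an array and accumulating into a counter might be rewritten to use a context class as follows:

```java
// Hypothetical illustration of capturing and a context class: non-local variables
// used by a directive body are exposed as public fields of a generated class.
public class ContextDemo {

    // Context class for a body that reads `data` and accumulates into `sum`.
    static class OMPContext {
        public int[] data; // reference type: the reference is copied
        public int sum;    // primitive type: the captured value is duplicated
    }

    static int capturedSum(int[] data) {
        final OMPContext ctx = new OMPContext();
        ctx.data = data;
        ctx.sum = 0;
        // The directive body rewritten to use context fields instead of variables.
        for (int x : ctx.data) {
            ctx.sum += x;
        }
        return ctx.sum;
    }

    public static void main(String[] args) {
        System.out.println(capturedSum(new int[]{1, 2, 3})); // prints 6
    }
}
```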
Given the directive hierarchy model (described in 3.4.2), the main preprocessing
class org.omp4j.preprocessor.Preprocessor starts translating the top directives, i.e.
those that are at the top of the model tree. Therefore, the directives are translated
by layers.
Since the directive bodies in the same hierarchy level cannot overlap, the
selected directives are mutually independent; hence they may be translated in
any order.
When a directive is translated, all direct subdirectives that do not create a
new executor are also translated at once, i.e. nested for, atomic, critical, master,
single and section. This approach ensures that once the layer translation is
finished, a valid output is provided. Hence, a functional design may be employed
and the directive translation may be processed recursively.
The directive translation itself is divided into three separate phases
implemented via three directive methods: validate, preTranslate and postTranslate (see
figure 3.3). All three phases are described below in chapters 3.5.1, 3.5.2 and
3.5.3.
if and only if the location of the directive in the hierarchy model is not supported.
For example, a nested directive such as master must not appear at the first
level of the hierarchy.
During the iteration, the variable capturing is performed. By using the
class hierarchy model and both the bytecode and syntax analyses, TranslationVisitor
builds a set of accessed variables that are either non-local or defined before the
directive body starts. The variables are stored with all important information
such as type and meaning (i.e. local variables, parameters, fields, etc.).
Thus, once phase 2 is finished, the proper variables in the directive body
are ready to be used with the proper context. Furthermore, these variables are
captured and returned.
captured variable references. In the case of primitive types, the captured values are
duplicated.
At this point, a valid code is obtained, since the directive body uses the context
fields instead of the regular variables. Nevertheless, the program still runs serially.
Finally, the executor is created based on the schedule attribute (see chapter 3.6 for
the runtime library description). Then the parallelization itself may happen, which
means that the directive body is split into tasks that may run concurrently. These
tasks are scheduled and executed by the executor. Each directive implements the
task decomposition differently (refer to chapter 3.7 for implementation details).
While the translation phases are being executed, the source code is modified. Since the
AST provided by ANTLR is immutable [24], it is impossible to modify nodes or
to insert and delete branches. For this purpose, ANTLR provides the Rewriter API [24].
The latter approach was chosen for two main reasons. First, it makes the
application design more elegant, as the same routines are not implemented
multiple times when more than one directive is preprocessed. Second, enhanced
modularity is available. For example, the user may reimplement some of the
classes and use them instead of the original ones. For this purpose, a separate git
repository was created, as described in chapter 4.1.
Even though a runtime library is provided, the preprocessed code has no
runtime dependencies, since all required class-files are provided during the
compilation. Consequently, the users might simply compile their sources using
omp4j and run them on a different machine. Thus, the user may employ omp4j instead
of javac, as the compiled bytecode is accompanied by all dependencies.
In order to allow the users to define their own executor, they may let the
preprocessor omit the compilation and preprocess only the sources. In that case,
the sources of the runtime library must be provided by the user, as they are not
copied into the output directory. Additionally, the user may compile the whole
application with a different compiler.
The life-cycle of the executor is as follows. Initially, the tasks are scheduled
via the overridden execute method. As soon as the first task is scheduled, the executor
may invoke the already scheduled tasks in any order. The method itself is non-blocking
and returns immediately. Finally, the waitForExecution method is invoked. It
prevents future task scheduling via the execute method and remains blocked until all
tasks are finished.
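The life-cycle above can be sketched with a minimal executor built on java.util.concurrent; the method names follow the text, while the class itself is a simplified stand-in for the runtime library's executors:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal executor following the described life-cycle: execute() is
// non-blocking and tasks may run in any order; waitForExecution()
// prevents further scheduling and blocks until all tasks finish.
class SimpleDynamicExecutor {
    private final ExecutorService pool;

    SimpleDynamicExecutor(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Non-blocking: the task is queued and the call returns immediately.
    void execute(Runnable task) {
        pool.execute(task);
    }

    // Blocks until all scheduled tasks are done; no new tasks accepted.
    void waitForExecution() {
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```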
Definition. A barrier hit is the operation performed by a thread in which a barrier
counter is increased.
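In terms of this definition, a barrier hit can be sketched as a shared counter increment; this is a simplification (a production barrier would rather build on java.util.concurrent.CyclicBarrier):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Each thread performs a "barrier hit" by incrementing the shared
// counter, then waits until all parties have hit the barrier.
class CountingBarrier {
    private final int parties;
    private final AtomicInteger hits = new AtomicInteger(0);

    CountingBarrier(int parties) { this.parties = parties; }

    void hit() {                        // the barrier hit operation
        hits.incrementAndGet();
        while (hits.get() < parties) {  // wait for the remaining threads
            Thread.yield();
        }
    }

    int hitCount() { return hits.get(); }
}
```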
This chapter describes all implemented directives and their usage. The omp4j preprocessor
supports a subset of the original OpenMP directives [27]; the majority of commonly
used directives is provided, though. Furthermore, the directive behavior and the
translation policy are explained, and the additional attributes are listed.
The directives are translated if and only if they are located inside a class method.
Directives that are inside enum methods are ignored.
3.7.1 omp parallel
The most basic supported directive is omp parallel, which invokes the directive
body in parallel. That means that the body will be executed as many times as
the number of available threads (see chapter 3.7.10 for detailed behavior). The
example (listing 3.3) demonstrates a simple usage and the directive's behavior.
// omp parallel
{
    System.out.println("hello");
}

/* output */
hello
hello
hello
hello

Listing 3.3: omp parallel example
Directive omp parallel for is a shortcut for omp for nested in omp parallel. It must
be used directly before a for-loop. Only regular for-loops are allowed; e.g.
for (int i = 0; i < 10; i++) is a valid example. On the contrary, for-each loops,
e.g. for (T x : collection), are not supported (similarly to JOMP [28]). Even though a
task might be added identically as in a basic for-loop, a lot of unexpected behavior
might occur. For example, the collection may read from the network and/or from a
file. In that case, the order of iteration is crucial.
In contrast to omp parallel, this directive invokes the for-loop only once, regardless
of the specified number of threads. However, each iteration is represented as
a separate task that might run concurrently, i.e. each task may run in a different
thread and their execution order is unspecified (see the example in listing 3.4).
The use of the keywords break, continue and return is not allowed in the directive
body, since the order of execution is not guaranteed and the use of these keywords
is therefore meaningless. Additionally, the iterator is considered final. Thus, if the for-loop
header is as follows: for (int i = 0; i < 10; i++), the variable i cannot be changed in
the loop body.
// omp parallel for
for (int i = 0; i < 10; i++) {
    System.out.print(i);
}

/* output */
2 0 6 7 1 9 5 3 4 8

Listing 3.4: omp parallel for example
Wrapper directive omp sections may contain only omp section directives; other
directives and/or raw code are not supported. Each omp section is treated as an
independent task, i.e. it might be invoked in parallel. omp section should be followed
by a block statement, i.e. {...}.
This directive is commonly used for heterogeneous tasks for which omp parallel
[for] is not suitable. The number of sections is fixed and is determined by the
number of sections that are actually present in the source code.
The example in listing 3.5 demonstrates the common usage of reading input
from multiple independent resources at once.
int a, b, c;
// omp sections
{
    // omp section
    { a = readFromDrive(); }

    // omp section
    { b = readFromStdIn(); }

    // omp section
    { c = readFromNetwork(); }
}

Listing 3.5: omp section(s) example
The major possible pitfall of omp sections is the unused CPU time that arises when
sections of different lengths are declared. As an example (illustrated in figure 3.4), we
assume four independent tasks (each implemented as a section). Assuming four
CPUs, the sections may all run in parallel at once. Due to the different lengths
of the sections, however, the final speedup is not 4, as it might have appeared at
the beginning. As the scheme demonstrates, early finished threads must wait for
the join operation (see chapter 2.4.2 for Fork-Join model details).
Figure 3.4: Bold lines, accompanied with clock icons, demonstrate the time when
the CPU waits until the other sections are completed.
omp master is a directive assuring that its body is invoked only from the master
thread, which is the thread denoted with ID = 0. omp master is not a top-level
directive, which means that it must be nested in some other parallel directive.
Usually, omp master is used inside omp parallel where a block of code should run
only once.
(a) Master directive (b) Single directive
Figure 3.5: Red color represents the thread that executed the directive body. (a)
Thread 0 only i.e. the top one. (b) Any thread
Even though the use of this directive enables enhanced flow control, the use
of the master directive should not be redundant, as it introduces overhead due
to thread creation. As an example (see listing 3.6), we provide a parallel directive
with a master body. The example leads to only one invocation, which might have
been achieved by omitting both directives. The benchmarks of this behavior may
be found in chapter 5.5.
Figure 3.5a demonstrates the master thread (red color). Three blocks of code
decorated with omp master are shown. In each, only the master thread invokes the
specified directive body.
// omp parallel
{
    // omp master
    { compute(); }
}

Listing 3.6: omp master wrong usage example
omp single works similarly to omp master: the directive body is invoked only once,
however, by any particular thread. This leads to possibly better scheduling than
the master directive approach. Nevertheless, due to the omp single implementation,
some overhead is naturally present.
omp single is implemented using an atomic boolean variable. Until some thread
claims the body invocation for itself, all threads attempt to do so. Once the
atomic boolean is set, the remaining threads pass over the body without invoking it.
Figure 3.5b demonstrates the different threads (red color) that execute the single
in each of the blocks of code. It may be a different thread in each block as well as
the same one in all of them.
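The described mechanism can be sketched with compareAndSet; this is a simplification of what the text states, not omp4j's generated code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Every thread races to claim the single body; exactly one wins the
// compareAndSet and invokes it, the remaining threads skip it.
class SingleSketch {
    private final AtomicBoolean claimed = new AtomicBoolean(false);

    void runSingle(Runnable body) {
        if (claimed.compareAndSet(false, true)) {
            body.run();   // only the winning thread gets here
        }
        // losing threads fall through without invoking the body
    }
}
```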
omp critical also must be nested in some parallel directive. Compared to omp master
and omp single, its body is invoked by all threads. Additionally, mutual exclusion
is provided, meaning that at most one thread will access the body at any particular
time.
The critical sections are implemented using Java's built-in synchronized keyword,
which introduces significant overhead (see the benchmark in chapter 5.4). In general, the
user should avoid any synchronization primitives, especially critical sections.
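A simplified view of the translation follows; the lock object and class names are illustrative, as the text only states that synchronized is used:

```java
// Every thread executes the critical body, but the synchronized block
// guarantees mutual exclusion: at most one thread is inside at a time.
class CriticalSketch {
    private final Object lock = new Object(); // illustrative lock object
    private int shared = 0;

    void criticalBody() {
        synchronized (lock) { // what "// omp critical" conceptually becomes
            shared++;         // safe mutation of shared state
        }
    }

    int value() { return shared; }
}
```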
Figure 3.6a represents basic critical section usage. omp parallel is invoked in
four threads, each of which sequentially executes tasks A, C, B in this order.
Furthermore, task C is marked as the critical section.
Figure 3.6b illustrates one possible schedule, where at any point task C is
being executed by at most one thread. Consequently, some threads wait until
other threads run C, which inevitably wastes CPU time. Even though the
waiting is implemented as passive, time is being lost. When a thread leaves the
critical section, all waiting threads attempt to lock the critical section variable
(or the class itself if no variable is specified), hence they cause some additional
overhead that increases with the number of threads.
Figure 3.6: (a) Code scheme - omp parallel running tasks A, C, B with C wrapped in omp critical; (b) Scheduled critical tasks
omp atomic is implemented identically to omp critical. Given the fact that not all
operations have an atomic alternative, it is not possible to provide a general
solution.
In addition, the user might use already compiled classes, which are
impossible to translate on the source code level. Instead, we decided to
discourage the user from the use of the omp atomic directive.
Due to the Java memory model and the JVM instruction set design, which
differ from C/C++ and Fortran, it is impossible to support atomic operations in
general.
The first set of attributes we describe - public, private and firstprivate - may be
used with parallel directives. They accept either a single variable as a parameter
or a comma-separated list of variables. All attributes may be repeated; however,
only the last occurrence of a variable is applied.
public is the default setting for all variables. The public variables are considered
shared among all threads. No synchronization primitives are implicitly
provided, thus these variables should be read-only, because any mutating
operation may collide with other threads.
firstprivate is the same as private, except that it uses a copy-constructor. It means
that the variable itself is passed to the constructor. A compilation error is raised
if the type of the variable does not provide such a constructor.
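As a hedged illustration, a firstprivate variable of type T requires a constructor accepting T; each thread then works on its own copy. ArrayList provides such a copy-constructor:

```java
import java.util.ArrayList;
import java.util.List;

// firstprivate relies on a copy-constructor: conceptually, the original
// variable is passed to "new T(original)" once per thread, so each
// thread mutates its own private copy.
public class FirstprivateSketch {
    public static void main(String[] args) {
        List<Integer> original = new ArrayList<>();
        original.add(42);

        // what the preprocessor conceptually emits for one thread:
        List<Integer> privateCopy = new ArrayList<>(original);
        privateCopy.add(7);   // does not affect the original

        System.out.println(original.size() + " " + privateCopy.size()); // 1 2
    }
}
```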
3.7.10 threadNum
threadNum is the thread-number restriction attribute of the parallel directives. The
common use case is demonstrated in listing 3.7.
The attribute may be used at most once in a directive. The passed parameter
is a single integer representing the number of threads that will participate in the
directive body execution. If the threadNum attribute is missing, the total number of
CPUs that the JVM is aware of is used implicitly.
/* Running on 48 CPU server */
// omp parallel threadNum(2)
{
    System.out.println("hello");
}

/* Output */
hello
hello

Listing 3.7: threadNum example
omp4j supports macros (or constants) through which the information about the
current thread is provided. Nevertheless, source code with these macros will
not compile properly using the standard Java compiler. This is the only
difference from standard Java source code. To maintain portability, the
usage of directives without Thread-ID macros is suggested.
The example in listing 3.8 demonstrates the proper usage of OMP4J_THREAD_NUM
and OMP4J_NUM_THREADS, which are translated into the current thread ID (numbered
from 0) and the total number of threads invoking this directive body, respectively.
We decided to employ macros rather than specific methods as related projects
do (e.g. JOMP). The use of macros brings two main advantages: first, the user
does not need to be aware of the preprocessor package structure, since they do
not invoke any methods (such as org.omp4j.runtime.Config.getThreadNum()); second,
the user may easily redefine the constants in the class that should be parallelized,
thus the code may be compiled with other compilers as well. Therefore, enhanced
portability is provided in comparison with similar projects.
The macros are translated into constant integers, hence all common operations
such as addition, multiplication and array-indexing are available. On the contrary,
the macros cannot be assigned a new value, as they are constants. The macros
follow the Java coding style for constants, hence they appear like regular constants.
// omp parallel threadNum(5)
{
    System.out.println(OMP4J_THREAD_NUM + "/" + OMP4J_NUM_THREADS);
}

/* Output */
2/5
3/5
0/5
1/5
4/5

Listing 3.8: Thread-ID example
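The redefinition trick mentioned above might look as follows; the fallback values are an assumption based on the text, which only states that the constants may be redefined in the parallelized class:

```java
public class MacroFallback {
    // Fallback values so the class also compiles with plain javac and
    // then behaves like a serial program; omp4j is expected to supply
    // the real thread information instead.
    static final int OMP4J_THREAD_NUM = 0;
    static final int OMP4J_NUM_THREADS = 1;

    public static void main(String[] args) {
        // omp parallel
        {
            System.out.println(OMP4J_THREAD_NUM + "/" + OMP4J_NUM_THREADS);
        }
    }
}
```

Compiled with plain javac, the program above prints 0/1 once.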
3.7.12 schedule
The last supported attribute is schedule. It accepts a fully qualified name of a class
that implements the org.omp4j.runtime.IOMPExecutor interface. As an abbreviation,
the parameters dynamic and static are supported. The former parameter is translated
into org.omp4j.runtime.DynamicExecutor, which is described in chapter 3.6.2. The latter
parameter is analogously translated into StaticExecutor. Each executor modifies
the scheduling policy and the final speedup.
schedule may be used at most once in each parallel directive. The default value
is dynamic, as it better suits the majority of cases. The dynamic policy registers tasks
in a single central queue from which the tasks are assigned to the workers
one at a time on demand. See chapter 3.6 for a detailed description.
On the contrary, the static policy assigns each task to one thread before the
execution. We advise the user to use the default dynamic approach for its
better performance.
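The difference between the two policies can be sketched as follows; this is a conceptual illustration, not the code of DynamicExecutor or StaticExecutor:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Static policy: iterations are partitioned among threads up front.
// Dynamic policy: free threads pull the next iteration from a shared
// counter (standing in for the central task queue).
class ScheduleSketch {
    static final AtomicInteger executed = new AtomicInteger(0);
    static final AtomicInteger next = new AtomicInteger(0);

    static void work(int i) { executed.incrementAndGet(); }

    // Static: thread t of T owns iterations t, t+T, t+2T, ...
    static void staticChunk(int threadId, int threads, int n) {
        for (int i = threadId; i < n; i += threads) {
            work(i);
        }
    }

    // Dynamic: whichever thread is free grabs the next iteration.
    static void dynamicChunk(int n) {
        int i;
        while ((i = next.getAndIncrement()) < n) {
            work(i);
        }
    }
}
```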
Initially, Scala applications may be assembled³ into JARs that do not require
Scala to be installed for their execution [30]. Additionally, Scala features full Java
interoperability, which enables the usage of native Java libraries [31]. That is an essential
property for the syntax analysis that is described in chapter 3.3.
The language construct that has been used the most is the implicit keyword. The
variables defined as implicit are automatically passed to the invoked functions. This
concept is a welcome alternative to a static configuration context.
Like Haskell [33], Scala provides simple pattern matching that we often
use during exception handling and subclass classification.
³ https://github.com/sbt/sbt-assembly
Chapter 4.
Implementation
This chapter briefly describes the techniques and environment used during the development
of the thesis. Various links to project-related websites are presented,
as the preprocessor consists of several modules.
A version control system (VCS, revision control) was employed in order to increase
the efficiency of the development process, even though only one programmer has been
committing the changes. Additionally, a VCS supports further development when
more developers are involved.
Currently the most popular version control system - git - introduced by Linus Torvalds
in 2005, was chosen for the simplicity of its command-line interface and
its general speed. Since git is distributed, a current backup and the whole history
of code changes have been saved on multiple machines [34].
All commits have been uploaded to GitHub¹, which is currently the most
popular git repository hosting for both open-source and private projects. The
commits and the code changes may be easily accessed using GitHub visualization
tools. GitHub will maintain further repository support in terms of bug-listing,
and the community will be able to fork omp4j in order to develop new features.
Table 4.1 presents the set of URLs that lead to the appropriate git repositories
hosted on GitHub.
¹ https://github.com
URL Target description
https://github.com/omp4j omp4j group
https://github.com/omp4j/omp4j preprocessor repository
https://github.com/omp4j/runtime runtime library repository
https://github.com/omp4j/www webpage repository
https://github.com/omp4j/grammar ANTLR grammar repository
https://github.com/omp4j/benchmark performance evaluation repository
https://github.com/omp4j/doc API reference
4.2 License
The unit tests were executed frequently in order to maintain the stability and
professional quality of the software. The whole project was developed in the
modern agile Behavior-Driven Development (BDD) style, which builds on test-driven
development [35]. Even though BDD is a relatively new approach to software
development, it has quickly become popular for its robust description of the tests,
which are usually written from the user's point of view [35].
ScalaTest - a unit testing library for Scala and Java projects [36] - was used
in order to fulfill the BDD requirements. It enables extremely straightforward
test composition via a natural-language-like DSL (see example 4.1). In addition, the
tests may be invoked directly from both SBT (via test) and the IntelliJ IDEA IDE. In
order to provide faster testing, a subset of tests (e.g. only previously failed tests)
might be run. Hence, duplicate testing is eliminated, which internally accelerates
the development life-cycle. By employing parallel test invocation, even a large
number of tests may be run within minutes.
describe("omp sections children in file:") {
  it("01.java should not contain other statements") {
    an [SyntaxErrorException] should be thrownBy new Preprocessor()(...).run()
  }
}

Listing 4.1: ScalaTest test definition
² https://travis-ci.org
³ All tests that were ever executed may be found at https://travis-ci.org/omp4j/omp4j
Chapter 5.
Performance Evaluation
This chapter describes several benchmarks that were run in order to evaluate the
performance of the translated sources. Several different types of benchmarks are
implemented and their differences explained. In addition, two linear models are
derived from the data of the most independent benchmark in order to provide
deeper insight into the dependencies in the measured data. Furthermore, a comparison
with related projects is presented.
5.1 Methodology
The following benchmarks avoid measuring absolute values, since a tool for
parallel invocation is being evaluated. Instead, relative quantities such as the
total speedup are measured.
Initially, the benchmark class is loaded, followed by an instance creation. After
this process is completed, the benchmark is invoked. This approach is used in
order to eliminate the class-loading overhead. Loading and instantiating the benchmark
class before the time measurement forces the JVM to run all class-loading routines
before the measurement. As a result, the measured data is more coherent
[38].
As described above, each benchmark consists of two parts - the serial reference
algorithm and its parallel implementation. Two values, t_serial and t_parallel, are
obtained by measuring the durations of the executions of these methods. These
quantities are always measured immediately one after another. The speedup is
consequently calculated as the fraction t_serial / t_parallel.
Each benchmark returns the measured speedup rather than the raw times of the
executions. Even though the latter approach provides greater mean speedups, the
former method reflects reality in more detail. By using the speedup approach,
the effects of the environment (e.g. background processes) are minimized, since
the speedup is calculated pairwise.
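The pairwise computation described above amounts to the following helpers (the names are illustrative):

```java
// Speedup is computed pairwise: each run measures the serial and the
// parallel duration back to back and returns their ratio, so a slow
// environment inflates both durations and largely cancels out.
class SpeedupSketch {
    static double measuredSpeedup(long serialNanos, long parallelNanos) {
        return (double) serialNanos / (double) parallelNanos;
    }

    static double meanSpeedup(double[] speedups) {
        double sum = 0;
        for (double s : speedups) sum += s;
        return sum / speedups.length;
    }
}
```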
The use of the CLT is appropriate, since the obtained results are both identically
distributed (one benchmark executed multiple times) and independent of each other
(only one being executed at a particular time) [39].
5.1.3 Used Hardware
All benchmarks were executed on the same (Linux) NUMA server (named navarin)
that has four sockets with a 12-core CPU in each socket. Hence, benchmarks
using up to 48 cores are supported, which is important for the creation of the
linear models. In contrast to similar projects (such as Pyjama with 16 cores [8]),
benchmarks that employ a higher number of cores provide more valuable information.
Even though no other user was logged into the described machine during the
benchmarking, some system processes may have been running. These effects are
filtered out by using multiple executions of each benchmark, as described in the
prior paragraphs.
In order to obtain results that depend on the number of used cores while running
the benchmarks on the same hardware, the following thread limitation techniques
are employed.
The number of threads for the original C++ OpenMP was limited by using
the environment variable OMP_NUM_THREADS. Finally, we limited the number of
threads used by JOMP via the -Djomp.threads=N option, since JOMP does not
support other thread limitation techniques.
The used techniques are not supposed to have any impact on the measured results.
All hypothesis tests are executed at the significance level α = 0.01 if not
stated otherwise.
The correlation coefficient classification [40] is used according to table 5.1.
Coefficient Label
0.00 - 0.35 Weak
0.36 - 0.67 Moderate
0.68 - 0.90 High
0.91 - 1.00 Very high
The majority of the benchmarks provided in the following chapters are
accompanied by various plots. When the blue color is used in the form of an
unlabeled line, it represents the theoretical optimum, which is a rough upper
bound of the real values. Furthermore, Amdahl's law limits the possible speedup
results even more (see chapter 2.2 for more details about the law).
The omp parallel for directive was used for its simple and straightforward iteration
through the workload, which represents the computation of the 30th to 35th Fibonacci
numbers. The results are not stored anywhere in memory in order to isolate
each computation absolutely in terms of memory sharing and cache
synchronization.
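The benchmark body might be sketched as follows; the wrapper method and the small iteration count are assumptions for illustration, and since the directive is a comment, the sketch also compiles and runs serially under plain javac:

```java
public class FibBenchmarkSketch {
    // Naive exponential Fibonacci: pure CPU work, nothing is stored,
    // so iterations share no memory and no cache synchronization occurs.
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    // One benchmark pass; the thesis uses workloads of 1,000 and
    // 20,000 iterations, a small count is used here for illustration.
    static void pass(int iterations) {
        // omp parallel for
        for (int i = 0; i < iterations; i++) {
            fib(30 + i % 6); // 30th to 35th Fibonacci numbers
        }
    }

    public static void main(String[] args) {
        pass(6);
    }
}
```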
Figure 5.1: Fibonacci benchmark - the dependence of the speedup on the number
of cores with a fixed workload and dynamic scheduling
Initially, figure 5.1 demonstrates the dependence of the speedup
on the number of CPUs (denoted as cores). Two different workloads are displayed
- a thousand iterations (see figure 5.1a) and twenty thousand iterations (see figure 5.1b).
With more than twenty employed cores, the growth of the speedup declines, as
follows from figure 5.1. This phenomenon can be explained by two independent
reasons. First, Amdahl's law limits the maximal speedup (see chapter 2.2), which
is observable at greater numbers of cores. Second, the lock contention among
the threads rises whilst locking the task queue.
Cores Workload Coefficient
10 1,000 0.9987635
20 1,000 0.9960283
10 20,000 0.9994774
20 20,000 0.9992499
[Figure 5.2: the dependence of the speedup on the workload - (a) 8 cores, (b) 24 cores]
Secondly, the dependence of the speedup on the workload size is presented, for
which purpose the number of used cores was fixed. The 8-core result, which
simulates a regular desktop computer, is provided (see figure 5.2a), as well as the
24-core result, which simulates a powerful server (see figure 5.2b). We observe
that the increasing workload causes growth of the speedup. In addition, the
speedup stabilizes quickly.
Given the fixed number of CPUs, we test the difference between the mean
speedups at the 1,000 and 20,000 workloads. The null-hypothesis H0 is tested against
the one-sided alternative H1 as follows.
Anděl suggests using an unpaired t-test [41]. Table 5.3 presents the computed
p-values of this test for different fixed core numbers. In both cases, H0 is rejected
at the previously set significance level, since the computed p-values are less than α.
Hence, we conclude that the workload significantly influences the speedup when
the number of CPUs is fixed.
In addition, each plot contains both the 1,000 and 10,000 iteration workloads. The
histograms together with the normality discussion may be found in appendix F.
[Figure: speedup vs. workload at fixed numbers of cores]
5.2.4 Linear Model
The initial model aims to determine how much variation in the speedup is
explained by the number of cores. In order to keep the model maximally simple
and not to over-fit it, we used a quadratic dependence (see equation 5.1).
The significance level remains the same as defined in chapter 5.1.5. By using
the ordinary least squares method [42], the βi coefficients are estimated as follows
from equation 5.2.
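Equation 5.1 is not reproduced in this text; based on the description (a quadratic dependence of the speedup on the number of cores with three coefficients), the model presumably has the form below, where the symbol names are assumed:

```latex
\mathit{speedup} = \beta_0 + \beta_1 \cdot \mathit{cores} + \beta_2 \cdot \mathit{cores}^2 + \varepsilon
```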
Since the p-values of the significances of both β1 and β2 are less than 2·10⁻¹⁶, all
three coefficients remain in the model [43]. Due to Amdahl's law (see chapter 2.2),
the speedup cannot be linear. Hence, β2 is non-zero and therefore significant.
Additionally, the speedup cannot exceed the optimal speedup, hence β2 must
be negative. In order to compensate for the quadratic decrease of the speedup
growth, β1 is slightly greater than one [43]. Thus, the speedup is almost optimal
for smaller numbers of cores (see table 5.3).
The visualization of the estimated regression line is shown in figure 5.4a.
The computed residuals are demonstrated in figure 5.4b. Their
unbiased appearance suggests that the model fits well. The coefficient of
determination R² = 0.9889 suggests very good variance coverage.
The p-value of the calculated F-statistic is less than 2.2·10⁻¹⁶. According to
the residuals plot and the attributes above, we conclude that the model fits well
[44, 45]. For details of the linear model, refer to appendix G.
[Figure 5.4: (a) speedup vs. cores with the fitted regression line; (b) residuals]
Analogously to the prior linear model, the ordinary least squares method [42]
was used in order to estimate the βi. The estimated coefficients are as follows from
equations 5.4.
All the p-values of the variable significances are less than the set significance
level (see chapter 5.1.5) [43]. Considering a fixed workload, β1 and β2 may be
explained analogously to the corresponding coefficients of the simple model. Nevertheless,
the quadratic dependence is stronger, since it must balance β3, which smooths
the curve. On the contrary, considering the fixed number of cores, the influence of
the workload is similar. However, since β5 is stronger than β2 in the prior model,
we conclude that the workload influences the total speedup less than the number of
cores. This implies that the scalability of the benchmark is sufficient.
β0 = 3.905 (5.4a)
β1 = 1.369 (5.4b)
In order to maintain the clarity of the data, the model itself is not presented.
Instead, a three-dimensional visualization of the raw measured speedups is illustrated
in figure 5.5a. As may be observed, the core dependence is dominant.
Furthermore, the residuals are plotted as evidence of a good model (see figure 5.5b).
Analogously to the prior model, their unbiased appearance suggests that the model
fits well [45]. The coefficient of determination R² = 0.9688 suggests
sufficient variance coverage. The p-value of the calculated F-statistic
is less than 2.2·10⁻¹⁶. According to the residuals and the attributes above, we
analogously conclude that the model fits well [44]. For details of the linear model, we
refer to appendix H.
[Figure 5.5: (a) three-dimensional visualization of the speedup over cores and workload; (b) residuals]
According to the residual plots of both previous models, we conclude that the models
do not appear to feature heteroscedasticity, since we observe strong
homoscedasticity instead [46]. We conclude that both models fit the measured
data well and may be employed for speedup estimations. Nevertheless, the models are
not appropriate for extrapolation.
We decided to use this trivial approach in order not to use recursion, thus the
scalability of the problem is enhanced. Additionally, this approach is easily invoked
in parallel, since the result cells are computed independently of each other.
Even though omp parallel for was used analogously to the prior benchmark
(see chapter 5.2) and the amount of workload remains approximately the same, the measured
speedups differ. This example stores the result in memory, which inevitably
leads to synchronization at some point.
The benchmark was run with the workload set to 2, 000. As figure 5.6a illus-
trates, the speedup of this benchmark is scalable similarly to the prior benchmark
(see chapter 5.2) in terms of the observed shape of the dependence. The second
figure (5.6b) indicates the speedup distribution.
[Figure 5.6: (a) speedup vs. cores; (b) speedup distribution vs. workload]
Finally, as the last omp parallel for benchmark, we implemented a parallel version
of function maximum finding that employs a critical section (omp critical).
The function f(x, y), defined in equation 5.5, is maximized over the integer
matrix [-3,500; 3,500)².
f(x, y) = log(x² + 6y) · sin(x − y) + ⁴√|x³ − y³| − cosh(3x)    (5.5)
The algorithm evaluates f at every integer point in the input interval, dividing
the total workload into chunks of 500 that are invoked in parallel. Nevertheless,
the critical section, where the maximization happens, is an extreme bottleneck.
Since all worker threads work for approximately the same time, the contention on
the critical section lock is enormous and leads to great overhead.
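The bottleneck can be sketched as follows; f is replaced by a trivial stand-in and the chunking is simplified, so this is an illustration of the design rather than the benchmark's code:

```java
// Every chunk computes a local maximum and then enters the critical
// section once to merge it into the global maximum. With many threads,
// they queue on the lock and the speedup stalls.
class MaxBenchmarkSketch {
    private long globalMax = Long.MIN_VALUE;
    private final Object lock = new Object();

    static long f(long x) { return x * x % 1000003; } // stand-in for f(x, y)

    void processChunk(long from, long to) {
        long localMax = Long.MIN_VALUE;
        for (long x = from; x < to; x++) {
            localMax = Math.max(localMax, f(x));
        }
        synchronized (lock) { // the critical section
            globalMax = Math.max(globalMax, localMax);
        }
    }

    long result() { return globalMax; }
}
```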
As a demonstration of the fact that the critical section lacks scalability, the
dependence of the speedup on the cores is presented in the form of a graph (see figure 5.7a).
We observe that the speedup steadily increases up to twelve CPUs, where it reaches
a value slightly less than ten. That value remains constant whilst the number
of CPUs increases. Even though the scalability seems to be insufficient, the
increasing number of CPUs does not cause any decrease of the speedup, which is
[Figure 5.7: (a) speedup vs. cores; (b) speedup distribution vs. workload]
expected given the increasing race conditions over the critical section mutex.
The inappropriate use of the master or single directives causes overhead. For the
purpose of measuring its size, the following benchmark was developed. It measures
the total speedup of a completely useless directive usage, i.e. a master or
single directive located in a parallel region. Clearly, the directive body is executed
only once; nevertheless, the overhead caused by useless thread creations
may be measured. Obviously, the total speedup is expected to be less than 1.
We observed that for non-trivial workloads the overheads are nearly unmeasurable,
as the mean speedup is not significantly lower than one. Hence, only
¹ http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java
[Figure 5.8: speedup vs. cores for workloads 200, 500, 700 and 1,000]
smaller workloads are demonstrated in figure 5.8. According to the figure, even
though a greater workload lowers the fraction of the thread creation overhead, the
growing number of CPUs implies growth of the number of threads that are
created. Consequently, a greater overhead is incurred.
/* reference algorithm */
runLevenstein(stringA, stringB);

/* parallel algorithm */
// omp parallel
{
    // omp master (single respectively)
    { runLevenstein(stringA, stringB); }
}

Listing 5.2: Master/Single benchmark design
According to the measured data, single provides a slightly higher speedup.
This is caused by the fact that the task may already be executed while the other
threads are being created. On the other hand, master requires thread number 0 for its
execution, possibly waiting for all other threads to be created. We recommend
the use of single rather than master for that reason.
benchmark (described in chapter 5.2) - was chosen. By not storing the results
in shared memory, the data are cleared and the Java memory model
effects are filtered out. Hence, the differences among different memory models and
synchronization techniques are eliminated.
In order to provide practical results, the C++ source code was compiled using
only the g++ -Wall -O3 -fopenmp options. Figure 5.9 illustrates the comparison of
gcc (black) and omp4j (red) on two fundamentally different workloads - 1,000 (see
figure 5.9a) and 20,000 (see figure 5.9b).
[Figure 5.9: speedup vs. cores for gcc and omp4j - (a) workload 1,000, (b) workload 20,000]
[Figure 5.10: speedup vs. cores for JOMP and omp4j]
Analogously to the prior C++ comparison, we aim to compare omp4j with JOMP,
which also provides source-level parallelization and runs directly on the JVM (as
described in chapter 1.3.1).
Considering the former workload (see figure 5.10a), we conclude that omp4j
scales more smoothly at the expense of worse local performance around 24
CPUs. In order to determine whether omp4j performs significantly worse, the
unpaired t-test is used, as Anděl suggests [41]. The significance level is set as
in chapter 5.1.5. The null hypothesis H0 is tested against the one-sided
alternative H1 as follows.
H0: The mean omp4j speedup is equal to the mean JOMP speedup at a
fixed number of CPUs = 24
H1: The mean omp4j speedup is significantly less than the mean JOMP
speedup at a fixed number of CPUs = 24
JOMP provides significantly greater speedup than omp4j.
On the contrary, we may observe that with a higher number of CPUs, omp4j
performs better. Using analogous hypotheses and a t-test at a fixed number of
cores equal to 48, we reject H0 with a p-value of 3.709 × 10^-5. Hence, the omp4j
scalability is significantly better.
Considering the latter figure (5.10b), we may observe that up to 30 used cores,
both preprocessors perform equally. However, with an increasing number of cores,
omp4j scales better. To prove this claim, the t-test with analogous hypotheses is
employed. The null hypothesis H0 is rejected with a p-value of 5.696 × 10^-10.
Thus, we conclude that even for big workloads, omp4j scales better than JOMP.
Chapter 6.
User Documentation
In this chapter, a complete tutorial covering source code download, project
compilation and installation is provided. Furthermore, the supported program
options are described and a comprehensive example of real-world preprocessor
usage is demonstrated.
6.1 Prerequisites
In order to run the preprocessor, a properly installed Java Development Kit (JDK)
is required since the Java compiler is used frequently during the process.
omp4j is compatible with the Java standards. It was explicitly tested on the following
JDKs.
OpenJDK 6
OpenJDK 7
Oracle JDK 7
Oracle JDK 8 (recommended)
The preprocessor was initially developed for Linux and UNIX systems. However,
it may be used on MS Windows as well. The user must be aware of common
pitfalls such as problematic environment variable settings. For example, null
must not be returned by ToolProvider.getSystemJavaCompiler(). We strongly discourage
using the preprocessor on MS Windows since it was properly tested only on
Linux systems.
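The ToolProvider check mentioned above can be performed up front. The following standalone sketch (independent of omp4j) verifies that a usable system compiler is present:

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Minimal check that the JVM was started from a JDK rather than a bare JRE:
// omp4j needs the system Java compiler, so this must not return null.
public class CompilerCheck {

    static boolean jdkAvailable() {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    public static void main(String[] args) {
        if (jdkAvailable()) {
            System.out.println("System Java compiler found - JDK is usable.");
        } else {
            System.err.println("No system Java compiler - install a JDK or fix JAVA_HOME.");
        }
    }
}
```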
There are multiple ways of downloading and installing omp4j. This section
provides a step-by-step tutorial covering all important information. The
preprocessor may be obtained either by a direct JAR download or by cloning
the git repository and compiling it.
There are multiple ways of obtaining the project sources. The recommended way
is using the git CLI since the source code is developed with this system (hosted
on GitHub). As the preprocessor repository depends on the grammar and runtime
submodules, using the --recursive option is suggested in order to download all
dependent submodules [34].
# SSH
$ git clone --recursive git@github.com:omp4j/omp4j.git

# HTTPS
$ git clone --recursive https://github.com/omp4j/omp4j.git
Listing 6.1: Source download
The other possible way of accessing the source is to download a ZIP archive
directly from GitHub2. However, this approach requires further submodule
initialization since GitHub does not include the .git/ directory in the archive.
command (e.g. $ sbt compile) or as an interactive CLI. We assume that the user
prefers the interactive mode; the rest of the text follows this assumption.
The preprocessor was developed and tested with SBT version 0.13.6.
The compilation of the source code is rather simple. Entering the SBT CLI by
running $ sbt in the project directory enables the user to control SBT in the
interactive mode. The most common commands are listed in Table 6.1 on page 64.
Command      Action
compile      download all dependencies and compile all sources
run [args]   execute the compiled program, passing [args] to the main function
test         run the unit tests, printing (colorful) results
assembly     generate an executable dependency-free .jar archive
doc          generate the scaladoc API reference
Hence, in order to compile the project, the compile command must be executed.
Before the compilation starts, SBT downloads all dependencies, such as ANTLR
and ScalaTest, from the internet. If no internet connection is available, the user
must provide all dependencies manually. This may be achieved either by direct
installation or by copying the proper JARs into the lib/ directory. However, this
approach is strongly discouraged. The ANTLR JAR may be found in the digital
attachment of this thesis, as well as the whole git repository.
The grammar repository3 contains both the Java 8 ANTLR grammar and the
OMP grammar that was implemented as a part of this project. The former
grammar is slightly modified in that it uses a different parser package. In addition,
it allows the generated parser to access single-line comments. Apart from
the grammars themselves (.g4 files), the repository also contains the generated
parsers and lexers. Therefore, when omp4j is downloaded with the --recursive option,
the grammars do not require further compilation. In order to recompile the
grammars, update.sh may be executed, as it downloads the ANTLR complete JAR
and compiles the grammars.
3
https://github.com/omp4j/grammar
6.3 Installation and Invocation
omp4j does not require any special installation since it is a regular JAR. The
preprocessor is invoked as $ java -jar omp4j.jar <params>. We
suggest that UNIX users create a shell alias omp4j for it.
Alternatively, it may be run from the SBT CLI via the run command followed by
parameters and flags. Nevertheless, the user must then provide their own implementation
of the executors since the default executors are employed only when the assembled
JAR is used. We decided to design the behavior in this manner in order to simplify
future development of executors. For regular purposes, the assembled JAR
is recommended for its higher performance.
Short  Long           Behavior
-d     --destdir      Directory where the output classes are stored.
-h     --help         Print help.
-n     --no-compile   Do not compile the preprocessed sources.
-v     --verbose      Provide progress information.
omp4j supports all javac CLI options. Additionally, it provides a few new
options, listed in table 6.2.
Finally, two examples of common omp4j usage are provided in listing 6.2. The
former example behaves similarly to javac and thus may be used in its place.
The latter example represents the bare preprocessor that does not compile the
final sources.
$ omp4j -d classes -v MyClass1.java MyClass2.java
$ omp4j -d sources --no-compile MyClass1.java MyClass2.java
Listing 6.2: omp4j example invocations
More examples may be found in the digital attachment 7.3. The examples are
also stored on GitHub4. The provided comprehensive examples demonstrate all
features of all directives. Some other examples may also be found on the project
website5.
4
https://github.com/omp4j/omp4j/tree/master/example
5
http://www.omp4j.org/tutorial
Chapter 7.
Project Website
This chapter describes the project website1, which was developed in order to
provide public information about the installation and usage of the omp4j preprocessor.
The whole website consists of two major parts: the front-end and the back-end.
The former provides the user interface and runs entirely on the client side (see
chapter 7.1.1). The latter handles requests and runs entirely on the server
side (see chapter 7.1.2).
7.1.1 Front-end
The Angular-based website communicates with the back-end only via a RESTful
API, which provides complete freedom in the choice of back-end technology [47, 48].
The first page access requires downloading the whole page template, the CSS
styles and the JavaScript libraries (c. 50 kB for the AngularJS scripts [49]). However,
the following requests transfer a minimal amount of data, as only the changing
1
http://www.omp4j.org
2
http://www.omp4j.org/demo
3
https://angularjs.org
content is downloaded. Thus, once the page is loaded, further navigation
transfers only the changing content.
7.1.2 Back-end
The former controller serves HTML resources to the front-end. Thus, the time
spent processing a single request is minimized and the user may quickly
navigate through the website.
The latter controller accepts a JSON object containing the source code that
the user wants processed by omp4j. The preprocessor is then instantiated
and executed. Finally, the output is sent in the HTTP response.
Listing 7.1 demonstrates the described work-flow. Refer to appendix J for full
Demo controller code.
4
https://www.playframework.com
/* Create the configuration context */
val conf = new Config(Array(
    "-d", "path/to/output/directory",
    "--source-only",              // provide only processed sources
    "file1.java", "file2.java"    // of file1.java and file2.java
))

/* Create preprocessor */
val prep = new Preprocessor()(conf)

/* Run the preprocessor */
prep.run()
Listing 7.1: omp4j as a library
7.3 Hosting
Thus, the website is hosted at Heroku8, which offers a free plan for small projects.
Heroku was preferred for its great simplicity, git deployment [53, 34] and possible
5
http://getbootstrap.com
6
http://aws.amazon.com
7
https://www.openshift.com
8
https://heroku.com
scalability for future purposes.
The free plan at Heroku is limited in that the dynos fall asleep after an hour
of inactivity [54]. For that reason, the first request after such a period may
take dozens of seconds to process.
Conclusion
The implemented OpenMP-like Java preprocessor, omp4j, fulfills the original
requirements set out in the specification. The preprocessor is highly portable as it
is compatible with all major JDKs. Additionally, it processes all modern versions
of Java, and the processed code is independent of any runtime libraries, which is
the main advantage of omp4j. In comparison with related projects, omp4j offers
several features, such as the flexible grammar of the directives, that enhance
the simplicity of directive usage.
The schedule attribute is recognized, which allows the user either to employ the
already provided executors, dynamic and static, or to implement a new executor
that better suits the situation at hand.
9
http://www.omp4j.org
The major disadvantage of omp4j is its complex source code analysis, which
heavily depends on ANTLR and its Java 8 grammar. However, the analysis is
essential as it leads to output resembling what a programmer would produce by
hand. In comparison with related projects, omp4j does not move the parallel region
into a separate class. We conclude that this behavior is beneficial for educational
purposes since it minimizes the amount of code transformation.
Bibliography
[1] Yifan Zhang et al. Towards Better CPU Power Management on Multicore Smartphones. url: http://research.microsoft.com/en-us/um/people/zhao/pubs/hotpower13gpu.pdf (visited on 04/16/2015).
[2] Oracle. Java Platform, Standard Edition 8 Names and Versions. url: http://www.oracle.com/technetwork/java/javase/jdk8-naming-2157130.html (visited on 05/02/2015).
[3] Oracle. The History of Java Technology. url: http://www.oracle.com/technetwork/java/javase/overview/javahistory-index-19835.html (visited on 04/16/2015).
[4] Mark Bull. JOMP. url: https://www2.epcc.ed.ac.uk/computing/research_activities/jomp/index_1.html (visited on 05/03/2015).
[5] Mark Bull. JOMP - download. url: https://www2.epcc.ed.ac.uk/computing/research_activities/jomp/download.html (visited on 05/03/2015).
[6] J. M. Bull et al. Towards OpenMP for Java. EWOMP 2000. url: https://www2.epcc.ed.ac.uk/computing/research_activities/jomp/papers/ewomp00.pdf (visited on 05/03/2015).
[7] M. Klemm et al. JaMP: An Implementation of OpenMP for a Java DSM. url: http://www.researchgate.net/profile/Michael_Klemm2/publication/220105283_JaMP_an_implementation_of_OpenMP_for_a_Java_DSM/links/02e7e52827539be4a9000000.pdf (visited on 05/03/2015).
[8] Vikas, Nasser Giacaman, and Oliver Sinnen. Pyjama: OpenMP-like implementation for Java, with GUI extensions. In: The 2013 International Workshop on Programming Models and Applications for Multicores and Manycores. Shenzhen, China: ACM, Feb. 2013. url: http://homepages.engineering.auckland.ac.nz/~ngia003/files/vikas_pyjama_workshop_2013.pdf (visited on 05/05/2015).
[9] Nasser Giacaman and Oliver Sinnen. Pyjama (PJ) help. url: http://homepages.engineering.auckland.ac.nz/~parallel/ParallelIT/downloads/Pyjama/PJHelp1.3.x.pdf (visited on 05/03/2015).
[10] James Conway. Starting Akka.net. Apr. 7, 2015. url: http://blog.jaywayco.co.uk/starting-akka-net (visited on 05/03/2015).
[11] Martin Děcký. Výkonnost počítače, Architektura počítačů. url: http://d3s.mff.cuni.cz/teaching/computer_architecture/docs/02_vykonnost.pdf (visited on 04/15/2015).
[12] Manek Dubash. Moore's Law is dead, says Gordon Moore. In: Techworld (2010). url: http://www.techworld.com/news/operating-systems/moores-law-is-dead-says-gordo-moore-3576581 (visited on 04/14/2015).
[13] Kibum Kang et al. High-mobility three-atom-thick semiconducting films with wafer-scale homogeneity. In: Nature (Apr. 2015). url: http://www.nature.com/nature/journal/v520/n7549/abs/nature14417.html (visited on 05/02/2015).
[14] Victoria Zislina. Why has CPU frequency ceased to grow? url: https://software.intel.com/en-us/blogs/2014/02/19/why-has-cpu-frequency-cease-to-grow (visited on 04/15/2015).
[15] Gene Myron Amdahl. AFIPS Spring Joint Computer Conference. In: American Federation of Information Processing Societies. 1967.
[16] Jakub Yaghob. Programování v paralelním prostředí: Teorie paralelního programování. url: http://www.ksi.mff.cuni.cz/lectures/NPRG042/html/ppp-01-theory-en.ppt (visited on 04/15/2015).
[17] Vladimir Kozlov and Sandhya Viswanathan. Optimizing the future Java through collaboration. Intel Corporation.
[18] OpenMP ARB. About the OpenMP ARB and OpenMP.org. url: http://openmp.org/wp/about-openmp (visited on 05/01/2015).
[19] Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). One Rogers Street, Cambridge MA, USA: The MIT Press, Oct. 2007. isbn: 978-0262533027.
[20] Matthijs van Waveren. OpenMP 4.0 API Released. url: http://openmp.org/wp/openmp-40-api-released (visited on 04/13/2013).
[21] OpenMP ARB. OpenMP Compilers. url: http://openmp.org/wp/openmp-compilers (visited on 05/03/2015).
[22] Aleksandar Prokopec and Heather Miller. Parallel Collections, Scala Documentation. url: http://docs.scala-lang.org/overviews/parallel-collections/overview.html (visited on 04/16/2015).
[23] Tutorials Point (I) Pvt. Ltd. C++ Multithreading, Tutorialspoint. url: http://www.tutorialspoint.com/cplusplus/cpp_multithreading.htm (visited on 04/16/2015).
[24] Terence Parr. The Definitive ANTLR 4 Reference. 2nd. Pragmatic Bookshelf, Jan. 2013. isbn: 978-1934356999.
[25] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. 2nd. Addison Wesley, Nov. 2000. isbn: 978-0201441246.
[26] Oracle. Executors (Java Platform SE 7). url: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executors.html#newFixedThreadPool(int) (visited on 05/05/2015).
[27] OpenMP ARB. OpenMP Application Program Interface. July 2013. url: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.
[28] Mark Kambites. Java OpenMP. EPCC SSP 1999. url: https://www2.epcc.ed.ac.uk/computing/research_activities/jomp/papers/ssp059.pdf (visited on 04/24/2015).
[29] J. M. Bull and M. E. Kambites. JOMP: an OpenMP-like Interface for Java. In: ACM Java Grande 2000 Conference. San Francisco, California, USA: ACM, June 2000. url: https://www2.epcc.ed.ac.uk/computing/research_activities/jomp/papers/acmJG0.pdf.
[30] Martin Odersky, Lex Spoon, and Bill Venners. Programming in Scala: A Comprehensive Step-by-Step Guide. 2nd. Artima Inc, Jan. 2011. isbn: 978-0981531649.
[31] Martin Odersky. What is Scala? url: http://www.scala-lang.org/what-is-scala.html (visited on 04/17/2015).
[32] Paul Chiusano and Runar Bjarnason. Functional Programming in Scala. 1st. Manning Publications, Sept. 2014. isbn: 978-1617290657.
[33] Bryan O'Sullivan, John Goerzen, and Don Stewart. Real World Haskell. 1st. O'Reilly Media, Dec. 2008. isbn: 978-0596514983.
[34] Scott Chacon. Pro Git (Expert's Voice in Software Development). 1st. Apress, Aug. 2009. isbn: 978-1430218333.
[35] Scott Bellware. Behavior-Driven Development. In: Code magazine (June 2008). url: http://www.codemag.com/article/0805061 (visited on 04/16/2015).
[36] Daniel Hinojosa. Testing in Scala. 1st. O'Reilly Media, Dec. 2012. isbn: 978-1449315115.
[37] Jiří Matoušek and Jaroslav Nešetřil. Kapitoly z diskrétní matematiky. 4th. Karolinum, 2009, pp. 34-36. isbn: 978-80-246-1740-4.
[38] Antonín Steinhauser et al. DOs and DON'Ts of Conducting Performance Measurements in Java. In: The 6th ACM/SPEC International Conference On Performance Engineering. Ed. by Yolande Berbers and Willy Zwaenepoel. New York, USA: ACM, Feb. 2015, pp. 337-340. isbn: 978-1-4503-3248-4. doi: 10.1145/2668930.2688820. url: http://d3s.mff.cuni.cz/~steinhauser/icpe2015tt.pdf.
[39] Jiří Anděl. Statistické metody. Prague: MATFYZPRESS, 2007. isbn: 8073780038.
[40] Richard Taylor. Interpretation of the Correlation Coefficient: A Basic Review. In: Journal of Diagnostic Medical Sonography (1 1990). url: http://www.sagepub.com/salkind2study/articles/05Article01.pdf (visited on 04/21/2015).
[41] Jiří Anděl. Základy matematické statistiky. 3rd. Prague: MATFYZPRESS, 2011. isbn: 978-80-7378-162-0.
[42] Milan Hladík. Lineární algebra (nejen) pro informatiky. url: http://kam.mff.cuni.cz/~hladik/LA/text_la.pdf (visited on 04/21/2015).
[43] Jim Frost. How to Interpret Regression Analysis Results: P-values and Coefficients. url: http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients (visited on 05/08/2015).
[44] Jim Frost. Regression Analysis Tutorial and Examples. url: http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples (visited on 05/08/2015).
[45] Jim Frost. Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit? url: http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit (visited on 05/08/2015).
[46] Karel Zvára. Regrese. Prague: MATFYZPRESS, 2008. isbn: 978-80-7378-041-8.
[47] Ari Lerner. ng-book - The Complete Book on AngularJS. 1st. Fullstack io, Dec. 2013. isbn: 978-0991344604.
[48] Leonard Richardson. RESTful Web APIs. 1st. O'Reilly Media, Sept. 2013. isbn: 978-1449358068.
[49] Google. AngularJS: Miscellaneous: FAQ. url: https://docs.angularjs.org/misc/faq (visited on 05/03/2015).
[50] Peter Hilton, Erik Bakker, and Francisco Canedo. Play for Scala: Covers Play 2. 1st. Manning Publications Co., Oct. 2013. isbn: 978-1617290794.
[51] Alan Forbes. The Joy of Bootstrap: A smarter way to learn the world's most popular web framework. 1st. Plum Island Publishing, Aug. 2014.
[52] Core Team of Bootstrap. Components - Bootstrap. url: http://getbootstrap.com/components/#navbar (visited on 05/03/2015).
[53] Neil Middleton and Richard Schneeman. Heroku: Up and Running. 1st. O'Reilly Media, Nov. 2013. isbn: 978-1449341398.
[54] Inc. Heroku. Dynos and the Dyno Manager. url: https://devcenter.heroku.com/articles/dynos#dyno-sleeping (visited on 05/03/2015).
[55] Typesafe Inc. Getting Started with sbt: Directory structure. url: http://www.scala-sbt.org/0.13/tutorial/Directories.html (visited on 05/10/2015).
List of Figures
1.1 Actor pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

List of Tables
2.1 Flynn's taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1 Preprocessor work-flow . . . . . . . . . . . . . . . . . . . . . . . 81
Attachments
This chapter describes the enclosed digital attachments. All attachments are
compressed into a single ZIP archive whose content is explained below.
omp4j/
The preprocessor repository. It contains all sources that are required for
compilation (see chapter 6.2.2). The whole directory follows the SBT project
structure [55].
omp4j/examples/ This directory contains advanced examples such as
lambda function translation and various anonymous classes. The files
are also employed as the unit tests.
omp4j/src/ Preprocessor source root following SBT project structure [55].
benchmark/
The benchmark repository which contains the whole benchmarking frame-
work.
benchmark/resources/ This directory contains all resources that were
measured.
benchmark/statistics/ The statistical analysis scripts and resources. The
data/ subdirectory contains all measured data in .csv format.
www/
The website repository. In order to run the website locally, the Activator
platform must be installed. By using activator ui, the proper directory can be
selected and the application run. However, the website also runs online10
in order to provide simple access without a local installation.
omp4j-1.2.jar
The assembled dependency-free preprocessor. The JAR was compiled with
javac 1.6; thus, it should run on all supported JVMs.
antlr-runtime-4.4.jar
The packed ANTLRv4 grammar recognizer. It is automatically downloaded
from the internet11 during the compilation. However, it must be provided
by the user if no internet connection is available.
10 http://www.omp4j.org
11 http://omp4j.petrbel.cz/antlr-runtime-4.4.jar
thesis.pdf
A digital copy of this text.
Appendix A Preprocessor Work-flow
Table 1 presents the preprocessor work-flow in detail. Used packages and classes are highlighted in the following columns.
ompPrivate : PRIVATE ;
ompFirstPrivate : FIRSTPRIVATE ;

ompSchedule : SCHEDULE '(' (VAR | .)? ')' ;
threadNum : THREAD_NUM '(' ompNumber ')' ;
ompAccessModifier : (ompPublic | ompPrivate | ompFirstPrivate) '(' ompVars ')' ;

ompVars : (ompVar | (ompVar ',')+ ompVar) ;
ompVar : VAR ;
ompNumber : NUMBER ;


// LEXER RULES
OMP : 'omp' ;
PARALLEL : 'parallel' ;
FOR : 'for' ;
SECTIONS : 'sections' ;
SECTION : 'section' ;
SINGLE : 'single' ;
MASTER : 'master' ;
BARRIER : 'barrier' ;
ATOMIC : 'atomic' ;
CRITICAL : 'critical' ;

PUBLIC : 'public' ;
PRIVATE : 'private' ;
FIRSTPRIVATE : 'firstprivate' ;
SCHEDULE : 'schedule' ;
THREAD_NUM : 'threadNum' ;

...
Appendix C IOMPExecutor
This appendix provides the executor interface. The interface is final and may not
be modified, in order to maintain backward compatibility of the processed files.
package org.omp4j.runtime;

import java.util.concurrent.Executor;

/**
 * Extension to the Executor class. Provides methods for the omp4j preprocessor.
 */
public interface IOMPExecutor extends Executor {

    /** Block the current thread until all tasks are finished. */
    public void waitForExecution();

    /** Get the unique id of the thread from which the method is called.
        The range is [0, threadNum). */
    public int getThreadNum();

    /** Return the total number of threads used. This number is usually
        the same as the one given to the constructor. */
    public int getNumThreads();

    /**
     * Simulate a barrier hit. Delay this thread until all threads have
     * hit the barrier.
     * @param barrierName name of the barrier hit.
     */
    public void hitBarrier(String barrierName);
}
Appendix D hitBarrier implementation
Appendix E BSD License
This appendix provides the full BSD license. All project repositories, including
the text of this thesis, are published under the following license.
[The "BSD license"]
Copyright (c) 2015 Petr Belohlavek
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
3. The name of the author may not be used to endorse or promote products
   derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Appendix F Fibonacci Distribution
[Four histograms of measured speedup frequencies: the first pair spans speedups of roughly 6-12, the second pair roughly 15-35.]
The measured data contain values that exceed the optimum speedup. This
phenomenon is caused by the pairwise speedup computation and is eliminated by
the CLT [39]. The distributions appear normal; however, the Shapiro-Wilk test
rejects the normality hypothesis because of the described phenomenon. We decided
not to filter out these results in order to present the data unaltered.
Appendix G Simple Linear Regression
Appendices G and H present the output of the R command summary applied
to the first and the second linear model, described in chapter 5.2.4.
Call:
lm(formula = dataset$speedup ~ dataset$cores + I(dataset$cores^2))

Residuals:
    Min      1Q  Median      3Q     Max
-2.2220 -0.3854 -0.0917  0.3252  7.1028

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.669882   0.126153    5.31 1.68e-07 ***
dataset$cores        1.201956   0.011876  101.20  < 2e-16 ***
I(dataset$cores^2)  -0.012727   0.000235  -54.16  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8832 on 477 degrees of freedom
Multiple R-squared: 0.9889,  Adjusted R-squared: 0.9888
F-statistic: 2.122e+04 on 2 and 477 DF,  p-value: < 2.2e-16
Listing 2: Simple linear regression summary
Appendix I Matrix Multiplication
The following source code was used in the matrix multiplication benchmark (see
chapter 5.3).
protected void runBenchmark(final int[][] A, final int[][] B) {

    int result[][] = new int[workload][workload];

    // omp parallel for schedule(dynamic)
    for (int i = 0; i < workload; i++) {
        for (int j = 0; j < workload; j++) {
            int d = 0;
            for (int k = 0; k < workload; k++) {
                d += A[i][k] * B[k][j];
            }
            result[i][j] += d;
        }
    }
}
Appendix J Demo Controller