A - Review - of - Conventional - and - Upcoming - Ap 2019
A - Review - of - Conventional - and - Upcoming - Ap 2019
A - Review - of - Conventional - and - Upcoming - Ap 2019
Abstract: Present day compilers have profusion of code discuss about the trending techniques of compilation.
modification. But all most all compiler follow same old method to Traditional compilers have been in use for generating
optimize code and they apply predetermined sequence like machine codes until the emerging architectural changes, since
scanning, lexing, analyzing syntax and analyzing semantics
followed by generating intermediary code and finally generating these were hardware dependent thus there was a necessity to
target code. Thus optimizing functions without interpreting build compilers that are not hardware dependent and use to
whether code is getting modified or not we can’t assure that after compile source codes for type of microprocessor designed to
compilation it uses less resources and faster execution. In order to control functionality of devices has accentuated coders to need
overcome this problem, techniques like Reverse-Inlining for controlled potency consumption, real-time execution and
(Procedural Abstraction), Cross Linking optimization, trace agility of code.
inlining, Optimizing Leaf Function, Combined code motion and
allocation of registers etc are used which are more efficient and In static compilation, compiler processes the program code
generated better machine codes. As trace inlining technique helps method after method, a control-flow graph (CFG) is constructed
in analyzing enforcement and size of low level code generated. for each method, the graph is optimized and the native code gets
Various inlining rules on the benchmark suites DaCapo 9.12 Bach, generated based on CFG traversal. Suganuma et al, proposed a
SPECjbb2005, and SPECjvm2008 are assessed, proved that just-in-time dynamic compiler(JIT), where it accesses the
compilers based on tracing attains nearly 51% better enforcement
runtime profile information, selects code only if it influences
than method-based Java HotSpot client compiler. Moreover, the
positive effect of the large compiler scope on other optimizations the overall runtime and inline those parts of method bodies only
of compiler is conveyed. One of the techniques used to deicide good [2]. Gal proposed trace-based compilation approach to building
optimizing sequence for programs is artificial neural network. dynamic compilers, wherein cyclic code paths that are
frequently executed are identified and recorded. These traces
Keywords: Compiler Analysis are assembled dynamically into a tree-like data-structure. The
major advantage is that the trace tree only consists of code areas
1. Introduction that are relevant. Edges which appear in static CFG but that are
Process of transforming a program where it considers not executed at runtime, are not considered in the trace
resultant of foregoing step as input to next step and modifies representation, and are assigned to an interpreter in case of
code in such manner that it absorbs minimum expedients like execution. The control flow merge point in the trace tree being
CPU, memory etc. with superior accomplishment is called absent greatly simplifies optimization algorithms which
optimizing code. Machine dependent and machine independent increases performance and reduces the machine code that gets
are kinds of code optimizations. Transforming given code as generated.
stated in target device architecture, modifying induced code is We will see the difference between the traditional compilers
called as Machine dependent optimization. It uses CPU and and modern compilers in their way of optimizations and
registers, instead of using corresponding memory label it uses compiler analysis, also the modifications done to the
established memory label. Machine independent optimization conventional compilers in order to meet the emerging changes.
[1] method optimizes transitional code for better final code.
This paper focuses on new techniques which supports for the 2. Literature survey
emerging computer architectural designs. Very large instruction Phase Ordering optimization techniques: Here instead of
word (VLIW) Architecture, allows exploiting instruction level following fixed sequence for modifying whole program,
parallelism, meaning execution of more than one instruction apply it to discrete portion by this it selects superior
simultaneously which are not inter dependent. In Parallel sequence for modification automatically.
execution code execution doesn’t consume more time and
Phase ordering with genetic algorithm [3]: It is used in
hence code will be more efficient. So in this paper we will also
International Journal of Research in Engineering, Science and Management 348
Volume-2, Issue-5, May-2019
www.ijresm.com | ISSN (Online): 2581-5792
two experiments. First experiment involves process of variable can be circulated through the flow graph and
finding better optimization sequence. To do this evaluate replaced at use of the variable.
optimization sequence (chromosome) through compiling Local sub-expression elimination, while executing the
available benchmarks with sequence, recording their time expressions present in the program variable being
taken for execution, calculating their fastness by accessed in current step whose value is being computed in
normalizing time taken for running with running time previous step and if it’s not modified in such scenarios
observed by compiling benchmarks at O3 level. This compiler can skip process of recounting value for the
corresponds to “best overall sequence”. Second same variable.
experiment involves finding superior optimization Local strength reduction is frequently applied in
ordering refers to “best sequence per benchmark” customary compiler optimization to allow replacement of
Automatic Feature generation [4]: Components of system an expression by its meaning. That is, it replaces more
are training data generation, attribute generation and resource consuming operations (multiplication) by less
machine learning. Extracting intermediate codes of resource consuming operation (addition).
compiler and optimal values for heuristic is task of data Global optimization, different from local optimization is
generation. This does process iteratively to obtain best applied on global parameters. Data flow analysis is used for
heuristic value for each program. The extracted codes will code optimization during this step. The various methods
be passed to feature search which explores characteristics followed are:
of intermediate codes and generates vector by considering Redundant (common) Sub-expression elimination, in this
evaluated values of all programs, pass generated vector to kind of elimination redundant expressions is identified
machine learning tool. Job of tool is to fetch globally and are calculated once and replaced by the result
characteristics from base attribute list and calculate value everywhere else.
for that characteristic and add characteristic with best Dead Code Elimination: After all the steps are performed
heuristic value to list, if two characteristics have same as mentioned so far, the optimized code consists of some
quality or value then add short characteristic to the list. used lines of instructions which are eliminated in this step.
And this process will be repeated iteratively. List gets Loop Optimization [6], if each time the result computed
generated in final is called as latest attribute list. within a loop is same then computing it in each iteration is not
Trace Inlining: In previous work, found on Oracle's required. If an induction variable is found within a loop that is
JavaTMHotSpot client compiler a trace-based JIT a variable whose value on each loop iteration is proportional to
compiler [5] was implemented. It focused on how trace the calculation of the iteration index. When such variables and
inlining is performed and its advantages. Application of the value they compute are found, often high cost operations are
several trace inlining rules was presented for trace-based replaced by low cost operations or the variable itself can be
JIT compiler. Influence of rules on peak enforcement, size deleted.
of generated low level code and compilation time for Reverse-Inlining (Procedural Abstraction) is used to achieve
DaCapo 9.12 Bach benchmark suite was evaluated. code size reduction. This method replaces the function call with
function definition, thus increasing the speed of execution.
3. Discussions Some of the modern compiler optimization and compiler
Traditional compilers aim to remove redundancies, analysis are discussed below. It also uses data flow analysis but
inefficiency and reordering the data and operations. It uses a with alias analysis that is it is used to find if data storage can be
number of techniques for this purpose like data flow analysis retrieved in many other ways. If two pointers are pointing to the
where we try to understand the flow of the data. If a variable is same memory location they are said to aliased. Advanced
defined we try finding out the way it‘s used. It can be either compiler analysis applies the above mentioned aliasing in
forward analysis or backward analysis, in forward analysis we determining the existence of the references at the exit of the
supply the value of variable to succeeding code whereas, in block.
backward analysis we can supply information about some Combined code motion [7] and register allocation, Code
succeeding properties "back in time", like in case of dead code motion tries in placing the such instruction together than
elimination we can withdraw variables if they are never read in are independent of each other and can be ran
future. simultaneously, thus implementing parallel execution.
Local optimization is used to improve the code and done The Register Allocation and Code Motion
with the following methods., (RACM)minimizes register pressure by implementing
Local constant folding where Expressions with operands code motion as explained above, followed by duplicating
holding fixed values can be evaluated at time of the code and finally storing the value of variables from
compilation and this causes faster execution, reduction in register to memory.
size of code. Cross linking optimization, this optimization can be
Local constant circulation, fixed values allocated to a applied globally or locally. This is used where functions
International Journal of Research in Engineering, Science and Management 349
Volume-2, Issue-5, May-2019
www.ijresm.com | ISSN (Online): 2581-5792
consist of switch statements with same tail codes, which interpreters and other compiled traces then directly invoke
is replaced after the switch block. generated machine code. If a prerequisite condition for
Address Code Optimization, speeds up instructions and optimization is violated, System reverses optimizations to trace
execution time by reordering data in memory to minimize recording interpreter. During reversing the optimizations all
and simplify computation of address. values that are still in current compiled frame are saved first and
Type Conversion Optimization, involving the physical then one or more interpreter frames replaces that compiled
size of the data known as data processing, and semantics frame and then execution is continued in trace recording
of data processing operations. Substructure for a different interpreter. The trace recording interpreter takes charge here
of data types is the main goal of most compilers; as such after as, a partial trace that starts at the point of reverse
compilers insert many implicit data type conversions, optimization, instead of from where the trace anchor is
such as zero extension. recorded.
Multiple Memory Allocation(MMA) is one of the In trace recording approach, every individual thread
technique used in trending compilers as in this technique maintains tracing stack for traces that gets recorded.
instructions are saved and uploaded in many registers Instructions that can alter control flow is recorded always at
simultaneously. stack’s top. Method invocation also gets notified in caller’s
Upcoming compilers are also expected to minimize code trace. Receiver’s class also gets stored if it is a virtual method.
length, increase programs speed to run by accurate usage A new trace for caller function is recorded onto stack and
of memory which is appreciably seen in the techniques recording of the caller is resumed. On completion of caller’s
they adopt. execution, trace of called function gets popped and stored in a
Artificial neural networks [8] used to find out superior trace repository. A pointer to caller’s function trace is used to
sequence modification, represented as reticular nervous. Kinds link caller and called function by recording in caller's trace and
of feed forward network are one layer with one insert and one then the recording of called function is continued. Information
resultant layer. Multilayer with one insert and one resultant which is context dependent over the methods gets preserved by
layer and includes multiple underlying layers. Recurrent layer this linkage. Trace doesn’t get stored but counter gets
is like multilayer but includes minimum one feedback loop. As incremented in the previously recorded trace, when a trace that
mentioned before resultant of each layer becomes input to next is saved is again recorded. Traces are considered to be different
layer, link across neurons represented by weighted vectors. depending upon the execution flow or the caller function traces.
ANN works in iterative manner where for each iteration it takes Hence, accurate information of call regarding every executed
characteristic of code of present state and does evaluation to path is recorded using trace-linking. Loop and recursive
determine superior optimization. modules are not linked to their parent trace to reduce the
Enactment of model is done by using a dynamic compiler number of traces that gets recorded. Assuming that all the traces
4cast-xl that builds ANN, integrates it to Jikes RVM’s for a particular anchor has been recorded after recording is done
optimization driver and performs phase ordering optimizations. a specific count of times for that anchor, machine
And this dynamic compilation has to be done repeatedly by understandable instructions which is optimized is generated
ensuring the steps given below: from those compiled traces. Commonly executed operations are
1. Generating attribute vector of current method’s state. executed straightly by the interpreter’s assembler and complex
2. Initiating outline of code. operations are executed in C language-based runtime
3. Determine superior sequence to modify by using ANN. environment. Efficient operations done by many threads
Following is an overview on brief description of VM’s simultaneously enables Java thread structure to jump between
structure. Class loader does following functions of parsing, two interpreters independently. A local buffer is maintained by
loading and verifying class files and run-time data structures are each thread to achieve the best performance. The data structure
also provided to different parts of VM. Then during bytecode used to store recorded traces allows data to be read and denies
pre-processing step, loops are detected and tracing-specific data locks and atomic instructions during the reading process. But
structures are created. To implement trace recording quite a few locks data structure during writing threads only when a new
steps are followed by Java HotSpot VM template interpreter trace is found.
that is key is duplicated (a copy made) and instrumented which In trace inlining, static and dynamic inlining methods [9] are
results in a usual and a trace recording interpreter. With respect supported by trace-based JIT compilers with the help of the
to the initial executions, normal interpreter is used. Counter of recorded trace information. Trace inlining replaces function
invocation of that trace anchor increments each time basic calls with its code itself but traces that are commonly executed
interpreter comes across a trace anchor. Execution switches to are only inlined instead of entire methods. Unlike method-based
trace recording interpreter, when counter exceeds specified compilers, context dependent information is contained in the
threshold. Trace-based JIT (just in time) compiler is based on recorded traces which helps to avoid inlining of function
HotSpot client compiler. Compiler records traces into a trace definition parts that are unnecessary for the current caller but
graph when traces have been recorded often enough. The are executed frequently overall. Aggressive inlining of virtual
International Journal of Research in Engineering, Science and Management 350
Volume-2, Issue-5, May-2019
www.ijresm.com | ISSN (Online): 2581-5792
methods takes place since methods of only a specific type of loops, the exit will have to be filtered out since in spite of loop
receiver gets invoked at a call site. body, occurrences of loop entry are compared with occurrences
First of all, the maximum size of the trace, which is of loop exits, which would restrict the scope of compilation and
dependent on the relevance of the call site, that has to be placed again rise the occurrences of deoptimization.
in inline at the current call site is computed. Then, the invoked In order to discuss about trace inlining heuristics, execution
traces at the current call site areinlined based on certain frequency of each block of trace graph, using recorded traces,
heuristics. So, the method invocation is replaced with the trace is evaluated. Then, estimated frequency is then divided with a
graph that is built from the traces that should be inlined. Then, reference value to estimate relevance of each block which in
return instructions are replaced with jumps to the next line of turn will be used to compute the call site’s relevance. So, that
command after the call and exception-throwing instructions reference value is chosen utilizing one of the algorithms
will be attached tohandlers of exception. Usually, it’s only the mentioned below.
traces that get invoked by the current caller that are inlined. Simple: Estimated execution frequency of each block of
However, in certain cases, all caller function traces are trace graph is divided by completeimplementation
considered as candidates for inlining, if the callee traces were occurrences of complete merged execution paths of graph.
compiled before the caller’s trace recording had even begun. Value lies in domain 0 to 1, excluding 0.
Based on dead code elimination technique [10], certain Most frequent trace: Execution frequency of each block is
candidates for inlining can be rejected based on specific divided by execution frequency of merged trace that gets
parameters that is passed to the callee by the caller. executed most number of times. So, call sites residing in
By taking into consideration CHA [11] and context blocks shared among multiple traces would have a greater
dependent trace information, in case of virtual calls, the actual relevance due to higher execution frequency, while values
receiver class for the calling site can be determined. As in range ]0,1] would be the relevance of a call site residing
recorded receiver classes are used if multiple target methods are only in individual traces.
found by CHA along with an additional run-time check such Path-based: In this approach, firstly successor block, with
that the trace deoptimizes to the interpreter in case of type respect to root block, that is executed most number of
mismatch. In a similar manner, since no context-specific times is determined. Then, marking block as visited
information is available for the loop traces, all recorded traces process is performed over and over until either a loop
are considered as candidates for inlining. However, elimination header orleaf nodes is reached. Then, leastimplemented
of few traces is possible based on parameters and locals. Traces node of all visited nodes will be used to determine the
with no further lead existence in the caller functions trace graph other blocks aptness that are present in trace graph.
for the inlined loop is also eliminated. Hence, values would be in range of [1,∞], for call sites
A special kind of inlining guarded by method guard was deemed to be important, while [0, 1] would be range of
implemented to overcome the issue that a method could not be values for calls that are less important.
inlined if the same method was always invoked by the call site Following are the configurations of the dynamic inlining
but by different receiver types. In the above process the invoked heuristics that generated least size of machine code and
method is compared with the expected method by accessing the increased performance.
virtual table of the receiver. Multiple receiver types can be Minimum code: Depending on relevance of call site,
checked for by extending the type guard to a switch structure if inlining size of 35 bytecodes is modified by this heuristic.
the same method was always invoked by the call site but by Inlining size is decreased for relevance less than 1 and
receiver of type interface. The stated process is definitely increased for relevance greater than 1. This heuristic
cheaper when compared to the interface lookup. The switch like results insmall size of machine code and increased
structure can also be extended to efficient inlining of performance on combination with path based algorithm.
polymorphic calls. Platform specific methods are inlined by Balanced: Inlining size remains unchanged for blocks
the JIT compiler using heuristics of the compiler. The same with relevance value less than 1 which conceivably are
method based inlining is conducted yet the trace-based important calls that are inlined, while inlining size is
compiler performs aggressive inlining of java traces and traces increased by 40 bytes of code by this heuristic for blocks
are also smaller than methods. having relevance greater than 1. This heuristic results in
Further departing edges of similar part of the code is stability between size of code and performance on
compared with the repeatedly implemented departing edge combination with path based algorithm.
determined. Edges to unreachable parts of code and with Performance: Depending on the relevance of call site,
hundred times the minimal frequency are removed from the substantial size of 150 bytecodes is used for inlining by
trace graph. The process has many downfalls and must be taken this heuristic. The inlining size is not increased beyond
care of, the tracing information is based on the program the increased value. Inlining size is decreased for
behavior that may change over time, removing major execution relevance less than 1. This heuristic optimises the
paths might proceed to repeated deoptimizations. In the case of performance on combination with path based algorithm.
International Journal of Research in Engineering, Science and Management 351
Volume-2, Issue-5, May-2019
www.ijresm.com | ISSN (Online): 2581-5792
Greedy: Similarly, large inlining size of 250 bytecodes is up is achieved by tracing configurations for jython
used by this heuristic. This heuristic prevents inlining of benchmark, which is responsible for execution of virtual
traces beyond maximum value. The inlining size is calls, since compiler makes use of recorded trace
decreased if relevance is less than 1. This heuristic results configurations. In terms of compilation time,
optimizes the performance on combination with path configuration based on traces is well organised that even
based algorithm. This heuristic on combination with aggressive configuration greedy takes similar amount of
algorithm that computes the frequent executed paths time as client compiler.
provided better results. Optimization is positively affected by inlining of traces due
Amount of machine code generated is more on invoking to increase inscope of compilation. For SPECjbb2005
small traces in comparison to inlining them hence the above benchmark, canonicalization [12] is increased because of trace
heuristics make sure that the assessors are always inlined. If inlining. For SPECjvm2008 benchmark suite, optimizations is
compilation of calleee merged to substantial size of machine hardly affected due to unavailability of larger compilation
code then inlining is avoided by the heuristics since after a scope. Increased performance is equally spread over all
certain point increasing scope of compilation is discouraged. optimizations listed on benchmark DaCapo 9.12 Bach.
Inheriting of relevance of the parent block by the callee is used Code with effective performance and optimizations are
by our heuristics to minimize size of machine code and inlining produced by server compiler but involve 13times extended
of traces in a nested manner. compilation on benchmark suites DaCapo 9.12 Bach,
To evaluate JIT compiler deployed from traces, it is enforced SPECjbb2005 and SPECjvm2008. Trace-based compiler only
for Java VM of A-32 design from Oracle [20]. Benchmark attains up to 67% of performance of server compiler in
suites DaCapo 9.12 Bach, SPECjvm2008, andSPECjbb2005 SPECjbb2005 benchmark but benefits largely from server
were chosen for evaluating the above discussed heuristics. compiler optimizations. 85% of performance of server compiler
Results evaluated is with respective to results for client is attained for SPECjvm2008 benchmark suite by configuration
compiler based upon the methods. greedy. Server compiler exhibits higher performance on crypto,
Server/client approach is simulated by this SPECjbb2005 mpeg audio, and scimark benchmarks that are loop exhaustive,
benchmark where functions are implemented in database due to advanced optimizations. However, server based compiler
of memory that is subdivided into warehouses. Relative to gets outperformed by trace based compiler on compress and
peak performance, client compiler gets outperformed by sunflow benchmarks, by aggressive trace inlining. In DaCapo
trace-based configurations because of greater 9.12 Bach benchmark suite, on an average 93% of performance
aggressiveness of inlining of traces. Configuration greedy of server compiler is attained by trace based compiler due to
slightly causes an increase in performance but at same presence of benchmarks that are more complex and less loop
time size of machine code is more. Configuration, exhaustive. However, irrespective of basic optimisations
minimum code, is well organized with respect to machine performed by trace based compiler, server compiler's
code and time utilized for compilation and also performance gets outperformed in sunflow, pmd and lu index
performance is decent. Irrespective of number of benchmarks due to context-sensitive and aggressive inlining of
warehouses used client compiler is outperformed by trace traces.
configurations. Performance is utmost when there are 4
warehouses since benchmarking system has 4 cores and 4. Conclusion
each thread processes one warehouse respectively. This paper discusses difference between the traditional
Nine benchmark categories are included in compilers and modern compilers in their way of optimizations
SPECjvm2008 benchmark to measure peak performance. and compiler analysis, also the modifications done to the
Client compiler gets outperformed by tracing conventional compilers in order to meet the emerging changes.
configurations. Highest speed-ups is witnessed on derby It also discusses the java compiler based on traces that manages
and serial benchmarks by tracing configurations. Due to inlining of traces during compilation of JIT instead of at times
small size, benchmarks scimark, mpeg audio and crypto of trace recording, which enforces trace inlining to be more
indicate nearly no increase in performance and is similar selective because of the extra information available. Besides,
to client compiler. However, it’s only traces that gets variant components of the method can be inlined based upon
inlined and not whole methods which decreases the site of call because of context-sensitive nature of traces.
compilation time and machine code that gets generated. Moreover, its proposed that it’s only the traces that are
Fourteen Java applications is included in DaCapo 9.12 implemented often that gets compiled to machine code by
Bach benchmark suite. On the whole, less machine code eliminating infrequently executed traces. Evaluation with
gets generated with respect to all inlining heuristics except benchmark DaCapo 9.12 Bach, SPECjvm2008 and
greedy configuration, yet overall performance increases. SPECjbb200proved that effective performance along with the
Greedy configuration profits benchmarks luindex, pmd, generation of decent amount of machine code can be attained
and sunflow, due to its large inlining size. Highest speed by proper trace inlining. Furthermore, larger compilation
International Journal of Research in Engineering, Science and Management 352
Volume-2, Issue-5, May-2019
www.ijresm.com | ISSN (Online): 2581-5792
scopes attained by trace compiler increases efficacy of compiler and loop transformations,” Conference on Design and Architectures for
Signal and Image Processing [LOOP], 2012.
optimizations and effective performance. [7] Slavko Radonić; MiodragĐukić; NenadČetić, “One solution of loop
invariant code motion compiler optimization,” 22nd Telecommunications
References Forum Telfor (TELFOR),2012.
[8] L. Prechelt, “Exploiting domain-specific properties: compiling parallel
[1] David Padua,“Compilers and the Furture of High Performance dynamic neural network algorithms into efficient code,” IEEE
Computing” , IEEE 22nd International Conference on High Performance Transactions on Parallel and Distributed Systems, 2010.
Computing (HiPC), 2015. [9] J. Cavazos; M. F. P. O'Boyle, “Automatic Tuning of Inlining Heuristics”,
[2] M. Bebenit, Jikes RVM, “Trace Based Compilation in Interpreter-less ...”, ACM/IEEE Conference on Supercomputing “, 2005.
Semantic Scholar, 2008. [10] Hiral H. Karer; Purvi B. Soni, “Dead code Elimination Technique in
[3] Michael R. Jantz; Prasad A. Kulkarni, “Exploiting phase inter- eclipse compiler for Java”, International Conference on Control,
dependencies for faster iterative compiler optimization phase order Instrumentation, Communication and Computational Technologies
searches”, 2013. (ICCICCT), 2015.
[4] Edwin Bonilla; Michael O'Boyle, “Automatic Feature Generation for [11] Jason Sawin; Atanas Rountev, “Assumption Hierarchy for a CHA Call
Machine Learning Based Optimizing Compilation Hugh Leather”, Graph Construction Algorithm”, IEEE 11th International Working
International Symposium on Code Generation and Optimization,2009. Conference on Source Code Analysis and Manipulation, 2011.
[5] Hiroshi Inoue; Hiroshige Hayashizaki; Peng Wu; Toshio Nakatani, “A [12] Shin-Ming Liu; R. Lo; F. Chow, “Loop induction variable
trace based Java JIT compiler fitted from a method-based compiler,” canonicalization in parallelizing compilers,” Conference on Parallel
International Symposium on Code Generation and Optimization, 2011. Architectures and Compilation Technique, 1996.
[6] Grigoris Dimitroulakos; Christakis Lezos; Konstantinos Masselos,”
MEMSCOPT: A source-to-source compiler for dynamic code analysis