1. Introduction
Facing the growing demand for computing power, clock-frequency scaling of single-core processors has hit a bottleneck. Processor manufacturers compensate for the performance lost when clock frequencies are lowered to reduce power consumption by increasing the number of processor cores, which gives multi-core processors a clear computing advantage over single-core processors. As more applications are developed on multi-core platforms, parallel programming models have gradually gained favor among programmers. The concept of parallelism is already embodied in operating systems in the form of threads. However, thread libraries such as Pthreads lean toward task parallelism and require programmers to spend considerable effort on low-level thread operations. To let programmers focus on business logic instead, several easy-to-use parallel programming specifications have been proposed, such as OpenMP, based on shared memory, and MPI, based on message passing [1]. To improve the efficiency of parallel programming on multi-core DSPs, this paper describes in detail how the compiler schedules and allocates tasks for a multi-core DSP within a parallel domain, and explains how to implement fork-join parallel computing on the multi-core DSP platform.
(Figure: the fork-join execution model, in which the main thread repeatedly forks into parallel execution and joins back.)
(Figure: inside the parallel domain, the cores execute the task functions and report the completion of the task.)
The translator comprises two steps: analysis and transformation. Source 1 is turned into intermediate code in the form of an abstract syntax tree (AST) by lexical, syntactic, and semantic analysis; the AST is then transformed to produce source 2. In this process we use the Lex and Yacc tools for lexical and syntactic analysis, and build the AST while executing the semantic actions.
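As a rough C sketch of this flow (parse_to_ast, transform_omp_nodes, and emit_source are illustrative names and assumed signatures, not the translator's real interface):

    #include <stdio.h>

    typedef struct ast_node ast_node;                  /* AST node type (defined below)     */
    extern ast_node *parse_to_ast(FILE *in);           /* analysis: Lex/Yacc build the AST  */
    extern void transform_omp_nodes(ast_node *root);   /* transformation: rewrite OpenMP nodes */
    extern void emit_source(const ast_node *root, FILE *out); /* print the result as text  */

    int main(void)
    {
        ast_node *root = parse_to_ast(stdin); /* source 1 in, AST out              */
        transform_omp_nodes(root);            /* replace OpenMP subtrees           */
        emit_source(root, stdout);            /* traverse new AST, emit source 2   */
        return 0;
    }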
Fig.4 shows the AST corresponding to a simple OpenMP program. Each box represents an AST node, and different types of AST nodes are designed for the different kinds of statements in the program (variable declarations, function definitions, expressions, etc.). The information recorded in each node differs. For example, in Fig.4 the AST node of a function definition records the function name (decl member) and the function body (body member). A function call belongs to the expression class, and the corresponding AST node records the left value (left member) and right value (right member) of the expression. An OpenMP directive corresponds to its own node type, which is dealt with in the transformation step. These nodes are linked and organized into a tree, so the whole program can be represented by the single AST node at the top.
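A minimal C sketch of such tagged nodes, assuming a design along these lines (the type name and any members beyond decl, body, left, and right are illustrative):

    typedef enum { NODE_FUNC_DEF, NODE_EXPR, NODE_OMP } node_type;

    typedef struct ast_node {
        node_type type;
        union {
            struct {                       /* function definition          */
                const char      *decl;     /* function name                */
                struct ast_node *body;     /* function body                */
            } func;
            struct {                       /* expression, e.g. a call      */
                struct ast_node *left;     /* left value                   */
                struct ast_node *right;    /* right value                  */
            } expr;
            struct {                       /* OpenMP directive node,       */
                struct ast_node *body;     /* handled in the transformation */
            } omp;
        } u;
    } ast_node;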
After the AST is built from source 1, the transformation begins. The transformation is completed while traversing the whole AST, but most AST nodes, those corresponding to standard C statements, need no transformation; only the OpenMP nodes must be processed. The specific operation is to cut off the subtree rooted at an OpenMP node and replace it with the AST subtree of the intended replacement code. Correspondingly, Fig.4 shows the OpenMP node in the dashed box being removed from the original AST and replaced with a new subtree. The new subtree must be embedded cleanly, without changing any other structure of the original AST. After the new AST is constructed, it is traversed again to print the program as text, i.e., to generate source 2.
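Building on the node sketch above, the in-place grafting could look roughly as follows; build_runtime_subtree is an assumed helper that constructs the AST of the generated runtime calls:

    extern ast_node *build_runtime_subtree(ast_node *omp);  /* assumed helper */

    void transform_omp_nodes(ast_node *n)
    {
        if (n == NULL)
            return;
        if (n->type == NODE_OMP) {
            ast_node *sub = build_runtime_subtree(n);
            *n = *sub;   /* graft the new subtree where the OpenMP node was;
                            the surrounding AST is left untouched */
            return;
        }
        switch (n->type) {   /* standard C nodes: just recurse into children */
        case NODE_FUNC_DEF: transform_omp_nodes(n->u.func.body); break;
        case NODE_EXPR:     transform_omp_nodes(n->u.expr.left);
                            transform_omp_nodes(n->u.expr.right); break;
        default:            break;
        }
    }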
(Fig.4: AST of a simple OpenMP program. The root node leads to a function definition whose body contains the OpenMP node, shown in a dashed box; its subtree is an expression node of type function call, with left member printf and right member "hello".)
The parallel-domain creation function execute_parallel is called only by the master core, and its concrete implementation is provided by the runtime. The function takes three parameters: the number of threads participating in the execution of the parallel domain (in this case declared by num_threads), the task function name (in this case _thrFunc0_), and the structure that records the shared-variable information (in this case shvars). Since the shared variables of a parallel domain vary with the program, the translator packs all the shared variables of a parallel domain into one structure so that they can be passed conveniently as a single parameter.
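For concreteness, the code generated around a parallel directive might take roughly the following shape; the struct layout, the signature of execute_parallel, and all names other than those quoted above are assumptions:

    extern void execute_parallel(int nthreads, void (*task)(void *), void *shared);
    extern void _thrFunc0_(void *);   /* task function generated by the translator */

    struct shvars_t {                 /* all shared variables of one parallel domain */
        int *a;                       /* recorded by address, so every core sees the same object */
    };

    static int a;                     /* shared variable; placement in the shared area assumed */

    void run_region(void)             /* executed by the master core */
    {
        int num_threads = 4;
        struct shvars_t shvars = { &a };
        execute_parallel(num_threads, _thrFunc0_, &shvars); /* create the parallel domain */
    }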
A parallel domain corresponds to one task function, which is called by every core that enters the parallel domain. Inside the task function, the data environment declared in source 1 must be reconstructed before the parallel statements are executed. This is done by redefining and initializing the shared variables (under their original names) and the private ones. As the above example shows, the structure used to record the shared-variable information before calling execute_parallel is exactly the one used inside the task function to rebuild the data environment: the structure obtained from the get_shared_vars function is the parameter variable that was previously passed to execute_parallel.
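A sketch of the matching task function, under the same assumptions (the signature of get_shared_vars is likewise assumed):

    extern struct shvars_t *get_shared_vars(void *arg);  /* assumed signature */

    void _thrFunc0_(void *arg)        /* called by every core entering the domain */
    {
        /* rebuild the data environment declared in source 1 */
        struct shvars_t *shvars = get_shared_vars(arg); /* the struct passed to execute_parallel */
        int *a = shvars->a;           /* shared: accessed indirectly through its address */
        int i = 0;                    /* private: redefined and initialized on each core */
        /* ... parallel statements of source 1 ... */
    }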
The reason shared variables must be created for the parallel domain and their information passed into the task functions is that each core stores its execution context in its own local memory. If a variable is declared private in the parallel domain, no additional processing is needed. For a shared variable, however, all cores in the parallel domain must be able to access it symmetrically: the variable must be declared in the shared area, and its address must be recorded by the master core and sent to the slave cores, so that the parallel statements in the task functions can access it indirectly through that address.
4. Runtime implementation
This section introduces the runtime support for the parallel scheme above, including the key data structures, the implementation of the parallel-domain creation function that appears in source 2 after the parallel directive is transformed, and the execution flow of the slave cores in the parallel domain.
Therefore, the slave cores that receive the threads all execute the same task function, and the threads are distinguished by their numbers.
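For example, a task function can use its thread number to pick out its share of a loop; get_thread_num, get_num_threads, N, and work are all illustrative assumptions:

    extern int  get_thread_num(void *arg);   /* assumed runtime query */
    extern int  get_num_threads(void *arg);  /* assumed runtime query */
    extern void work(int i);                 /* placeholder loop body */
    #define N 1024                           /* placeholder problem size */

    void task_body(void *arg)
    {
        int tid = get_thread_num(arg);
        int nth = get_num_threads(arg);
        for (int i = tid; i < N; i += nth)   /* cores pick disjoint iterations by thread number */
            work(i);
    }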
The life cycle of a thread exists only within a parallel domain. To avoid the overhead of frequently creating and destroying threads, the runtime creates a group of threads organized as a linked list, forming a thread pool. Each time a parallel domain is created, the master core requests the corresponding number of threads from the pool and sends them to the slave cores. After a slave core completes its task, the thread it received from the master core is put back into the thread pool for use in the next parallel domain.
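A minimal sketch of such a pool, assuming the thread structure shown in Fig.5 plus a linked-list member (the exact layout is an assumption):

    struct thread {
        void *(*func)(void *);       /* task function (as in Fig.5)          */
        void *num;                   /* thread number (as in Fig.5)          */
        struct thread *next;         /* link inside the pool's free list     */
    };

    static struct thread *pool;      /* head of the free list, built at startup */

    struct thread *pool_get(void)    /* master core: take a thread for a slave core */
    {
        struct thread *t = pool;
        if (t != NULL)
            pool = t->next;
        return t;
    }

    void pool_put(struct thread *t)  /* slave core done: return the thread for reuse */
    {
        t->next = pool;
        pool = t;
    }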
The master core monitors the task-completion status of the slave cores in the parallel domain through the executive body and the thread structure. As shown in Fig.5, a key variable is recorded in the parent executive body: the number of slave cores that have not yet completed their tasks in the parallel domain. When the master core creates a parallel domain, this variable is initialized to the number of threads in the domain. After the master core sends a thread to a slave core and that slave core completes its task, the slave core accesses the variable through the parent executive body and decrements it by 1.
The variable is therefore shared by multiple cores, and its value changes as the slave cores execute. By reading it, the master core can judge whether any slave cores in the current parallel domain have not yet completed their tasks.
(Fig.5: a thread structure, with members void *(*func)(void *) and void *num, points to its parent executive body, which records the number of slave cores that have not completed tasks in the parallel domain.)
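A plausible shape for this bookkeeping, assuming the counter lives in the parent executive body and that the DSP provides an atomic way to decrement it (both are assumptions):

    struct exec_body {
        volatile int unfinished;            /* slave cores not yet done in this domain */
    };

    void domain_create(struct exec_body *parent, int num_threads)
    {
        parent->unfinished = num_threads;   /* initialized when the domain is created */
    }

    void report_completion(struct exec_body *parent)
    {
        /* Called by a slave core after it finishes its task. On real hardware
           the decrement must be made atomic (e.g. via a hardware semaphore,
           which is assumed here). */
        parent->unfinished -= 1;
    }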
(Figure: the master core, Core 0, sends Thread 1, Thread 2, and Thread 3 to the slave cores Core 1, Core 2, and Core 3.)
4) Execute the task function to complete the tasks that belong to the core in the parallel domain.
5) Synchronous exit: wait until all the slave cores in the parallel domain have completed their tasks, then leave the parallel domain. Concretely, the master core enters a loop and does not leave it until the variable "the number of slave cores in the parallel domain that have not completed tasks" in the executive body reaches 0.
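Under the same exec_body assumption as above, the synchronous exit of step 5 reduces to a wait loop on the master core:

    void domain_wait(struct exec_body *parent)
    {
        while (parent->unfinished != 0) {
            /* busy-wait: each finishing slave core decrements the counter */
        }
        /* every slave core has reported completion; leave the parallel domain */
    }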
5. Conclusion
This paper designs and implements a parallelizing compiler for OpenMP programs running on a multi-core DSP platform. The compiler-directive instructions in the source program are translated into runtime interface calls, and the native compiler and runtime are then used to compile and link the translated C program. The generated executable, running on the multi-core DSP, lets the master and slave cores execute the code of a parallel domain in parallel. It makes full use of the computing resources of the multi-core DSP while improving the productivity of programmers developing parallel programs. Future work can focus on performance optimization of the compiler, such as quantitative analysis of the computing tasks assigned to each core in the parallel domain, and on adding a more reasonable task allocation strategy to achieve load balance among the computing cores.
References
[1] Q. M. Luo, Z. Ming, G. Liu. The Principle and Implementation of OpenMP [M]. Beijing: Tsinghua University Press, 2012.
[2] X. J. Gong, L. M. Zou, Y. X. Hu. Research on Parallel Program Design Method of Multi-core System Based on OpenMP [J]. Journal of Nanhua University (Natural Science Edition), 2013, 27(1): 64-68.