Bachelor Thesis
(i) The thesis comprises only my original work toward the Bachelor’s Degree
(ii) Due acknowledgment has been made in the text to all other material used
2 GPU Architecture  17
2.1 GPU Extension Unit  18
2.1.1 Extending RISC-V ISA  18
2.1.2 GPU Hardware Extension  20
2.1.3 RTL implementation  26
2.2 Decode Stage  28
2.3 Issue Stage  29
2.3.1 Data Hazards  29
2.3.2 Dynamic Scheduling  30
2.3.3 Scoreboard Unit  31
2.3.4 Register File  33
2.4 Execute Stage  34
2.4.1 Arithmetic Logic Unit (ALU)  34
2.4.2 Floating Point Unit  40
2.4.3 Load Store Unit (LSU)  60
2.4.4 Control Status Register (CSR) Unit  64
2.4.5 Convolution Unit  66
2.5 Commit Stage  73
2.5.1 Writeback  73
2.5.2 Popcount  73
2.6 Cache Sub-System  74
2.6.1 Background  74
2.6.2 Cache Architecture  89
3 AXI-Controller  111
3.1 Background  111
3.2 AXI Adapter  112
3.2.1 Read Operation  113
3.2.2 Write Operation  114
4 Software Testing Tools Flow  115
4.1 Introduction  115
4.2 Ramulator  115
4.3 Accurate-cycle simulation using Verilator  116
4.4 OpenCL Platform  116
4.5 Models of the OpenCL  117
4.5.1 Platform Model  117
4.5.2 Execution Model  117
4.5.3 Memory Model  118
4.6 The Flow of OpenCL Host Application  119
4.7 POCL  122
4.8 GPU RTL Driver  123
4.9 Testing Steps  123
List of Figures
1.1 Increase of the static power consumption with decreasing the channel length [1]  5
1.2 The frequency and power trends over the years [2]  6
1.3 SIMD architecture  8
1.4 The different instruction formats of RISC-V [3]  10
1.5 FPGA components [4]  11
1.6 FPGA CLB  11
1.7 The process of compiling OpenCL kernels and then dumping the resulting binaries to the target FPGA  12
1.8 Abstracting and managing FPGAs [21]  15
2.74 Fully associative cache: Any block can map to any location in the cache  81
2.75 Handling write requests in Write-through cache  83
2.76 Handling write requests in Write-back cache  84
2.77 The Cache Coherence Problem  86
2.78 The Cache Coherence Problem  86
2.79 The FSM of snoop protocol from the CPU side  87
2.80 The FSM of snoop protocol from the Bus side  87
2.81 Shared memory among different cores  89
2.82 The Cache Architecture  90
2.83 Anatomy of the memory address  91
2.84 The problem of bank conflict  91
2.85 Pseudo code of the virtual port assignment  92
2.86 RTL simulation of bank selector core, when the number of requests is equal to the available banks  93
2.87 RTL simulation of bank selector core, when the number of requests is higher than the available banks  93
2.88 RTL simulation of bank selector core, when two requests are sent to the cache without enabling multi-porting  94
2.89 RTL simulation of bank selector core, when two requests are sent to the cache with multi-porting enabled  94
2.90 Dual-port RAM structure  96
2.91 Single-port RAM structure  96
2.92 RTL simulation of dual-port RAM when initialization is enabled  98
2.93 RTL simulation of dual-port RAM when BYTEEN parameter is equal to 4  98
2.94 Hit under miss scheme  99
2.95 Miss under miss scheme  99
2.96 Integration of the miss reservations with the cache memory  100
2.97 The Lookup operation  101
2.98 Comparator Tree  102
2.99 Access of Four-way interleaved cache banks using block addressing  103
2.100 The stages of accessing the cache bank  103
2.101 High level overview of the cache bank structure  104
2.102 RTL simulation of cache bank 1  105
2.103 RTL simulation of cache bank 2  106
2.104 Find First Tree  107
2.105 Pseudo-code of response merging algorithm  107
2.106 Cache Arbiter  108
2.107 Logical structure of round-robin arbiter  109
2.108 RTL simulation of Round-Robin scheduler  109
4.6 The process of compiling OpenCL applications to binaries that target the RISC-V system using the POCL compiler  122
4.7 The overview of the configuration used to perform cycle-accurate simulation  123
4.8 OpenCL kernel  125
4.9 Code for initializing the OpenCL applications  125
5.28 IPC results of executing the K-NN algorithm to find the 5 nearest neighbors for different core counts  143
5.29 Execution time in milliseconds of the K-NN algorithm to find the 5 nearest neighbors for different core counts  143
5.30 The execution of the Gaussian elimination kernel on the GPU  144
5.31 IPC results of the Gaussian elimination algorithm for different GPU configurations with a constant workload of a 4x4 matrix  144
5.32 Comparison between the execution time of the K-NN algorithm on the GPGPU with 8 cores and Intel Xeon E5-1640  145
5.33 Comparison between the execution time of convolution on the GPGPU with 8 cores, 4 cores, and Intel Xeon E5-1640  146
5.34 GPGPU speedup over Intel Xeon E5-1650  146
5.35 Comparison between the execution time of convolution on the GPGPU with 8 cores, 4 cores, and TPU v2  147
5.36 GPGPU speedup over TPU v2  147
5.37 Comparison between the execution time of convolution on the GPGPU with 8 cores and GPU Tesla-4  148
6.1 The host flow of an OpenCL application on the design environment  150
Acknowledgements
Thirdly, I would like to acknowledge all my lecturers, teaching assistants, and col-
leagues who have contributed to my knowledge in the field of engineering throughout the
years. I am indebted to the management of GIU for creating an environment that fosters
learning, inspiration, and personal growth, thereby enriching my academic experience.
To those who are reading this, I express my gratitude for your time. I hope that you
find the information you seek and gain valuable insights. Your interest and attention to
my research topic are greatly appreciated, and I welcome any comments or questions you
may have.
Abstract
As we enter the post-Moore’s Law era, the utilization of graphics processing units
(GPUs) extends beyond their traditional role in graphics-intensive applications like games.
Nowadays, GPUs are widely employed in sophisticated domains such as big data servers,
machine learning, and medical applications. The rationale behind using GPUs in these
contexts is twofold. Firstly, as we approach the limits of Moore’s Law, where the fre-
quency cannot be easily doubled every 18 months, we encounter performance limitations
over time. Secondly, in the post-Moore era, utilizing multi-processor systems presents
challenges due to the complexity of interconnects between processors and the significant
area they occupy, which can reach up to 40 percent of the total silicon die area. Conse-
quently, GPUs have emerged as a solution for data-intensive applications that demand
high throughput. Despite the significance and popularity of GPUs, open-source GPU options are scarce in the market, which is dominated by Nvidia, whose GPUs are not suitable for low-power embedded devices that require computational power for image-processing tasks such as convolution, pattern recognition, and face detection.
This project aims to develop a general-purpose GPU based on the RISC-V architec-
ture. The selection of the RISC-V architecture is motivated by several factors. Firstly,
its simple design makes it an excellent fit for this project, as the GPU will be designed
for small devices where area and power considerations are critical. Secondly, the RISC-V
instruction set enjoys widespread popularity worldwide and is supported by numerous
open-source tools. Lastly, designing the GPU based on RISC-V ensures independence
from vendor licensing, providing greater flexibility and adaptability.
Chapter 1
Introduction and Literature Review
Figure 1.1: Increase of the static power consumption with decreasing the channel length [1].
As the scaling down of the channel length of the MOSFET transistor continued, the submicron era was reached. In this regime, some of the assumptions about MOSFET properties that made them dominate other transistor types in digital circuits, such as BJTs, are no longer valid. One of the most important assumptions that makes the MOSFET suitable for low-power applications is that it draws zero gate current. Engineers and physicists based this assumption on the thick silicon-oxide layer at the MOSFET gate, which acts as an insulator. However, due to the aggressive scaling of the MOSFET, the thickness of the silicon-oxide layer has been reduced dramatically
until it has reached the diameter of a few silicon atoms. Therefore, the assumption that the MOSFET's gate acts as an insulator is no longer valid. The thin gate oxide leads to the phenomenon of quantum tunneling, which occurs at
the nanoscale due to reasons related to the quantum mechanical properties of particles. It
has no observable impact on long-channel transistors; however, this effect is very serious in today's short-channel transistors. A thin gate oxide increases the probability that electrons tunnel through it. This tunneling effect reduces the degree to which the gate terminal can control the channel formation between source and drain.
This introduces a challenge in turning off MOSFET transistors, as the leakage current
prevents complete cutoff. Consequently, the difficulty of achieving a proper and reliable
off-state in MOSFETs is amplified. The static power of the CMOS logic was assumed for
many years to be negligible compared to dynamic power consumption. This is not the
case in sub-micron CMOS technology. In fact, many researchers have demonstrated that, because of the sub-threshold leakage current caused by the shrinking of transistor dimensions, static power dissipation increases to the point where it can surpass dynamic power consumption. This trend is illustrated in Figure 1.1.
There is always a direct correlation between frequency and power: in order to increase the frequency of a digital circuit, more power has to be consumed. In sub-micron technologies, increasing the frequency not only increases the dynamic power consumption, it also increases the static power consumed due to leakage current. This leads to what is known as the frequency wall, where performance cannot be increased by simply raising the frequency, because the power is proportional to the switching frequency according to Eq. (1):
P = (1/2) · C · V^2 · F        (1)
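As a concrete illustration of Eq. (1), the short C snippet below evaluates the dynamic switching power for a made-up operating point; the capacitance, voltage, and frequency values are placeholders chosen only to show how the terms combine, not measurements of this design.

#include <stdio.h>

/* Dynamic switching power, Eq. (1): P = 0.5 * C * V^2 * F.
   All parameter values used below are illustrative placeholders. */
static double dynamic_power(double c_farads, double v_volts, double f_hz)
{
    return 0.5 * c_farads * v_volts * v_volts * f_hz;
}

int main(void)
{
    /* Example: 1 nF switched capacitance, 1.0 V supply, 2 GHz clock -> 1 W. */
    printf("P = %.3f W\n", dynamic_power(1e-9, 1.0, 2e9));
    return 0;
}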
Figure 1.2: The frequency and power trends over the years [2].
Figure 1.2 shows that the frequency of processors has saturated over the years due to the power wall; moreover, increasing the frequency makes it very difficult to remove the heat generated in the processors as a result of the power consumption.
According to Eq. (1), the consumed power grows with the switching frequency (and with the square of the supply voltage), which makes it very hard for mobile devices that depend on batteries for their power supply to reach high frequencies. For the previous reasons, the trend of increasing the frequency to obtain higher performance is no longer viable, especially for mobile devices.
Therefore, processor designers are focusing on other techniques to increase performance.
One of these techniques is to divide the work between several processors, which will lead
to an increase in total throughput without increasing the frequency. The trend of using
multi-processor chips has its drawbacks. One of the main drawbacks is the complexity of
scaling the number of processors in one chip due to the limitations of power consumption
and the increasing complexity of designing interconnects between different processors in
one chip and the huge area required by these interconnects. The second major draw-
back addressed by [2] is the high amount of latency needed to perform communication
between different processors, which can reach up to 35-50 clock cycles between the pro-
cessors within the same chip, or up to 100 cycles between off-chip processors. The most
difficult challenge in designing multi-core systems is the design of memory hierarchy. The
complexity in designing memory systems of multi-core chips is the difficulty in maintain-
ing memory consistency when a large number of processors can access the same memory
resources. This allows the different processors to communicate with each other and ex-
change information by writing and reading to the same memory locations. However, there
is a critical problem: each processor has its own local cache memory, which it writes to first. The value in the cache is not written back to the shared memory until a miss occurs or the block is evicted because the cache is full. In the meantime, the processor can update the variable in its local cache, leaving an older version of the variable's value in the shared memory. This can cause calculation errors if other processors depend on this data. A possible solution to this problem is to use a cache coherence protocol, which ensures that all processors see the same copy of the data in their local memories. This is achieved through communication between the different processors, invalidating or updating cached copies of the data when necessary. However, implementing a cache coherence protocol can be expensive in both hardware and development cost, especially when the number of processors in one chip is high.
Today there are a lot of applications such as graph analytics, machine learning accel-
eration, and game applications that are performance- and memory-intensive. Therefore
there is a great demand for high-performance architectures that can execute these kinds
of applications with the lowest possible power consumption. Therefore, engineers and designers favor architectures that can deliver maximum performance without consuming a great amount of power. Traditional pipelined in-order processors cannot deliver high performance without hitting the frequency wall. Multi-processor systems, on the other hand, despite their ability to deliver high performance without high frequencies, suffer from design complexity and the high power consumption associated with scaling the number of processors in one chip.
SIMD (Single Instruction Multiple Data) architectures shown in Figure 1.3 can provide
much more performance gain than normal processors without the high complexity of
multi-core systems. In contrast to normal processors, which contain only one execution unit and one load-store unit and therefore lose a lot of performance to stalls caused by the long latency of memory accesses, SIMD architectures have multiple lanes; each lane has its own execute unit and a highly pipelined load-store unit (LSU) to hide the memory access latency, in addition to multiple general-purpose register (GPR) files so that each lane can be handled individually. SIMD architecture is a good candidate for executing multimedia or
machine learning applications, where the same operations are performed on a large patch
of independent data. To hide the latency resulting from cache misses, SIMD processors use multiple lanes to keep the pipeline busy; this is not the case in normal processors, which may waste hundreds or even thousands of cycles waiting for the data to arrive from the cache. To amortize the cost of accessing memory, SIMD processors usually use heavily pipelined load-store units. This means that the program pays the latency of a vector load or store operation only once, as the remaining latency is hidden behind the pipelined access of the first request. In normal processors, a lot of power is consumed by the overhead of accessing the instruction cache; this power can be reduced in SIMD processors by having one fetch unit for all lanes within the processor. If the processor has four lanes, each lane executes the same instruction on different data, and only one instruction is fetched from the instruction cache instead of the four cache accesses of traditional processors, which reduces the fetch and decode overhead.
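As a rough software analogy of this execution model (not code from the actual design), the sketch below conceptually fetches one "instruction" once and lets every lane apply it to its own operands; the lane count of four and the add operation are arbitrary choices for illustration.

#include <stdio.h>

#define NUM_LANES 4  /* illustrative lane count */

/* One "instruction" is fetched and decoded once, then every lane
   executes the same operation on its own operands (SIMD). */
static void simd_add(const int *a, const int *b, int *out)
{
    for (int lane = 0; lane < NUM_LANES; lane++)
        out[lane] = a[lane] + b[lane];
}

int main(void)
{
    int a[NUM_LANES] = {1, 2, 3, 4};
    int b[NUM_LANES] = {10, 20, 30, 40};
    int r[NUM_LANES];

    simd_add(a, b, r);
    for (int lane = 0; lane < NUM_LANES; lane++)
        printf("lane %d: %d\n", lane, r[lane]);
    return 0;
}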
Graphics Processing Units (GPUs) mainly consist of multiple SIMD cores, together with a warp scheduler that assigns each warp to a certain SIMD core. GPUs depend
primarily on multi-threading to hide high memory latency. GPUs have been used for
a long time for executing graphics applications, mainly those requiring real-time perfor-
mance and minimal latency. However in recent years with the need for high-performance
architectures, there has been a shift toward using GPUs as a general-purpose platform to
execute real-time and high-performance applications, as they offer high throughput with
lower energy consumption compared to other architectures. The advancement of portable
languages that have a C-based syntax and can run on heterogeneous systems made it easier
to integrate GPUs into many systems alongside host processors without worrying about compatibility between them. In addition, languages such as CUDA and OpenCL make it easier for the average programmer to write code that targets GPU platforms without requiring in-depth knowledge of the underlying hardware. The use of these languages also
facilitates what is known as heterogeneous computing systems. These systems comprise
a combination of different architectures, such as a conventional processor for traditional
applications and a GPU to accelerate multimedia applications. The development of such
languages enables programmers to use a single language to write programs for these sys-
tems, increasing efficiency and reducing time to market. Recently, GPUs have not only
been used in personal computers and data centers, but are also employed in embedded-systems applications. The embedded-systems field has recently advanced to the point where microcontrollers are used to perform high-performance tasks, including complex image-processing applications such as face identification and pattern recognition. Many of these devices operate on low-power supplies, in contrast to personal computers or data centers that use high-power supplies, which makes it more challenging to design a GPU that can be integrated with embedded devices, especially in the IoT industry, where some applications rely on batteryless devices. The aim of this project is to develop a low-power general-purpose graphics processing unit suitable for embedded-systems applications, based on the RISC-V architecture.
1.2 Background
Despite the fact that GPUs are widely utilized as accelerators in different applications
in personal computers and data centers, there is a very limited number of open-source
GPU architectures in the market, which affects their usage in different potential applica-
tions. The fact that most of the GPU market share belongs to Nvidia makes it very hard to integrate other GPUs into existing projects, because most tools and languages, like CUDA, are designed specifically for Nvidia GPUs. On the other hand, Nvidia
is known for manufacturing large GPUs that consume relatively high power and occupy
a significant amount of die area. These GPUs are not suitable for low-power embed-
ded applications that need an accelerator to execute computationally intensive tasks like
matrix multiplication, Gaussian elimination, and image processing. Another important issue is the potential for licensing problems due to political conflicts between countries, which could impact platform production. For the previous reasons, there is a demand for open-source GPUs that are not vendor-dependent and can easily be integrated with other systems to be used as accelerators. However, designing an open-source GPU with the
previous specifications is not a trivial task for several reasons:
The huge time and very high initial cost of the ASIC (Application-Specific Integrated Circuit) design flow, which is not practical for an open-source design. For this reason, many designers tend to simulate their architectures using intermediate-language simulations. These simulations are easy and fast, but they do not model some important aspects of the underlying hardware, such as clock-domain-crossing problems and cache-system latency.
To solve the critical problem of tool support, the GPU implemented in the current project is built mainly on top of the RISC-V (Reduced Instruction Set Computer) processor [3]. RISC-V is an open-source architecture that was developed at the University of California, Berkeley, around 2010. RISC-V is widely known for its simple but efficient architecture, which has significantly lower power consumption than other architectures such as x86. The most important reason for utilizing RISC-V in this project is the availability of numerous tools that support the RISC-V architecture, such as the riscv-gnu-toolchain. In addition, the RISC-V ISA supports a wide range of instruction formats, such as I-type, R-type, J-type, and S-type, and provides many free opcodes that can be used to create user-defined instructions, which are used in the process of extending RISC-V into a general-purpose GPU.
FPGAs (Field Programmable Gate Arrays) are silicon chips that contain an array of
configurable logic blocks (CLBs) as shown in Figure 1.5. They are popular among digital
designers because their internal CLB blocks can be arranged to construct different types
of circuits.
The CLB is the basic building block of the FPGA; using it, the user can create nearly any logic function in hardware. CLBs also implement memory functions and synchronous logic on the FPGA when connected by the routing resources. A CLB consists of the following components: lookup tables (LUTs), flip-flops (FFs), and multiplexers (MUXs), as shown in Figure 1.6.
1.3 Aim
The purpose of this project is to design a full-stack general-purpose graphics processing
unit (GPGPU) that can handle both hardware and software tasks. The microarchitecture
of the GPU is based on the RISC-V processor but with some changes to its instruction
set architecture (ISA) to support multi-threading. The GPU also has a multi-banked
non-blocking cache that provides the high bandwidth needed to run multiple threads on
the GPU. The software platform for this project is OpenCL, which is one of the most
popular programming languages for developing applications on different GPU platforms.
To provide a runtime environment for OpenCL applications, a compiler called Portable
Computing Language (POCL) is used to convert OpenCL kernels to executable binaries
that can run on the RISC-V processor [5]. Then, this binary is transferred to the FPGA
via shared memory that can be accessed by both the host processor and the FPGA as
shown in Figure 1.7.
Figure 1.7: The process of compiling OpenCL kernels and then dumping the resulting binaries to the target FPGA.
These two modifications increase resource usage without affecting the critical timing path
too much [7].
In order to train the machine learning models properly and take advantage of the dense computing capabilities of the FPGA, the workload is divided into two parts:
In order to provide a smooth transition between the two sides, a model converter is deployed for this task. The TensorFlow model is used to train the models on the GPU and
make the best use of GPU resources, and then the CUDA framework is used to convert
TensorFlow code to executables that can run on the FPGA side. In other words, the
model converter acts as a mapping function to facilitate the translation of the training
framework into the inference framework[8].
One of the earliest synthesizable open-source GPU designs in the market is the Nyami
GPU. The Nyami GPU is compatible with different rendering applications, through a
graphics pipeline that contains programmable vertex shaders. The Nyami GPU is also
equipped with a flexible cache hierarchy that allows the adjustment of different cache
properties such as cache associativity and cache levels. Furthermore, Nyami GPU inte-
grates an on-chip debugger to enable real-time performance monitoring[10].
CUDA is the most famous framework for programming GPU platforms; it is used by popular deep-learning frameworks like TensorFlow and PyTorch as a backend to train neural-network models. Despite these facts, CUDA is compatible
only with Nvidia GPU architectures. This problem makes it very hard to program het-
erogeneous systems that include GPUs from other vendors. Engineers and researchers
overcome this issue by using source-to-source translators to convert CUDA code into a
portable language representation that can be executed on different architectures. A major
limitation of this approach is that it needs a lot of manual modifications, causing appli-
cations targeting non-Nvidia GPUs to have a long time to market. A framework called
CuPBoP enables the execution of CUDA kernels on non-NVIDIA devices without the
need for manual code modifications, with a coverage rate of 69.6% on the Rodinia bench-
mark, which is 56.6% performance gain over the average frameworks in the market[11].
One of the frameworks that are designed to enable the execution of CUDA kernels on
CPUs is the COX (CUDA on X86) framework. The mapping of the CUDA kernels occurs
by utilizing compiler IR transformation and a runtime system that supports warp-level
programming functionalities like warp shuffle, warp vote, and cooperative group. COX
framework uses a technique called hierarchical collapsing, that enables the execution of
the warp level programming model. The main working principle of hierarchical collapsing
is to generate two types of loops: inter-warp loops and intra-warp loops. This approach
makes it easier to execute the warp-level programming model. COX framework uses bar-
riers to identify parallel regions to provide synchronization and prevent data hazards[12].
One of the scalable and energy-efficient vector processors that are built over RISC-
V architecture is the Ara processor. This processor is designed to work with Ariane, a
scalar core. Ara utilizes RISC-V to support mixed-precision floating-point operations and
extends it further to support the SIMT (Single Instruction Multiple Thread) execution
paradigm. The microarchitecture of Ara consists of a main sequencer, a slide unit, a
vector load/store unit, and a variable number of identical lanes. Each lane is provided
with a local register file, operand queues, and execution units[13][14][15].
The integration of the FPGAs in the data center with the GPUs is a challenging task,
due to the need for manual assignment of tasks between FPGAs and CPUs. In addition,
the synchronization between multiple tasks executed on both sides needs a higher-level
supervisor system. CFUSE (Cluster Front-end USEr framework) is one of the frameworks
that try to solve this problem by employing different techniques to abstract the resources
within a heterogeneous system:
1. Using High-Level Synthesis (HLS) tools to translate high-level programming lan-
guages, such as OpenCL, into hardware description languages. This approach makes programming FPGAs much easier for the average programmer.
2. Using Dynamic task allocation and migration strategies to divide the workload
between FPGAs and host CPUs.
The OpenCL program can be compiled on the common host processor, and the resulting binaries are then transferred to the target architectures using Direct Memory Access (DMA) engines, which ensures that the data is transferred in a non-blocking manner [17][18][19].
The process of developing applications that target an FPGA is not trivial; it requires in-depth knowledge of the underlying FPGA hardware. To facilitate the development of these applications, an abstraction layer over the FPGA resources and operating system (OS) details is needed. One of the open-source libraries that provides this abstraction is OPAE (the Open Programmable Acceleration Engine).
OPAE is a lightweight open-source C-library that offers a flexible computing environment.
The Intel FPGA software stack is used to build the OPAE library. OPAE abstracts the details of the FPGA hardware and the operating-system specifics, so the programmer can develop software applications without being concerned with low-level
details. The library enables software programs to access FPGA device features, such as
pre-configured acceleration logic and functions for managing and reconfiguring the device.
By using OPAE, user applications can use FPGA-based acceleration transparently, as shown in Figure 1.8. Intel also provides users with a simulation environment for the Accelerator Functional Unit (AFU), which enables the user to access Intel's hardware resources and perform cycle-level simulations [20].
Chapter 2
GPU Architecture
Figure 2.1 shows an abstract view of the GPGPU architecture. It consists mainly of multiple cores with a shared L2 cache; each core is a five-stage pipeline with fetch, decode, issue, execute, and write-back stages. The GPGPU also comes with an extended Convolution Unit designed to accelerate the convolution operation. The upcoming sections describe the details of each subsystem in the core.
Instruction   Operands
FMAdd         a, b, c
IMAdd         a, b, c, shift
MatMul        a, b, c
Interp        dx, dy, a, b, c
Tex           u, v, lod
Table 2.1: Some GPU instructions that require more than 3 operands.
Input merging: The idea of the input-merging approach, as shown in Figure 2.4, is to combine two or more inputs into one source register, which reduces the overall number of registers needed to encode all the inputs. This can be done by using a helper function that acts as a wrapper around the original instruction and merges its input arguments. This approach is useful for instructions whose operands are guaranteed to lie in a certain small range, so that no data is lost when the arguments are merged together (a short C sketch of this packing is given after Figure 2.5 below).
Function bits overloading: In some cases the input arguments are too large to be merged. An alternative approach is to overload the function bits to act as storage for additional source registers besides their original function, as shown in Figure 2.5. Using this approach, it is possible to encode two additional registers in the R-type format and one additional register in the R4-type format. However, there are two important drawbacks to extending the number of
operands. Firstly, it reduces the total number of instructions that can be allocated
for the given custom operand. This limitation arises from the need to allocate bits
for storing the additional registers, which reduces the available space for encod-
ing instructions. Secondly, increasing the number of source operands introduces
complexity to the backend of the pipeline. This is because reading an additional
register requires modifications to the existing register file and adds complexity to
the backend processing stages.
Control Status Registers: The last option for extending the number of source
operands is to split the execution of the instruction into two parts and pass the
additional operands via CSRs (Control and Status Registers) before invoking the
instruction. However, this solution is considered the least efficient for two main
reasons. Firstly, executing CSR instructions results in pipeline stalls, which can
significantly impact performance. The pipeline needs to pause and wait for the CSR
instructions to complete before proceeding with the next instructions. Secondly,
this approach adds an additional instruction to the pipeline for each invocation of
the extension. This increases the instruction count and can negatively affect the
instructions per cycle (IPC) metric, which measures the efficiency of instruction
execution.
Figure 2.5: Overloading of function bits to extend the number of source operands.
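A minimal C sketch of the input-merging idea referred to above, assuming two operands that each fit in 16 bits; the packing layout and helper names are hypothetical and only illustrate why merging is lossless when the operands stay within a known small range.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: pack two 16-bit operands into one 32-bit
   source register so a custom instruction needs fewer operand fields. */
static uint32_t merge_operands(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* The consumer (plain C standing in for the custom instruction) unpacks
   the two fields again; no information is lost because each operand was
   guaranteed to fit in 16 bits. */
static void split_operands(uint32_t merged, uint16_t *lo, uint16_t *hi)
{
    *lo = (uint16_t)(merged & 0xFFFFu);
    *hi = (uint16_t)(merged >> 16);
}

int main(void)
{
    uint16_t a, b;
    uint32_t rs1 = merge_operands(0x1234, 0xBEEF);
    split_operands(rs1, &a, &b);
    printf("a = 0x%04X, b = 0x%04X\n", a, b);
    return 0;
}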
A set of warp masks that contain the status of each warp, from which the warp scheduler can determine the next ready warp.
A wavefront table that contains private data about each warp such as thread mask
or the next PC to be fetched from the Instruction Cache.
These warp masks are bit vectors that contain important information about the state of each warp; based on this information, the warp scheduler decides which ready warp to schedule next, as shown in Figure 2.8 (a small sketch of this selection logic is given after the figure caption). The three main masks are as follows:
Active mask: It contains status bits that indicate whether a certain warp is active or not.
Stalled mask: It contains bits that mark the warps that cannot be scheduled right now because they are waiting for certain events such as memory requests.
Barrier mask: It contains bits that indicate whether a certain warp is waiting for a barrier to be released or not.
Figure 2.8: The process of selecting the next warp depending on different masks.
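The following is a minimal C sketch of this selection logic, assuming a 32-warp core and a pick-the-lowest-ready policy; both assumptions are made only for illustration and do not describe the exact RTL.

#include <stdint.h>
#include <stdio.h>

/* A warp is ready when it is active, not stalled, and not waiting
   on a barrier. Bit i of each mask describes warp i. */
static int next_ready_warp(uint32_t active, uint32_t stalled, uint32_t barrier)
{
    uint32_t ready = active & ~stalled & ~barrier;
    if (ready == 0)
        return -1;                       /* no warp can be scheduled      */
    for (int w = 0; w < 32; w++)         /* pick the lowest-numbered warp */
        if (ready & (1u << w))
            return w;
    return -1;
}

int main(void)
{
    /* warps 0..3 active, warp 0 stalled on memory, warp 1 at a barrier */
    int w = next_ready_warp(0x0000000Fu, 0x00000001u, 0x00000002u);
    printf("next warp: %d\n", w);        /* prints 2 */
    return 0;
}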
In cases where all the threads agree on one path, the split instruction does not affect the state of the current warp; in other words, it acts as a nop instruction.
When both the taken and not-taken masks are non-zero, the following sequence occurs to handle the branch divergence:
– Pushing the current thread mask, alongside the original PC, onto the IPDOM stack.
– Pushing the alternate thread mask (the threads that take the other path), alongside the split PC, onto the IPDOM stack.
When a join instruction is executed, entries are popped from the IPDOM stack, with two scenarios (a small sketch of this stack handling follows the list):
– The predicate value for this entry is not taken so the thread mask is the same
as the original one and the thread continues executing from the original PC.
– The predicate value for this entry is taken, so the thread mask is equal to the alternate thread mask and the thread continues executing from the split PC.
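The C sketch below mirrors the split/join bookkeeping described above; the stack depth, the field names, and the strictly LIFO pop order are simplifying assumptions for illustration rather than the actual IPDOM hardware.

#include <stdint.h>

#define IPDOM_DEPTH 16                    /* assumed maximum nesting depth */

typedef struct {
    uint32_t thread_mask;                 /* threads to run on this path   */
    uint32_t pc;                          /* where those threads continue  */
} ipdom_entry_t;

static ipdom_entry_t ipdom_stack[IPDOM_DEPTH];
static int ipdom_top = 0;

/* split: if both paths have live threads, save the original mask/PC and
   the alternate mask/PC, then keep executing the taken path. Returns the
   mask the warp should continue with. */
uint32_t do_split(uint32_t cur_mask, uint32_t taken_mask,
                  uint32_t not_taken_mask, uint32_t orig_pc, uint32_t split_pc)
{
    if (taken_mask == 0 || not_taken_mask == 0)
        return cur_mask;                  /* no divergence: acts as a nop  */
    ipdom_stack[ipdom_top++] = (ipdom_entry_t){cur_mask, orig_pc};
    ipdom_stack[ipdom_top++] = (ipdom_entry_t){not_taken_mask, split_pc};
    return taken_mask;
}

/* join: pop one saved entry and resume it (first the alternate path,
   later the original mask at the reconvergence point). */
void do_join(uint32_t *mask, uint32_t *pc)
{
    ipdom_entry_t e = ipdom_stack[--ipdom_top];
    *mask = e.thread_mask;
    *pc   = e.pc;
}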
In GPU programs sometimes data dependencies between different warps may exist,
whether they operate on shared data resources or their results depend on each other.
GPUs always use a memory hierarchy with multiple cache levels; therefore, synchronization is needed to ensure memory consistency, which means that the results of
operations from different warps are committed in the correct order, preventing in-
consistent or unexpected results.
As discussed before, threads in the same warp may diverge onto different paths. Synchronization points are required, at which the diverged threads wait for each other so that they can be collected and continue executing together.
Synchronization is useful for managing the shared resources and avoiding warp con-
flicts.
In this project, synchronization is provided by using what are known as warp barriers. A warp barrier is a synchronization point placed at a certain location in a program; when any warp passes through this point, it is halted until the required number of warps reaches this point and they are synchronized. A table is kept alongside the warp scheduler, where each entry contains private information about one barrier, such as (a small sketch of such an entry follows this list):
Validity bit: This bit indicates whether a certain barrier is valid or not.
Barrier ID: Each barrier has its unique ID to identify it and the MSB of the ID
indicates whether the scope of the barrier is local or global.
Number of left warps: The number of remaining warps that still need to reach the barrier before it is released.
Stalled Warps Mask: This vector mask indicates which warps are currently wait-
ing for the current barrier to be released.
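A sketch of such a barrier-table entry and its release check; the field widths, names, and the 32-warp mask are assumptions made only for illustration.

#include <stdint.h>
#include <stdbool.h>

/* One entry of the (assumed) barrier table kept next to the warp scheduler. */
typedef struct {
    bool     valid;          /* validity bit: is this barrier in use?         */
    uint8_t  barrier_id;     /* unique ID; MSB distinguishes local/global     */
    uint32_t warps_left;     /* warps that still have to reach the barrier    */
    uint32_t stalled_warps;  /* mask of warps currently waiting on it         */
} barrier_entry_t;

/* Called when warp `warp_id` executes the bar instruction for this entry.
   Returns the mask of warps to wake up (non-zero only on release). */
uint32_t barrier_arrive(barrier_entry_t *b, unsigned warp_id)
{
    b->stalled_warps |= 1u << warp_id;   /* this warp now waits              */
    if (--b->warps_left != 0)
        return 0;                        /* barrier not released yet         */
    uint32_t wake = b->stalled_warps;    /* release: wake all stalled warps  */
    b->valid = false;
    b->stalled_warps = 0;
    return wake;
}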
Wspawn Instruction: This instruction is used mainly for warp control; it activates a certain number of warps at a given PC, making multiple instances of a certain program execute independently. This is useful for distributing the workload of a program among multiple warps, which increases the instructions-per-cycle (IPC) count. It takes the form of
wspawn Rs1 Rs2
where Rs1 contains the number of warps to spawn, while Rs2 contains the program counter at which to spawn the warps.
tmc Rs1
where Rs1 contains the new thread mask to be applied. The GPU unit simply sets the thread mask to the data in Rs1.
Split and Join Instructions: These instructions are used to handle the control
divergence problem. SPLIT controls which threads to activate when a branch occurs
using a “taken” mask and a “not taken” mask. It also generates a “diverged” signal
that indicates whether the control flow diverged, and then it pushes the thread
mask and branch prediction results to a special hardware structure called hardware-
immediate post dominator (IPDOM) stack. It takes the form of
split Rs1
where Rs1 indicates whether the previous branch was taken or not. On the other
hand, Join instruction is used to pop the previous information from the IPDOM
stack during re-convergence.
where synchronization between warps is required; the condition to release the barrier is that a certain number of warps reaches the barrier location. It takes the
form of
bar Rs1 Rs2
where Rs1 contains the ID of the barrier, while Rs2 contains the number of warps that must reach the barrier before it is released.
Valid: this signal determines whether the instruction is valid to be executed or not.
Next Program Counter (PC): the next program counter to be fetched after
executing the current instruction.
EX Unit: the type of processing unit used to execute the current instruction, as shown in Table 2.2.
Rs3: the third source register operand, used when decoding instructions in the R4 instruction format, such as the floating-point multiply-and-add instruction.
Rd: the destination register in the register file, used to store the final result data.
Read after write (RAW): The read-after-write data hazard, also referred to as
a true data dependency, happens when an instruction tries to read a source register
while there is an ongoing write operation to the same destination register. This will
cause the instruction that tries to read this register to obtain the wrong data.
Write after read (WAR): The write-after-read data hazard, also referred to as an anti-dependency, happens when an instruction reads a register and a later instruction in the pipeline writes to the same register. This kind of dependency can cause a problem in concurrent systems like GPUs when the two instructions are executed at the same time.
Write after Write (WAW): The write-after-write data hazard, also referred to as an output dependency, happens when an instruction writes to a register and a later instruction in the pipeline writes to the same register (the code fragment after this list illustrates the three cases).
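The fragment below illustrates the three hazard types with C variables standing in for registers; the register names and values are placeholders chosen only for the example.

/* Illustration of the three data-hazard types using C variables
   standing in for registers (r1, r2, ...). */
void hazards_example(void)
{
    int r1, r2, r3, r4;

    r1 = 10;        /* (a) writes r1                                          */
    r2 = r1 + 1;    /* (b) reads r1 written by (a)   -> RAW (true dependency)  */
    r3 = r2 + r1;   /* (c) reads r1 and r2                                     */
    r1 = 99;        /* (d) writes r1 read by (c)     -> WAR (anti-dependency)  */
    r1 = r3 * 2;    /* (e) writes r1 again after (d) -> WAW (output dependency) */
    r4 = r1;        /* use the final value so the writes are not dead code     */
    (void)r4;
}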
Instruction Meta Data: This table stores metadata associated with each instruction, such as:
– Operation: the type of operation that this instruction performs on its source operands (ADD, SUB, MUL, ...).
– Functional Unit: the type of unit that will execute the instruction (ALU, FPU, LSU, ...).
– Program Counter (PC): the PC of the issued instruction.
– Thread Mask: the status of the active threads.
unit is equipped with reservation buffers to store instructions awaiting their operands.
When an instruction conflicts with another in the pipeline, the scoreboard unit pushes
the instruction to the reservation station along with its metadata. To release instructions
from the reservation buffer, all of their operands must not be in use by other instruc-
tions. If this condition is met and there are sufficient resources, the ready signal’s tag is
broadcasted to extract the associated instruction’s metadata from the reservation buffer.
These operands are then forwarded to the dispatch unit to be directed to the appropriate
functional element within the execution stage of the pipeline.
which results in a total of 96k required bits. This huge amount of storage would consume a lot of the LUTs and FFs present in the FPGA, which cannot be afforded, as the LUTs and FFs are needed to implement the other processing elements of the GPU. An alternative approach to implementing the register file is to use on-chip block RAMs (BRAMs) instead of LUTs; a single BRAM36E1 block can hold up to 1024 registers. Figure 2.21 shows how the BRAMs can be utilized to build a register file: the register file is divided into several banks, each bank serves a certain warp and contains several BRAMs to provide each thread with the required data values in parallel.
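A sketch of how a (warp, thread, register) triple could be mapped onto this banked BRAM layout; the warp, thread, and register counts below are assumed values for illustration and do not necessarily match the implemented configuration.

#include <stdint.h>
#include <stdio.h>

/* Assumed configuration, for illustration only. */
#define NUM_WARPS        4   /* one register-file bank per warp (assumed)   */
#define THREADS_PER_WARP 8   /* one BRAM slice per thread within a bank     */
#define REGS_PER_THREAD 32   /* RISC-V integer register count               */

/* Map (warp, thread, register) to a bank index and an address inside the
   per-thread BRAM, so that all threads of a warp can be read in parallel. */
void regfile_map(unsigned warp, unsigned thread, unsigned reg,
                 unsigned *bank, unsigned *bram, unsigned *addr)
{
    *bank = warp;     /* each bank serves one warp                 */
    *bram = thread;   /* each BRAM inside the bank serves one lane */
    *addr = reg;      /* register index selects the BRAM row       */
}

int main(void)
{
    unsigned bank, bram, addr;
    regfile_map(2, 5, 17, &bank, &bram, &addr);
    printf("bank %u, BRAM %u, address %u\n", bank, bram, addr);
    return 0;
}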
Request Interface: This module is mainly an elastic buffer that receives ALU
input signals such as clk, reset, PC, and data operands signals, then forwards these
signals to the control unit to take certain actions.
Control Unit: This module has the crucial role of interpreting the input signals
and determining the operations to be executed on the operand data. The flowchart
depicted in Figure 2.23 illustrates the process of decoding these signals. The control
unit plays a key role in making decisions based on the decoded signals. First, the
control unit determines whether the operation is a branch operation or an arithmetic
operation. If it is an arithmetic operation, the control unit further determines
whether it is a signed or unsigned operation. On the other hand, if the operation is
a branch, the control unit faces the task of deciding whether it is a static branch, such
as the JAL or JALR instructions, which are always taken, or if it is a conditional
branch that may or may not be taken. These decisions made by the control unit
guide the subsequent execution of the instructions and ensure the correct operation
of the module.
Arithmetic Unit: This unit performs the actual arithmetic and logical operations of the ALU based on the input signals. It also contains a multiplier and a divider unit so that it can perform complex tasks such as image scaling or rendering applications that require a lot of multiplication and division operations.
Branch Unit: This unit is responsible for resolving the different static and dynamic branches and determines whether these branches will be taken or not. It also contains a dynamic branch-prediction module to reduce the number of clock cycles the processing unit must wait before the branch destination is decided.
Commit Unit: This module is responsible for writing back the output signals of
the ALU unit to the register file and forwarding the new PC to the Fetch engine in
case of a valid branch instruction.
algorithm involves repeatedly subtracting the divisor from the dividend and restoring the partial remainder if the result is negative. To keep track of the sign of the partial remainder, the algorithm uses an extra sign bit that is adjusted every clock cycle. The restoring algorithm calculates the quotient bit by bit, which means that a division of two 32-bit numbers takes 32 clock cycles to finish.
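A behavioural C model of the restoring division algorithm described above; each loop iteration corresponds to one clock cycle of the hardware divider, but the code is an illustrative sketch rather than the RTL.

#include <stdint.h>
#include <stdio.h>

/* Restoring division of two unsigned 32-bit numbers: one quotient bit is
   produced per iteration, so a 32-bit divide takes 32 steps (clock cycles). */
void restoring_divide(uint32_t dividend, uint32_t divisor,
                      uint32_t *quotient, uint32_t *remainder)
{
    uint64_t rem = 0;        /* partial remainder (extra room for the sign)  */
    uint32_t quo = 0;

    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((dividend >> i) & 1u);  /* shift in next bit     */
        if (rem >= divisor) {                       /* subtraction succeeds  */
            rem -= divisor;
            quo |= 1u << i;                         /* quotient bit = 1      */
        }                                           /* else: restore, bit=0  */
    }
    *quotient  = quo;
    *remainder = (uint32_t)rem;
}

int main(void)
{
    uint32_t q, r;
    restoring_divide(100, 7, &q, &r);
    printf("100 / 7 = %u remainder %u\n", q, r);    /* 14 remainder 2 */
    return 0;
}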
This module within the ALU unit has the task of handling and evaluating various
branches that are executed by the processing unit. One of the major challenges in execut-
ing branch instructions lies in dealing with conditional instructions that are not always
taken consistently. As a result, the pipeline faces a stall, where it is temporarily halted
until a decision is made regarding whether to fetch the next instruction at the current
PC plus 4 or to jump to a new destination generated by the branch instruction. This
decision-making process is crucial for maintaining the correct flow of instructions and
ensuring the smooth operation of the module. In a complex system like a GPU, where more than one instruction is fetched every clock cycle and pipelines are getting deeper over the years, the cost of halting the pipeline is very high, and branch stalls cannot be tolerated. One way to solve this problem is to try to predict the outcome of a branch before it is finally evaluated. This can be done in two ways: predicting the behavior of the branch either statically or dynamically. The static branch predictor
relies on profile information about the program's execution collected at compile time. This scheme is very effective in dealing with branches that have a statically bimodal distribution, which can be found in scientific-computing programs with a lot of large loops, where the loop branch behaves the same way for 99% of the execution time. The major limitation of static branch prediction is that its accuracy depends on the nature of the program and the frequency of its branches; for this reason, modern processors usually deploy dynamic branch predictors. The most basic form of dynamic branch prediction is
using a branch history buffer. This buffer is a small, directly mapped cache memory that is indexed using the lower portion of the address of the branch instruction; each memory location holds a bit that indicates whether the branch was taken the previous time it was executed. With this buffer, it is not possible to determine with certainty whether the prediction is correct, since another branch with the same lower-order address bits could have placed the prediction there. In this case, if the prediction hint turns out to be wrong, the bit in this position is inverted. The 1-bit branch predictor has a main drawback: even for a branch that is almost always taken, it will mispredict twice every time the branch is not taken, once on the not-taken occurrence itself and once on the following taken one. To overcome this limitation, a 2-bit branch predictor is used; in this scheme, a prediction must be missed twice before the direction is changed.
Figure 2.27 shows the finite state machine (FSM) of the 2-bit branch predictor scheme.
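A behavioural C model of the 2-bit saturating-counter scheme of Figure 2.27; the buffer size and the particular state encoding (0 = strongly not taken up to 3 = strongly taken) are assumed conventions for illustration.

#include <stdint.h>
#include <stdbool.h>

/* 2-bit saturating counter per branch-history-buffer entry:
   0 = strongly not taken, 1 = weakly not taken,
   2 = weakly taken,       3 = strongly taken.
   The prediction only flips after two consecutive mispredictions. */
#define BHB_ENTRIES 256                    /* assumed buffer size              */
static uint8_t bhb[BHB_ENTRIES];           /* counters, indexed by low PC bits */

bool predict_taken(uint32_t pc)
{
    return bhb[(pc >> 2) % BHB_ENTRIES] >= 2;
}

void update_predictor(uint32_t pc, bool taken)
{
    uint8_t *c = &bhb[(pc >> 2) % BHB_ENTRIES];
    if (taken) {
        if (*c < 3) (*c)++;                /* saturate at strongly taken       */
    } else {
        if (*c > 0) (*c)--;                /* saturate at strongly not taken   */
    }
}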
Figure 2.28: The multiplication operation on the ALU with 4 active threads.
Figure 2.29: The division operation on the ALU with 4 active threads.
Figure 2.30: The remainder operation on the ALU with 4 active threads.
Figure 2.31: The Branch Greater than or Equal (BGE) instruction operation where all threads within the warp are active.
operations.
The main advantage of floating-point representation is its wider dynamic range, which
is the range of numbers that can be represented. Floating-point numbers achieve this at
the expense of slightly less precision and slightly more storage space. This is because
floating-point numbers encode the position of the decimal point, which allows them to
represent a much wider range of values than fixed-point numbers. Fixed-point numbers
have limited or no flexibility, while floating-point numbers have greater flexibility. This
means that floating-point numbers can be used to represent a wider range of values and to
perform a wider range of operations. Floating-point hardware is also more accurate than
fixed-point hardware. This means that floating-point calculations are less likely to produce
rounding errors. Another factor that affects the choice between fixed and floating-point
formats is the precision – the size of the gaps between numbers. A Digital signal processor
(DSP) has to round every new number that it produces by a mathematical calculation
to the closest value that can be stored by the format that it uses. This rounding and/or
truncating of numbers causes quantization error or ‘noise’ - the difference between the
actual analog values and the quantized digital values. Since fixed-point processing has
larger gaps between adjacent numbers than floating-point processing, the round-off error
can be more significant. Therefore, floating-point processing offers much higher precision
than fixed-point processing, making floating-point processors the preferred DSP when
computational accuracy is essential.
The floating-point format has different applications depending on the data set require-
ments of video and audio applications. Floating Point units are useful for high-speed ob-
ject recognition systems, high-performance computer systems, embedded systems, and
mobile applications. For example, in medical image recognition, higher accuracy al-
lows the processing of various types of signal input from light, x-rays, ultrasound, and
other sources to produce output images with valuable diagnostic information. On the
other hand, floating-point devices are more suitable for the huge communications mar-
ket. FPUs can perform specialized trigonometric calculations that are widely used in
real-time applications such as motor control, power management, and communications
data management. The graphics processing units (GPUs) today mostly use IEEE 754-
compatible single-precision 32-bit floating-point operations for arithmetic operations in
the programmable processor cores, but some newer GPUs like the Tesla T10P also support
IEEE 754 64-bit double-precision operations in hardware.
The current floating-point core in this project supports IEEE 754 32-bit single precision
format. It also supports four rounding modes:
These rounding modes are based on the RISC-V floating-point extension. The floating-point core also supports the following operations:
Multiplication
Division
Subtraction
Comparison
This section will explore the IEEE-754 floating point format in more detail, following
the comparison between floating point and fixed point representations in the previous
section. The IEEE 754 standard is a way of representing real numbers using a sequence
of digits. It does this by mapping the infinite range of real numbers to a finite subset with
limited precision. A floating-point number can be characterized by the following [25]:
Sign: a single bit that determines the polarity of the number, whether positive (+) or negative (-).
Exponent: the exponent range represents the range of possible powers of the radix, spanning from the minimum to the maximum value.
Radix: the base number used for scaling, typically either two (binary) or ten (decimal).
A w-bit biased exponent (E), which is obtained by adding a bias to the actual
exponent (e).
To convert a real number to the binary interchange representation, the following steps are performed. For example, to convert 85.125 to the binary interchange format:
Separate the whole part and the fractional part: in this case, whole = 85 and fraction = 0.125.
Convert each part to binary: 85 = 1010101 and 0.125 = 0.001.
Combine the two parts of the number as the final result: final value = 1010101.001.
Normalize the result: 1010101.001 → 1.010101001 × 2^6
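The same conversion can be cross-checked in C by inspecting the bit pattern stored for 85.125 in single precision (sign 0, biased exponent 133, mantissa 010101001 followed by zeros); this is only a verification aid, not part of the design.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 85.125f;                 /* = 1.010101001 x 2^6 in binary      */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* reinterpret the IEEE 754 encoding  */

    uint32_t sign     = bits >> 31;            /* 0                          */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 133 = 6 + bias(127)        */
    uint32_t mantissa = bits & 0x7FFFFF;       /* 010101001000... (23 bits)  */

    printf("raw      = 0x%08X\n", bits);       /* 0x42AA4000                 */
    printf("sign     = %u\n", sign);
    printf("exponent = %u (e = %d)\n", exponent, (int)exponent - 127);
    printf("mantissa = 0x%06X\n", mantissa);
    return 0;
}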
The sizes of the exponent and significand bits enable RISC-V to represent a wide range
of numbers, from very small to very large. This allows the GPU to perform real-world
computations, such as those used in scientific computing, engineering, and graphics.
2.4.2.2.3 Special Cases and Exceptions Although IEEE 754 gives us the ability
to represent a wide range of numbers, some results of floating-point operations may be
too large or too small to be represented in the exponent field. The RISC-V ISA uses a
floating-point control and status register (FCSR) to deal with this problem. The least
significant five bits of the FCSR are used as exception flags, which are triggered when one
of the cases in Table 2.5 occurs.
XORing the sign bits of the two numbers determines the sign of the overall product.
Suppose there are two numbers A and B in the normalized floating-point format, and assume for simplicity that the mantissa is only 4 bits wide while the hidden bit is still present: A = 0 10000100 0100 = 40 and B = 1 10000001 1110 = -7.5. The multiplication algorithm is as follows:
1. The significands of the two numbers, M1 and M2, are multiplied.
4. The produced exponent is not the true one, because E1 = e1 + bias and E2 = e2 + bias, so the produced exponent equals Et = e1 + e2 + 2·bias; to avoid adding the bias twice, one bias must be subtracted from the produced exponent.
S = S1 xor S2 = 1
7. The mantissa is wider than 4 bits, so rounding by truncation is performed and the hidden bit is removed:
1 10000111 0010
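The steps above can be mirrored by a small C model that multiplies two single-precision values field by field; it handles normalized inputs only and rounds by truncation, in line with the simplified worked example, and it is not the project's actual FPU.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified single-precision multiply for normalized inputs only:
   multiply significands, add exponents and subtract one bias, XOR signs,
   normalize, and round by truncation (no subnormal, NaN, or overflow
   handling). */
static float fmul_model(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, 4);
    memcpy(&ub, &b, 4);

    uint32_t sign = (ua ^ ub) & 0x80000000u;
    int32_t  ea   = (int32_t)((ua >> 23) & 0xFF);
    int32_t  eb   = (int32_t)((ub >> 23) & 0xFF);
    uint64_t ma   = (ua & 0x7FFFFFu) | 0x800000u;   /* restore hidden bit   */
    uint64_t mb   = (ub & 0x7FFFFFu) | 0x800000u;

    uint64_t prod = ma * mb;              /* 48-bit significand product     */
    int32_t  e    = ea + eb - 127;        /* remove the doubled bias        */

    if (prod & (1ull << 47)) {            /* product in [2,4): normalize    */
        prod >>= 1;
        e++;
    }
    uint32_t mant = (uint32_t)((prod >> 23) & 0x7FFFFFu); /* drop hidden bit, truncate */

    uint32_t ur = sign | ((uint32_t)e << 23) | mant;
    float r;
    memcpy(&r, &ur, 4);
    return r;
}

int main(void)
{
    /* Exact result is -300; the 4-bit worked example above loses precision. */
    printf("%g\n", fmul_model(40.0f, -7.5f));
    return 0;
}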
2.4.2.5.1 Round Towards Zero One of the simplest rounding modes that IEEE-754 supports is rounding toward zero. This mode is easy to implement in hardware, but it has some special cases that require careful handling. For example, IEEE-754 defines a special case called NaN, which stands for Not a Number. A floating-point number is NaN if its exponent bits are all ones and its mantissa is not zero, as shown in Table 2.6. A problem that occurs when rounding NaN values is that the mantissa of the NaN may have its non-zero bits only at the least significant positions. Truncating these bits may change the NaN into a different representation in which the exponent bits are all ones and the mantissa is zero; this representation corresponds to infinity according to the IEEE-754 standard. The IEEE-754 standard addresses this problem by differentiating between two types of NaN values, signaling NaN and quiet NaN, as shown in Table 2.6. A signaling NaN is a NaN whose most significant mantissa bit is equal to zero; when a signaling NaN is detected, an exception is raised to prevent future errors that may occur due to truncation of the least significant bits. Round towards zero is very useful when dealing with subnormal numbers. A subnormal number is a number so small that the exponent bits are not enough to represent it, so if the last mantissa bits of a subnormal number are truncated, its value may become zero. This is acceptable because the non-zero part of the number is too tiny to fit in the finite number of bits used to represent floating-point numbers.
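A small C helper illustrating the encoding rules just described (exponent all ones, mantissa zero or non-zero, and the mantissa MSB distinguishing quiet from signaling NaN); this is an illustrative sketch, not part of the FPU:

```c
#include <stdint.h>
#include <stdio.h>

/* Classify a single-precision bit pattern according to the rules above. */
static const char *classify(uint32_t bits) {
    uint32_t exp = (bits >> 23) & 0xFF;
    uint32_t man = bits & 0x7FFFFF;
    if (exp != 0xFF)      return "normal/subnormal/zero";
    if (man == 0)         return "infinity";
    if (man & 0x400000)   return "quiet NaN";      /* mantissa MSB = 1 */
    return "signaling NaN";                        /* mantissa MSB = 0 */
}

int main(void) {
    printf("%s\n", classify(0x7F800000));  /* +infinity     */
    printf("%s\n", classify(0x7FC00000));  /* quiet NaN     */
    printf("%s\n", classify(0x7F800001));  /* signaling NaN */
    return 0;
}
```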
2.4.2.5.2 Round Towards Nearest Rounding towards zero is a simple and efficient rounding mode to implement in hardware. However, it can cause large errors in computations because it simply discards the least significant bits. For example, rounding the decimal number 7.9 towards zero gives 7 instead of 8, which is a large difference, and such errors in floating-point computations can accumulate over time. Another mode is to round the number to the closest representable value, either the next higher or lower value. The rule for rounding to the closest value depends on the sign of the number: for positive numbers, rounding up means rounding towards positive infinity, while for negative numbers, rounding up means rounding towards zero. The rules of rounding are shown in Table 2.7.
1. Determining the rounding bit: The rounding bit is the bit position just before the least significant bit (LSB); this bit is retained after rounding. Its position depends on the precision of the floating-point format being used.
2. Obtaining the guard bit: The guard bit is the bit located just before the round bit. It is set to 1 if there are any non-zero bits beyond the rounding point, which would cause a loss of data during rounding; otherwise, it is set to 0.
3. Determining the sticky bit: The sticky bit indicates whether there are any non-zero bits after the rounding point. It "sticks" to the result and affects the rounding decision when the number is exactly in the middle of two representable values.
4. The rounding decision is made according to the values of the round and sticky bits, as shown in Table 2.8.
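As a minimal sketch of the decision in step 4, the snippet below applies the commonly used round-to-nearest-even convention, where "round" is the first discarded bit and "sticky" is the OR of the remaining discarded bits; the exact bit naming and decision table used by the FPU may differ from this sketch:

```c
#include <stdint.h>
#include <stdio.h>

/* Round-to-nearest-even decision: 'kept' is the retained mantissa value,
 * 'lsb' its least significant bit, 'round' the first discarded bit and
 * 'sticky' the OR of all remaining discarded bits. */
static uint32_t round_nearest_even(uint32_t kept, int lsb, int round, int sticky) {
    if (round == 0)  return kept;          /* below the half-way point   */
    if (sticky == 1) return kept + 1;      /* above the half-way point   */
    return (lsb == 1) ? kept + 1 : kept;   /* exact tie: round to even   */
}

int main(void) {
    /* Keep 4 fraction bits of 1.00101101: kept = 0b10010, round = 1, sticky = 1. */
    printf("%#x\n", (unsigned)round_nearest_even(0x12, 0, 1, 1));  /* -> 0x13 */
    return 0;
}
```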
Both the input and output ports of the FPU have a handshake interface to synchronize the flow of data into and out of the FPU. This handshake is similar to the one used in common bus protocols such as AXI:
The valid signal indicates that the data on the corresponding interface is valid and
stable. Once valid is asserted, it must not be deasserted until the handshake is
complete.
The ready signal indicates that the interface is capable of processing data on the
following rising clock edge. Once valid and ready are asserted during a rising clock
edge, the transaction is complete.
After a completed transaction, valid may remain asserted to provide new data for
the next transfer.
The protocol direction is top-down. This means that ready may depend on valid,
but valid must not depend on ready.
One of the most important pieces of metadata is the operation tag field. Operation tags are additional information attached to an operation; they can be used to track where the results of the operation came from. Operation tags move through the FPU unchanged and are always associated with the operation they were created with.
2.4.2.6.1 Fused Multiply Add Unit The fused multiply-add (FMA) unit is one
of the building blocks of the FPU core. This design choice of using a single module to
perform addition and multiplication operations has several advantages:
FMA modules can perform multiplication and addition operations in a single in-
struction, which can significantly reduce rounding errors. This can result in im-
proved numerical accuracy, especially for iterative computations or when dealing
with large numbers.
Combining multiplication and addition into a single step can reduce the instruction
count, which can improve performance by reducing latency and cache miss rates.
The RTL of the FMA is shown in Figure 2.52. This design has a long critical path, which can limit the maximum frequency at which the FMA can operate. The solution to this problem is to pipeline the critical path, which increases the maximum frequency of the system from 66 MHz to 160 MHz at the cost of increased latency.
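The accuracy benefit of fusing the multiply and the add can be illustrated in software with the C library's fma(), which rounds only once; the operand values below are chosen purely for illustration and are unrelated to the FPU test vectors:

```c
#include <math.h>
#include <stdio.h>

/* a*b + c with one rounding (fma) versus two roundings (separate ops). */
int main(void) {
    double a = 1.0 + 0x1p-27;       /* values chosen so a*b loses low bits */
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    double separate = a * b + c;    /* two roundings: result is 0.0        */
    double fused    = fma(a, b, c); /* one rounding: keeps -2^-54 exactly  */

    printf("separate = %.20e\n", separate);
    printf("fused    = %.20e\n", fused);
    return 0;
}
```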
2.4.2.6.2 Verification The steps of testing the FPU core are as follows:
Figure 2.40: Python script to generate random numbers in IEEE-754 floating point format
Figure 2.42: The output produced from the FPU vs the actual output
Figure 2.45: The output produced from the FPU vs the actual output
Figure 2.48: The output produced from the FPU vs the actual output
Figure 2.50: RTL simulation of converting IEEE-754 numbers to the nearest integer
Input Approximation
75.11204945036411 75
300.54744821509166 301
766.264255313085 766
878.5626430788918 879
-461.02451392541013 -461
-935.433141987377 -935
-187.70898012719522 -188
780.8973092710785 781
-209.06777986403586 -209
-483.19200494111806 -483
475.7076863444606 476
-157.2947256582371 -157
-815.9343442137628 -816
-274.09916577479737 -274
39.90017099455781 40
-720.6216768138403 -721
-869.4262660133307 -869
-108.0196437235594 -108
658.7183789078322 659
810.2743781094275 810
-420.84335455008386 -421
349.998328741402 350
-417.7595467500603 -418
-644.9503193988262 -645
-60.75731384969754 -61
103.42021755982046 103
-121.12395568020906 -121
-836.2065186216659 -836
-288.78110431056166 -289
998.6286837046068 999
826.4287099445658 826
483.594587151362 484
Table 2.10: Rounding of different floating point numbers to the nearest integer.
The core initiates a memory transaction by pulling the req valid signal to high.
As soon as req valid is asserted, the LSU controller sends the address to be accessed to the main memory, alongside the request data to be written in the case of a store instruction.
The memory controller responds by pulling the rsp valid signal high for one clock cycle and sends the requested data to the core.
Software Prefetch: The prefetch is typically done by using a compiler that stati-
cally analyzes the code before the execution and inserts additional prefetch instruc-
tions in certain program locations.
The current LSU implemented in this project supports only software prefetching, so if a compiler or profiler can analyze the code and determine suitable locations for prefetch instructions, it can make use of the prefetch instructions supported by the GPU ISA. Adding a hardware prefetch unit to the GPU cores will be considered in the future.
Pooling layers
The convolution layer plays a vital role in extracting image features using small kernels
with adjustable parameters. The behavior of these kernels is governed by various factors
like padding, stride, and kernel size. To meet the computational demands, multiple ker-
nels are utilized within each convolution layer. Figure 2.55 illustrates the 2D convolution
algorithm, revealing that it involves four nested loops. This emphasizes that convolution
is a computationally intensive task, and the traditional approach may not efficiently han-
dle larger image sizes. The convolution layer is the main bottleneck in accelerating the
execution of CNN as it consumes 90% of the execution time.
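For reference, a direct C implementation of the four-loop structure mentioned above is sketched below for a single input channel with unit stride and no padding; the array sizes are illustrative only:

```c
/* Direct 2D convolution with four nested loops: output rows, output
 * columns, kernel rows, kernel columns. */
#define IH 8
#define IW 8
#define KH 3
#define KW 3

void conv2d(const float in[IH][IW], const float k[KH][KW],
            float out[IH - KH + 1][IW - KW + 1]) {
    for (int oy = 0; oy < IH - KH + 1; oy++)          /* output rows    */
        for (int ox = 0; ox < IW - KW + 1; ox++) {    /* output columns */
            float acc = 0.0f;
            for (int ky = 0; ky < KH; ky++)           /* kernel rows    */
                for (int kx = 0; kx < KW; kx++)       /* kernel columns */
                    acc += in[oy + ky][ox + kx] * k[ky][kx];
            out[oy][ox] = acc;
        }
}

int main(void) {
    static float in[IH][IW], k[KH][KW], out[IH - KH + 1][IW - KW + 1];
    in[3][3] = 1.0f; k[1][1] = 2.0f;   /* single impulse -> out[2][2] == 2 */
    conv2d(in, k, out);
    return (out[2][2] == 2.0f) ? 0 : 1;
}
```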
2.4.5.1.2 Systolic Arrays One of the proposed architectures to accelerate the execution of the convolution operation is the systolic array. The systolic array is a 2D grid of processing elements (PEs); however, this approach has several drawbacks:
Special FIFO buffers are needed to store the image and kernel parameters in order to save memory bandwidth. These buffers require a special circuit to control the flow of data in and out of them, which consumes a lot of die area, even more than that needed by the computing structures.
Large kernels whose size exceeds the capacity and bandwidth of the systolic array increase the rate of memory accesses, which causes many communication bottlenecks.
Another significant problem arises when the dimension of the weight matrix is not evenly divisible by the MXU tile size. In such cases, the matrix needs to be divided into multiple tiles. However, when the last row and column of tiles are reached, not all the processing elements (PEs) are utilized, which results in inefficient utilization of the systolic array resources. Figure 2.57 illustrates this problem.
This project instead adopts a convolution-unit architecture that reduces the number of load/store instructions and offers more flexibility than the systolic array. The new convolution unit topology makes use of the newly added tmc instruction, which extends the RISC-V architecture to support the SIMT execution model, and of the banked register file, which can act as a buffer for the kernel and image values.
Figure 2.60: Reducing the number of commits relative to the normal architectures.
2.4.5.3 Results
Figure 2.61 compares the hardware resources utilized by the convolution unit as the
number of threads within the core varies.
Figure 2.61: The resource utilization of the convolution unit with different thread configurations.
2.5.1 Writeback
The writeback module receives requests from the various functional units; like other transactions in the GPU, the requests are received using a handshake of valid and ready signals. After the writeback module receives the commit requests alongside their data, it directs them to a stream arbiter that determines which source gets to commit its data. Although the execute unit issues one operation at a time, an arbiter is needed because different operations take different numbers of clock cycles to finish, so at any time the writeback stage may hold data from multiple functional units that need to be committed. After a functional unit is granted permission to commit its data, a writeback request is sent to the issue stage to update the register file.
2.5.2 Popcount
Popcount, short for "population count", is a module used to calculate the number of 1's in a binary vector. In the commit stage, this module is used to count the number of valid request signals from the different units; this count is then sent to the CSR unit, where it is written to a register used to monitor the instruction count.
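A software model of this counting operation is shown below; the vector width and the meaning of the bits are illustrative, while the RTL performs the same reduction in hardware:

```c
#include <stdint.h>
#include <stdio.h>

/* Count how many bits are set in a 32-bit vector of valid-request flags. */
static unsigned popcount32(uint32_t v) {
    unsigned count = 0;
    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

int main(void) {
    uint32_t valid_requests = 0xB;              /* 1011: three units requesting commit */
    printf("%u\n", popcount32(valid_requests)); /* prints 3 */
    return 0;
}
```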
Figure 2.66: Comparison of different storage devices in terms of cost per GB and performance
(IOPS) [30].
The processor fetches data from the memory hierarchy in a sequential manner. It first accesses the top element to determine whether the required data is present; if the data is found in the top-level cache, a hit occurs and the processor does not have to search the lower-level memories. If the data is not found in the top-level memory, a miss occurs and the processor searches for the required data in the next lower element, where the access may again result in a hit or a miss; if a miss occurs, the processor continues searching in the lower components. As the access goes further down the memory hierarchy, the likelihood of a hit increases until the bottom component of the hierarchy is reached, where a hit is guaranteed (assuming the desired data is present in the main memory). As an example of the effect of memory size on latency, assume a two-level memory hierarchy consisting of a single cache and the main memory. The total latency of the system can be expressed mathematically as
$$\mathrm{Latency}_{T} = P^{\mathit{hit}}_{\mathit{cache}} \cdot \mathrm{Latency}_{\mathit{cache}} + \left(1 - P^{\mathit{hit}}_{\mathit{cache}}\right) \cdot \mathrm{Latency}_{\mathit{main\ memory}}$$
where $0 \le P^{\mathit{hit}}_{\mathit{cache}} \le 1$, $\mathrm{Latency}_{\mathit{cache}} \ll \mathrm{Latency}_{\mathit{main\ memory}}$, and $P^{\mathit{hit}}_{\mathit{cache}}$ is the hit probability of the cache.
From the above expression it can be seen that the latency is dominated by the latency of the main memory, so to decrease the overall latency the cache must have a high hit rate; in the best case the hit rate is equal to 1, which makes the effective latency equal to the latency of the cache memory. The hit rate is determined by two factors:
The size of the cache.
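A quick numeric illustration of the latency expression above is sketched below; the cycle counts and hit rates are assumed values chosen for the example, not measurements of the implemented cache:

```c
#include <stdio.h>

/* Effective latency of a two-level hierarchy for several hit rates. */
int main(void) {
    const double lat_cache = 2.0;     /* cycles, assumed */
    const double lat_main  = 200.0;   /* cycles, assumed */
    const double hit_rates[] = {0.50, 0.90, 0.99, 1.00};

    for (int i = 0; i < 4; i++) {
        double p = hit_rates[i];
        double lat_total = p * lat_cache + (1.0 - p) * lat_main;
        printf("hit rate %.2f -> effective latency %6.2f cycles\n", p, lat_total);
    }
    return 0;
}
```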
Figure 2.69: The memory access pattern of the different array elements: white means not yet touched, light means older accesses, and dark means newer accesses.
– SRAM can operate at very high speeds comparable to the processor frequency.
– SRAM cells are based on CMOS logic, so they can be fabricated on the same chip as the processor; DRAM cells, in contrast, cannot be integrated with the processor on the same chip.
Cache Size: The capacity of the cache is an important parameter in the design of the cache memory. Large caches have a higher hit rate; however, this comes with the drawback of increasing the total cost of the cache due to the larger transistor count, and of increasing the latency as a result of the extra overhead introduced by the large transistor count.
Data Management: The data inside the cache is separated into individual entries of the same size called cache blocks, as shown in Figure 2.70. Each block contains an entry that stores a unique label identifying the block; this label is called the tag. During a data access, the processor searches the tag entries of each block to determine whether there is a tag that matches the address; if a matching tag is found, a hit occurs as described before. The tag entry also stores bits that describe certain properties of the block:
In this case the addresses N and N+8 will always conflict in this type of cache. This problem can be overcome by increasing the associativity of the cache: instead of eight sets of one block each, we can have four sets of two blocks, two sets of four blocks, or even a single set of eight blocks (a fully associative cache), as shown in Figures 2.72, 2.73 and 2.74. The conflict miss rate is reduced as the associativity of the cache increases, but this comes at the cost of more complex hardware, as more comparators, wider muxes, and larger tag stores are needed.
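The sketch below works through this mapping for a toy cache of 8 blocks, showing that N and N+8 always share a set but only compete for a single location in the direct-mapped case; the cache size and addresses are illustrative assumptions:

```c
#include <stdio.h>

/* Map block addresses N and N+8 into a toy 8-block cache for each
 * associativity; from 2 ways upwards both blocks fit in the shared set. */
int main(void) {
    const unsigned blocks = 8;
    const unsigned n = 3, m = 3 + 8;             /* block addresses N and N+8 */

    for (unsigned ways = 1; ways <= blocks; ways *= 2) {
        unsigned sets = blocks / ways;
        unsigned set_n = n % sets, set_m = m % sets;
        printf("%u-way, %u sets: N -> set %u, N+8 -> set %u, conflict %s\n",
               ways, sets, set_n, set_m,
               (set_n == set_m && ways == 1) ? "yes" : "no");
    }
    return 0;
}
```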
Figure 2.72: 2-way set associative cache: Blocks with the same index can map to 2 locations.
Figure 2.73: 4-way set associative cache: Blocks with the same index can map to 4 locations.
Figure 2.74: Fully associative cache: Any block can map to any location in the cache.
this approach introduces a high latency, as the processor has to wait until the request is written to the main memory, which has a much higher access penalty than the cache memory. This defeats the primary purpose of having a cache in the system, which is to reduce latency as much as possible.
In a memory hierarchy, the main memory always holds a copy of the data present in the cache; in other words, the relation between the cache and the main memory is inclusive. The relation between multiple caches in the memory hierarchy, however, cannot always be said to be inclusive. The relation between the different caches can take three forms:
1. Inclusive
2. Exclusive
3. Non-inclusive
Inclusive and exclusive relations are the two extremes: in an inclusive relation, the data in the highest cache level is guaranteed to be found in the lower cache levels, whereas in an exclusive relation, the data in the highest cache level is guaranteed not to be found in any lower level. The middle ground between these two opposite extremes is the non-inclusive relation, in which the data in the highest cache level may or may not be present in the lower-level caches. The main advantage of the exclusive relation is that it saves a great amount of cache capacity, because data in one level is not stored in any other level. The inclusive property, on the other hand, simplifies the search process, especially in multi-core systems: if some data has to be searched for in the local cache levels of a certain processor, only the largest cache in the hierarchy needs to be searched, because it is guaranteed to store all the data present in the other levels. The non-inclusive relation is less conservative than the other two: when inserting a new chunk of data into a cache at a certain level, there is no strict requirement to evict this chunk from other levels (as in the exclusive policy) or to transfer it to the remaining levels (as in the inclusive policy).
The second issue in multi-cache design is the way write requests are handled. If a block is written to a cache at a certain level, the block can be modified and contain data that differs from the versions in lower-level caches. This problem of inconsistent data is the same problem discussed in 2.6.1.4 and can be handled by either of the two techniques, write-back or write-through. It occurs in inclusive and non-inclusive caches, where multiple copies of the data are present in more than one cache level.
The cache can be divided into two types, the Instruction Cache and the Data Cache; both are required by the processor to execute any type of program. Dividing the cache into two parts has two advantages:
Preventing one type of data from occupying the cache completely. The processor needs both types of data to complete its execution; if one type of data is not in the cache memory, the processor has to fetch it from the lower-level memories, adding extra latency. By dividing the cache into instruction and data caches, all information needed by the different parts of the processor is available at the highest cache level with the minimum possible latency.
Not every part of the processor works on the same type of data. For example, the fetch engine is responsible for retrieving instructions, while the execute engine operates on non-instruction data. By separating the cache into two parts, each cache can be placed near the specific part of the processor that requires its services, which reduces the complexity of the interconnect.
The GPU designed in this project integrates multiple cores, and the presence of multiple cores requires a special cache system that can deal with the problem of inconsistent data seen by the different cores. To take a deeper look at the problem of data inconsistency, consider Figure 2.77, which illustrates the memory hierarchy of a multi-core system: the two processors fetch the value of the variable Var from the main memory into their local caches. Suppose that CPU1 performs some operations on the fetched variable and modifies its value, and the new value is stored in the local cache of CPU1. The problem arises when CPU2 tries to fetch the value of Var again, as shown in Figure 2.78: CPU2 sees the old value of the variable, which leads to errors when this variable is used in further calculations. This problem motivates the need for cache coherence, which ensures that each copy of data in the local cache of each processor is consistent.
the processor will not fetch outdated data; instead, it will retrieve the most recent data from the main memory or another processor's cache.
Implementing the cache coherence protocol in software simplifies the hardware, but it makes developing applications that target multi-core systems harder and slower, as it places more burden on the programmer. To make the cache coherence protocol invisible to the programmer, the hardware is modified to support the cache coherence protocol (CCP). The CCP is implemented in hardware using a snooping protocol, where the cache controller of each processor monitors, or snoops, the shared bus to observe write requests made by other processors. If the cache controller detects a write request, it checks whether it has a local copy of the requested block; if it has one in its local cache, it invalidates this copy before the other processor writes the new data to its local memory. For this reason the snooping protocol is also called the Write Invalidate Protocol.
Figure 2.79: The FSM of snoop protocol from the CPU side.
Figure 2.80: The FSM of snoop protocol from the Bus side.
State Definition
Invalid: This state indicates that a certain block in the cache is invalid, so the processor needs to fetch the block again from the main memory or from the local cache of another processor.
Shared: This state indicates that the cache contains the most recent copy of the data, and this data is shared among the main memory and the caches of the different processors within the system.
Exclusive: This state indicates that the value of the data copy in the cache is the same as that of the main memory.
Table 2.12: Definition of different states present in the FSM of snooping protocol.
Preventing the waste of resources: when a processor is idle, the shared memory can be used by other cores, which solves the problem of resource under-utilization.
A data chunk is stored only once in the cache memory, in contrast to a dedicated cache system, where the data needs to be written into every local cache of each processor; this solves the problem of data redundancy.
Reducing the communication rate between the different cores: the bus handles less traffic, which decreases the complexity of the interconnect design.
Schedule
Tag Access
Data Access
Response Merge
After the bank finishes serving the different requests, a module called Core Response
Merger combines the different responses and sends them to their corresponding cores.
The architecture of the cache is shown in Figure 2.82. In the upcoming section, each
module of the cache will be discussed.
several threads try to access the cache memory. If these accesses are not coordinated and synchronized, the rate of conflict misses and the delay of cache accesses increase, causing a significant loss in overall performance. To avoid these problems, a module called Core Request Bank Select is used to orchestrate cache memory accesses. The Core Request Bank Select module receives all the incoming requests from the different threads and distributes them fairly among the available banks; it therefore ensures that the workload is divided equally among the banks without overloading a particular bank, which prevents under-utilization of the cache memory. This module also provides an additional feature: the ability to track the number of clock cycles a certain processor has to wait before its request is served, which helps identify potential bottlenecks in the cache subsystem. When the module receives an address, it performs certain checks on the request address: by extracting the different address fields, as shown in Figure 2.83, it can determine the bank to which the request is sent and, in the case of a set-associative cache, the set this address belongs to. A potential problem can occur if two or more threads share the same access credentials, such as bank number, request tag, and set number; this leads to a bank conflict, as shown in Figure 2.84, where more than one thread is directed to the same bank. The bank selector module solves this problem by using virtual multi-porting, a technique that further divides the bank into multiple ports that can be accessed simultaneously. The bank selector checks the requests to the cache: if two requests target the same bank and the same cache line and virtual multi-porting is enabled, the selector resolves the requests to two different ports; if there are no available ports, the requests are blocked. Figure 2.85 shows the pseudo code of the virtual port assignment algorithm. The module is equipped with a built-in counter to keep track of the stalled requests in each bank, or in each port when virtual porting is enabled.
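The address-field extraction performed by this module can be sketched in C as below; the field widths (word offset, bank bits, set bits) are assumptions chosen for illustration, while the real module derives them from its configuration parameters:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 2   /* byte offset inside a 32-bit word */
#define BANK_BITS   2   /* 4 banks  */
#define SET_BITS    4   /* 16 sets  */

/* Split a request address into bank index, set index and tag. */
static void decode_addr(uint32_t addr) {
    uint32_t bank = (addr >> OFFSET_BITS) & ((1u << BANK_BITS) - 1);
    uint32_t set  = (addr >> (OFFSET_BITS + BANK_BITS)) & ((1u << SET_BITS) - 1);
    uint32_t tag  =  addr >> (OFFSET_BITS + BANK_BITS + SET_BITS);
    printf("addr 0x%08X -> bank %u, set %2u, tag 0x%X\n", addr, bank, set, tag);
}

int main(void) {
    decode_addr(0x00001234);
    decode_addr(0x00001238);   /* the neighbouring word lands in a different bank */
    return 0;
}
```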
2.6.2.1.1 RTL verification This section discusses the verification of the bank selector module under different scenarios. Suppose that there are 4 threads that send requests to the cache memory as shown in Table 2.13.
Figure 2.86 shows the simulation of the bank selector module, configured to handle 4 requests and distribute them over 4 banks. The simulation shows the bank number to which each request is forwarded, with the bank ID extracted from the address associated with each request; it also shows the different output signals that represent components of a request, such as data and tag. Figure 2.87 shows the case where only 2 cache banks are available while 4 requests need to be handled; two of these requests are therefore ignored and only the first two requests are handled. Figure 2.88 shows the case where two requests with the same address want to access the cache memory; since the bank selector in this case does not support virtual multi-porting, the two requests conflict and the second request is not handled. This problem can be solved by enabling virtual multi-porting, as shown in Figure 2.89: the two requests that target the same bank are each distributed to one of the available ports.
Table 2.14: Two requests sent to the cache with the same addresses.
Figure 2.86: RTL simulation of bank selector core, when the number of requests is equal to the
available banks.
Figure 2.87: RTL simulation of bank selector core, when the number of requests is higher than
the available banks.
Figure 2.88: RTL simulation of bank selector core, when two requests are sent to the cache
without enabling multi-porting.
Figure 2.89: RTL simulation of bank selector core, when two requests are sent to the cache
with the enabling of multi-porting.
2.6.2.2.3 RTL verification Dual-port RAM and single-port RAM can be imple-
mented in Verilog by writing RTL code that instructs the synthesis tool to utilize the
BRAM available in the FPGA. The implemented RAM module is flexible and can be
easily configured according to user specifications through certain parameters in Table
2.15.
Parameter Description
DATAW: Data bus width
SIZE: The depth of the RAM
ADDRW: The width of the address bus
BYTEENW: The byte access behavior
OUT REG: Whether the output is registered or not
INIT ENABLE: Controls the initialization process of the RAM
INIT FILE: The path to the text file used to initialize the RAM
Figure 2.93 shows the RTL simulation of the dual-port RAM module when the INIT ENABLE parameter is enabled; in this case, the entries of the RAM are filled with the values from the file specified by the INIT FILE parameter. The RTL RAM can then act as a ROM that stores unmodified data, or as an image memory that stores the input image to be processed later.
The memory can also be configured as a byte-addressable memory, which means that the processor can send requests that modify only some bytes instead of the whole word. The position of the bytes to be modified is determined by the wren signal shown in Figure 2.90, whose width is determined by the BYTEEN parameter, which represents the number of bytes within one word. Figure 2.92 shows the RTL simulation of the RAM module with BYTEEN equal to 4.
Figure 2.93: RTL simulation of dual-port RAM when BYTEEN parameter is equal to 4.
The performance of the cache subsystem is strongly affected by the pattern in which the processor accesses the cache. Some applications, such as sparse linear algebra and graph analytics, have irregular cache access patterns, which increases the miss rate and thus reduces the overall performance of the cache system [32]. To address the issue of frequent cache misses, the cache needs to be capable of processing further requests from the processor even when a miss occurs, rather than merely waiting for the miss to be resolved. A type of cache memory known as the non-blocking cache can solve this problem by applying optimizations that enable the cache to handle multiple hits while resolving a miss. This scheme, known as hit under miss, is shown in Figure 2.94: the miss penalty is reduced by handling hit requests until a second miss occurs, at which point the cache stalls. A more complex and effective optimization is miss under miss. In this method, shown in Figure 2.95, the cache overlaps the miss penalty of several misses by serving different hit and miss requests until one of the following events occurs:
All in-flight instructions are waiting for their operands and there are no ready instructions to execute.
Although non-blocking caches have the potential to increase system performance considerably, their design and implementation are in practice very complex, for two reasons:
The potential collision between hits and misses. This problem does not arise in blocking caches, because the processor has to wait until all misses are resolved and no further requests are processed; in a non-blocking cache, on the other hand, a hit can collide with a miss returning from the lower memory hierarchy, and in systems that support multiple outstanding misses, the misses can collide with each other. To handle these situations, non-blocking caches give hits higher priority than misses and may need to order the handling of colliding misses.
The second challenge is keeping track of the misses that have occurred in the cache. This problem is trivial in blocking caches because only one miss is resolved at a time, but this is not the case in non-blocking caches, where multiple outstanding misses are handled concurrently. At first this does not seem like a hard task, because one might think that the misses will return in the same order they were initiated, so that a simple queue would be sufficient to track them. However, this is not true in many scenarios: a miss in the L1 cache may cause a hit or a miss in the L2 cache, and if the L2 cache is also non-blocking, the order in which requests are returned from the L2 cache to the L1 cache can differ from the original order of the requests. Additionally, in systems with multiple cores or multiprocessors, variations in cache access times can further complicate the tracking of outstanding misses.
Modern processors use a set of registers named Miss Status Handling Registers
(MSHRs) to keep track of different cache misses. The number of MSHRs in the cache
corresponds to the maximum number of outstanding misses that the cache can handle
concurrently. Each entry in the MSHR stores some important information about the
cache position in which the miss has occurred, the tag value associated with this miss,
and the instruction that has triggered this miss. Figure 2.96 shows how the MSHR is
integrated with the non-blocking cache to handle cache misses.
Figure 2.96: Integration of the miss reservations with the cache memory.
To handle cache misses properly, the MSHR supports five main functions, described below; a simplified software sketch of the lookup, fill, and replay behavior follows the list:
Lookup operation: This operation is used when a miss occurs to search the MSHR entries and identify whether there is a pending request with the same address as the new miss. If so, there is no need to allocate a new entry in the MSHR; if there is no match, a new location is allocated for the new cache miss. The summary of this operation is shown in Figure 2.97.
Replay operation: When the data arrives from the lower memory hierarchy, this operation notifies all the entries in the MSHR whose tag matches the memory response that the data has arrived. The ready bit in the entries with matching tags is set to 1, indicating that these entries are ready to be dequeued from the MSHR.
Fill operation: This operation occurs when the memory responds with the required data; the entry to fill with this data is determined by the memory ID, which is one of the inputs of the MSHR module.
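The sketch below models the lookup, fill, and replay behavior described above in plain C; the entry count, field names, and the exact information stored per miss are assumptions for illustration rather than the RTL's actual fields:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHR 8

typedef struct {
    bool     valid;   /* entry tracks an outstanding miss          */
    bool     ready;   /* data has arrived and the entry can replay */
    uint32_t addr;    /* block address (tag + index) of the miss   */
} mshr_entry_t;

static mshr_entry_t mshr[NUM_MSHR];

/* Lookup: reuse an entry with the same address if one is pending,
 * otherwise allocate a free entry.  Returns the entry ID, or -1 if full. */
int mshr_lookup(uint32_t addr) {
    int free_id = -1;
    for (int i = 0; i < NUM_MSHR; i++) {
        if (mshr[i].valid && mshr[i].addr == addr) return i;  /* merge miss */
        if (!mshr[i].valid && free_id < 0) free_id = i;
    }
    if (free_id >= 0)
        mshr[free_id] = (mshr_entry_t){ .valid = true, .ready = false, .addr = addr };
    return free_id;
}

/* Fill + replay: the memory response carries the ID of the entry it serves;
 * every entry waiting on the same address is marked ready to be dequeued. */
void mshr_fill(int mem_id) {
    uint32_t addr = mshr[mem_id].addr;
    for (int i = 0; i < NUM_MSHR; i++)
        if (mshr[i].valid && mshr[i].addr == addr)
            mshr[i].ready = true;
}
```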
Schedule Stage: The incoming core requests are not the only source that needs to use the cache; there are other sources that need to access the cache at certain times, such as:
A priority arbiter at the schedule stage grants access to the different requests; the order of priorities is as follows:
– Flush Request
– Core Request
Tag Access Stage: When a core request is sent to the cache bank, the tag portion of the physical address associated with the request is checked to determine whether it is present in the tag store; if so, a tag match signal is asserted. In the case of a memory fill request, the tag store is filled with the tag associated with the memory request. A flush request invalidates the valid bit corresponding to a certain entry in the tag store.
Data Access Stage: In this stage the actual data is retrieved if the request tag causes a match in the tag store. In the case of fill requests, the data coming from the main memory is stored in the data store entry corresponding to the memory ID associated with the memory request.
Response Stage: In this stage the responses to the received requests are sent back to the cores through elastic buffers. In the case of a miss, memory requests are also sent to the lower memory hierarchy.
Figure 2.99: Access of Four-way interleaved cache banks using block addressing.
2.6.2.4.1 RTL verification Figure 2.101 shows the structure of the cache bank. The bank uses an elastic buffer to receive the different core requests or memory responses. The main reason for using an elastic buffer is to ensure smooth data transfer when connecting different domains, as the clock frequency may differ between the core side and the cache side, or between the main memory side and the cache side. The elastic buffer decouples the two sides (transmitter and receiver) and uses a valid/ready handshake mechanism to transfer data between them: when the sender has data to send, it asserts the valid signal, and when the receiver is ready to receive data, it asserts the ready signal; when both signals are high, the data transfer occurs. The bank can enter a deadlock state in two cases:
There is an incoming memory response and the memory request queue is full.
To overcome this problem, a counter is used to keep track of the number of pending requests in the MSHR; if the MSHR is almost full, the bank asserts a signal to block new requests from being issued.
The three requests in Table 2.16 are sent to the cache bank, where all the addresses associated with the requests cause a miss. Figure 2.102 shows how the bank responds to these requests: the bank sends memory requests to the lower-level memory in the hierarchy, and each request is uniquely identified with an ID that indicates its position in the MSHR. Figure 2.102 also shows the main advantage of the non-blocking cache bank: it can keep working during a stall without waiting for the response to arrive from the main memory. Table 2.17 shows the responses from the main memory; each response has an ID corresponding to the MSHR entry that contains the address of the request waiting for the data. Figure 2.103 shows the bank responses to the core after the misses are resolved. The core response data is the same as the data received from the main memory.
ID Data
0x0 0xaabbccdd
0x1 0xccddeeff
0x2 0xbbccddee
The response merger scans all virtual ports of each bank to find a valid port.
The valid ports are the ports that have valid requests but have not been serviced
yet.
The request signals from the valid ports are sent to the Find First Tree to choose the tag to be processed, as shown in Figure 2.104.
A two-dimensional array keeps track of which ports’ responses have been sent and
removes them from the Find First Tree.
This array is cleared once all Active Ports in a bank are serviced.
All the requests from the ports that have a tag that matches the one selected by
the Find First Tree are merged then their responses are sent as one batch.
The Valid Signal and Requested Data are sent out to the threads that initiated the
request.
That Virtual Port is then marked as serviced so that it is removed from the Find First Tree. The summarized steps of this algorithm are shown in Figure 2.105.
to use the Cache. Initially, the different threads initiate a request to the cache when it is ready to receive signals; each lane contains a certain number of requests. Each request is associated with metadata that carries certain information about the request, such as the valid signal, the read/write flag, tag, address, and byte enable; these data are merged and then transferred to the stream arbiter. The Stream Arbiter grants a request and sends it to the Cache. It grants the same request index for each Lane; for example, when it grants request 2, that request is granted across all Lanes. The Cache then sends the response back to the Cache Arbiter. This response contains the Valid signal, Thread Mask, Requested Data, and Tag. The response then goes through the stream demux. After the response is demuxed, there is a response for each requestor in every lane that contains: Valid, Thread Mask, Requested Data, and Tag.
Stream arbiter uses the Round-Robin policy as the arbitration algorithm to schedule
the different incoming requests. The round-robin scheduling policy is chosen due to the
following reasons:
The low latency needed to select and forward the requests to the cache.
The fairness in sharing the resources among different requestors.
The relatively low worst-case delay.
The low utilization of hardware resources.
To ensure that arbitration is performed at a speed close to the original clock speed of the system, the round-robin scheduler checks the requests in parallel to determine the first valid request to be scheduled. This scan is done using a parallel prefix OR operation to determine the position of the first valid request, and a priority encoder is then used to get the index of this location. To ensure that all requests are eventually served, the scheduler gives the chosen request the lowest priority after it is scheduled, so later requests get a chance to be scheduled. Figure 2.108 shows the RTL simulation of the round-robin scheduler accepting 5 requests, all of them valid; every clock cycle the scheduler grants a different request and produces the index of the granted request alongside the one-hot-encoded grant vector.
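The grant behavior can be modeled in C as below. For clarity the sketch scans the requests with a loop, whereas the RTL computes the same result with a parallel prefix OR and a priority encoder; the request width of 5 matches the simulation example, and the names are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_REQ 5

static int last_grant = NUM_REQ - 1;   /* so requestor 0 has priority first */

/* Round-robin arbitration: starting after the last grant, pick the first
 * valid request, then make the granted requestor the lowest priority. */
int rr_arbitrate(uint32_t req) {
    for (int i = 1; i <= NUM_REQ; i++) {
        int idx = (last_grant + i) % NUM_REQ;
        if (req & (1u << idx)) {
            last_grant = idx;
            return idx;
        }
    }
    return -1;                         /* no valid request */
}

int main(void) {
    uint32_t req = 0x1F;               /* all five requests valid */
    for (int cycle = 0; cycle < 6; cycle++)
        printf("cycle %d: grant %d\n", cycle, rr_arbitrate(req));
    return 0;
}
```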
AXI-Controller
3.1 Background
To connect the GPU with the main memory, an interface is needed to standardize the transactions between the GPU and the memory and to abstract the resources of the GPU, so that when it is used in a heterogeneous system with a host processor, the processor can communicate easily with the GPU through shared memory visible to both architectures. The Advanced eXtensible Interface (AXI) is used to connect the GPU and the main memory. The AXI protocol is part of the Advanced Microcontroller Bus Architecture (AMBA) protocol, an open-standard on-chip interconnect used for the connection and management of several functional units within a system-on-chip (SoC). The usage scope of AMBA has gone beyond microcontroller devices; it is nowadays utilized in a wide range of ASIC and SoC applications. AMBA was introduced by ARM in 1996. The first AMBA buses were the Advanced System Bus (ASB) and the Advanced Peripheral Bus (APB). In its second version, AMBA 2.0 in 1999, ARM added the AMBA High-performance Bus (AHB), a single-clock-edge protocol. In 2003, ARM introduced the third generation, AMBA 3.0, with the Advanced eXtensible Interface (AXI) to reach even higher-performance interconnects. In 2010, ARM introduced the fourth generation, AMBA 4.0, which includes the second version of AXI, AXI4. AXI4 is used to wrap many IPs because it offers three important properties:
2. Flexibility: AXI protocol provides three types of interfaces suitable for different
applications
(a) AXI4: It is used for memory-mapped interfaces and allows sending data bursts
up to 256 data transactions with just one address.
(b) AXI4-Lite: It is similar to the AXI4 interface but with smaller resource
utilization.
(c) AXI4-Stream: It is used for high-speed streaming data and can burst an unlimited amount of data.
Write Address
Write Data
Write Response
Read Address
Read Data
The AXI Protocol uses a two-way handshake mechanism. The information source uses
the VALID signal to show when valid address, data, or control information is available
on the channel. The destination uses the READY signal to show when it can accept the
information.
Signal Description
mem req valid: Validity of the core request
mem req rw: Whether the request is a read or a write
mem req byteen: Which byte within the word has valid data
mem req addr: Memory address to read from or write to
mem req tag: Memory tag requested
mem req data: Data to be written to the memory
1. mem req addr is set to the address of the block that the core needs to read from
and mem req tag is set to the Tag ID.
2. mem req valid is asserted to indicate that all control signals are available at the
input.
After the request signals are configured, the AXI adapter converts them into AXI4-compatible signals and then sends the request to the slave side to fetch the required data, as shown in Figure 3.2.
1. mem req addr is set to the address of the block that the core needs to write to
and mem req tag is set to the Tag ID.
3. Strobe (Which Byte Lane is Valid) is set with mem req byteen.
4. mem req valid is asserted to indicate that all control and data signals are available
at the input.
Similar to the read operation discussed before, the AXI adapter converts the request signals into AXI4-compatible signals and then sends them to the memory.
4.1 Introduction
The project's main aim is not only to develop a high-performance GPU but also to provide an easy software framework to test and execute real-world applications, instead of relying only on simulators such as Questa Sim, Vivado, and ModelSim to perform RTL simulations. The development of such a framework accelerates the testing process and makes it easier to integrate new modules into the device and test them without concern for the underlying hardware. The tools used to implement the software testing framework are discussed in the upcoming sections. The testing environment in this thesis was not built from scratch; it is a modified version of the testing environment used in the Vortex project [33].
4.2 Ramulator
To evaluate the performance of the implemented GPU on the FPGA, an important parameter must be taken into consideration: the latency of the Dynamic Random Access Memory (DRAM). Many projects ignore this aspect and assume that data will always be available in the cache; however, in real-world applications the program binaries are first located in the main memory and then fetched by the processor into the cache, and the latency of the main memory is very high compared to the cache and often dominates the execution time of many applications. The main challenge in simulating a DRAM-based main memory is that the technology used to manufacture DRAM cells is not compatible with the Complementary Metal Oxide Semiconductor (CMOS) technology used to manufacture the FPGA logic, so the real latency of the DRAM is very hard to emulate on the FPGA. To overcome this problem, a DRAM simulator known as Ramulator is used. The Ramulator framework, developed by researchers at ETH Zurich [34], is an efficient and accurate DRAM simulator designed to provide cycle-accurate simulations of DRAM systems while being easily extendable to support newer versions of DRAM. Ramulator utilizes a generic template to simulate DRAM systems; this generic template can later be customized to support the specific characteristics of different DRAM protocols, a structure that enables Ramulator to support various standards such as DDR3/4, LPDDR3/4, WIO1/2, GDDR5, HBM, and even academic proposals like TLDRAM, AL-DRAM, and RowClone.
OpenCL API: The OpenCL API and the remaining application code run on the host device (CPU).
The different communications between the host device and the other accelerators, including memory transactions, are done through a special data structure known as the Command Queue. The command queue connects the CPU with each accelerator device, and through it the CPU sends the instructions to be executed to each device. The CPU invokes the different devices through kernels, which are arranged in a two-dimensional index space, shown in Figure 4.3, where each node represents a Work Item that will be executed on a certain processing element; these work items are grouped into a Work Group that is executed on a whole computing unit. In this project, the processing element is the SIMD core, while the computing unit is the top GPU module that contains a group of SIMD cores.
Host Memory: This type of memory is available for the host device.
Local Memory: This type of memory is available for all the processing elements
within the compute unit.
Global Memory: This type of memory is available to all the computing units in the device.
Private Memory: This type of memory is available for a single processing element.
1. The first step is to identify and select the OpenCL platform to execute the program kernels, as the OpenCL standard supports the existence of multiple platforms for
different vendors in the same system. The host CPU uses an API to identify the
different platforms supported by the system.
2. After selecting the platform, the host API queries the chosen platform to identify the devices it supports; for example, the POCL platform supports different types of architectures such as x86, RISC-V, and Nvidia GPUs.
3. Prepare the environment for the selected device by allocating different data struc-
tures used for executing OpenCL programs such as command queues, memory
buffers, and kernels. These structures cannot be shared by more than one process.
4. Create a command queue to launch the kernels on the target devices, the command
queue is exclusive only for each device.
6. Generate and build a program entity. An OpenCL program entity is a group of kernels, which are special functions written in the OpenCL C language and uniquely identified by the kernel attribute. There are two ways to create these kernels:
(a) Compile OpenCL C code using vendor kernel compiler during runtime.
(b) Using pre-compiled binary image generated before execution.
7. Generate a kernel object, which is a container that wraps a kernel function and provides a way to pass parameters to it.
8. In this step, memory buffers and data structures are passed to the kernel function using a special function known as clSetKernelArg.
9. Use special commands to move the data from the host memory to the accelerator
memory through the command queue.
10. The final step is to execute the kernel using the clEnqueueNDRangeKernel
function.
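A condensed host-side sketch of these steps, written against the standard OpenCL C API, is shown below; error handling is omitted, and the kernel source, buffer sizes, and kernel name ("vecadd") are placeholders rather than the actual benchmark code:

```c
#include <CL/cl.h>
#include <stddef.h>

static const char *kernel_src =
    "__kernel void vecadd(__global const float *a, __global const float *b,"
    "                     __global float *c) {"
    "    int i = get_global_id(0); c[i] = a[i] + b[i]; }";

void run_kernel(float *a, float *b, float *c, size_t n) {
    cl_platform_id platform;  cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);                                 /* step 1  */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);   /* step 2  */

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL); /* step 3  */
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);      /* step 4  */

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);                   /* step 6  */
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);                   /* step 7  */

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);                            /* step 8  */
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    clEnqueueWriteBuffer(q, da, CL_TRUE, 0, n * sizeof(float), a, 0, NULL, NULL); /* step 9 */
    clEnqueueWriteBuffer(q, db, CL_TRUE, 0, n * sizeof(float), b, 0, NULL, NULL);

    size_t global = n;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);  /* step 10 */
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c, 0, NULL, NULL);
    clFinish(q);
}
```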
4.7 POCL
To enable the execution of the OpenCL kernels on the GPU, the OpenCL binaries
must be compatible with the RISC-V architecture. The Portable Computing Language
(POCL) is an open-source implementation of OpenCL that supports various types of architectures such as ARM, x86, and RISC-V. It also provides support for users' custom-designed accelerators. The POCL compiler uses Clang as a front-end framework to compile OpenCL code and the Low Level Virtual Machine (LLVM) compiler as a back-end framework to transform the binaries generated from the OpenCL compilation so that they are compatible with the target device. The POCL runtime environment is built on three distinct layers:
OpenCL Runtime API Calls: This layer is a middleware between the software applications and the actual OpenCL runtime environment; it offers standard APIs that the running software applications can use to perform different operations.
– Compiler driver
– File system abstraction
– Memory abstraction
– Threading and synchronization libraries
Figure 4.6: The process of compiling OpenCL applications to binaries that target RISC-V
system using POCL compiler.
Figure 4.7: The overview of the configuration used to perform cycle-accurate simulation.
After obtaining these parameters, the initialize tasks function uses them to distribute the workload equally among the threads and warps within each processing core. For example, if a convolution operation is to be performed on an image, the image is divided into Work Groups equal in number to the GPU's cores, and these Work Groups are then divided further into Work Items equal in number to the warps supported by each core; the Work Items contain parallel instructions corresponding to the number of threads within a warp. After the OpenCL kernels are compiled and the binary image is produced, the host CPU executes a sequence of code to communicate with the target device and initialize the global memory with the data it requires. Figure 4.9 shows the steps performed to launch the OpenCL application; the code follows the steps shown in Figure 4.5:
Buffer Allocation: Using the clCreateBuffer function, the host processor creates
three buffers to be able to pass the input arguments and retrieve the output results
from the kernel function.
Kernel Creation: Using the clCreateKernel function, the host processor creates
a kernel to be executed on the target device with a unique name identified by the
parameter KERNEL NAME.
Setting Kernel Arguments: Using the clSetKernelArg function, the host pro-
cessor sets the input and output buffers of the kernel.
DSPs 2020
BlockRAMs 755
LUTs 277400
IO 900
Num of Cores LUTs (%) DSP (%) FF (%) LUTRAMs (%) Fmax MHz
1 9.73 1.19 2.72 6.9 142
2 19.58 2.38 5.61 13.80 142
4 39.12 4.75 11.14 27.61 142
6 58.87 7.13 16.68 41.41 142
8 77.34 9.50 22.18 55.22 142
10 96.94 11.88 27.72 69.02 142
Figure 5.2 shows a power analysis for one core based on the default power estimation of Vivado: the total power consumed is 0.711 W, consisting of 69% dynamic power and 31% static power. Figure 5.3 shows the power consumption of the GPU with eight cores: the total power consumed is 5.82 W, consisting of 95% dynamic power and 5% static power.
5.2 Benchmarking
5.2.1 Introduction
In this section the performance of the GPGPU implemented in this thesis is evaluated
using different benchmarks that are classified into two main categories:
Computing-bound benchmarks such as sfilter, sgemm, and vecadd, which are subsets of the Rodinia benchmarks [35].
The performance metric used to evaluate the performance and efficiency of the GPGPU is Instructions Per Cycle (IPC). IPC measures the average number of instructions executed in one clock cycle. IPC is used as an evaluation metric for the following reasons:
Performance Analysis: IPC is a good tool for analyzing the performance of a program and identifying bottlenecks, as a lower IPC indicates a potential bottleneck in the hardware architecture.
Figure 5.6: IPC results of merge sort algorithm for different GPU configurations with a constant
workload of 32 elements.
Figure 5.7: IPC results of merge sort algorithm for different workload sizes.
Figure 5.8 shows how the execution time of the program, in terms of the number of clock cycles, changes with varying workloads for different core counts. It can be concluded that with a higher core count the program workload is distributed among the different cores, so the overall execution time does not increase as rapidly as in the case of a lower core count.
Figure 5.8: The count of clock cycles needed to sort various arrays with diverse workloads.
5.2.2.2 P-Sort
In this section another sorting algorithm will be evaluated on the GPGPU. A sorting
algorithm known as Parallel Counting Sort will be used to sort an array of floating
point numbers with varying sizes.
Figure 5.9: The execution of psort algorithm on the GPGPU with 1 core and workload of 12
elements.
Figure 5.10: IPC results of p-sort algorithm for different GPU configurations with a constant
workload of 32 elements.
Figure 5.11: IPC results of p-sort algorithm for different workload sizes.
Figure 5.12: The count of clock cycles needed to sort various floating-point arrays with diverse
workloads.
Figure 5.13: IPC results of Convolution algorithm for different GPU configurations with a
constant workload of 16x16 input Matrix and 3x3 filter.
Figure 5.14: Execution time in microseconds of Convolution algorithm for different GPU con-
figurations with a constant workload of 16x16 input Matrix and 3x3 filter.
Figure 5.15: IPC results of Convolution algorithm for different core counts configurations with
different workload sizes and constant 3x3 filter.
Figure 5.16: Execution time in milliseconds of Convolution algorithm for different core counts
configurations with different workload sizes and constant 3x3 filter.
Figure 5.18: IPC results of Sgemm benchmark for different GPU configurations with a constant workload of a floating-point 64x64 matrix.
Figure 5.19: Execution time in milliseconds of Sgemm benchmark for different GPU configurations with a constant workload of a floating-point 64x64 matrix.
Figure 5.20: IPC results of Sgemm benchmark for different core counts with different matrix
sizes.
Figure 5.21: Execution time in milliseconds of Sgemm benchmark for different core counts with
different matrix sizes.
Figure 5.23: IPC results of Saxpy benchmark for different GPU configurations with a constant
workload of floating-point array with 4096 elements.
Figure 5.24: IPC results of Saxpy benchmark for different core counts with different array sizes.
Figure 5.25: Execution time in milliseconds of Saxpy benchmark for different core counts with
different array sizes.
Figure 5.27: The execution of the K-NN algorithm to find the 4-nearest hurricanes to a location
of a longitude equal to 60 degrees and latitude of 20 degrees.
Figure 5.28: IPC results of executing K-NN algorithm to find the 5 nearest neighbors for
different cores counts.
Figure 5.29: Execution time in milliseconds of K-NN algorithm to find the 5 nearest neighbors
for different cores counts.
Figure 5.31: IPC results of Gaussian elimination algorithm for different GPU configurations
with a constant workload of 4x4 matrix.
Figure 5.32: Comparison between the execution time of the K-NN algorithm on the GPGPU with 8 cores and on an Intel Xeon E5-1640.
Figure 5.33: Comparison between the execution time of convolution on the GPGPU with 8 cores and 4 cores, and on an Intel Xeon E5-1640.
Figure 5.35: Comparison between the execution time of convolution on the GPGPU with 8 cores and 4 cores, and on a TPU v2.
Figure 5.37: Comparison between the execution time of convolution on the GPGPU with 8 cores and on a Tesla-4 GPU.
Chapter 6
Conclusion and Future Work
Decode Stage: This stage takes the 32-bit instructions fetched from the instruction cache and extracts their fields to determine the instruction type (I-type, R-type, or S-type) as well as the operand registers (source register 1, source register 2, and the destination register), and then selects the processing element that should execute the instruction.
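For the base RV32 encoding these fields sit at fixed bit positions, so the extraction amounts to a handful of shifts and masks. The C sketch below is illustrative only; the struct and function names are not taken from the thesis RTL.

#include <stdint.h>

/* Illustrative decode of the fixed RV32 base fields.  The opcode
 * selects the instruction format (R/I/S/...), and rd/rs1/rs2 index
 * the register file. */
typedef struct {
    uint8_t opcode;   /* bits  6:0  */
    uint8_t rd;       /* bits 11:7  */
    uint8_t funct3;   /* bits 14:12 */
    uint8_t rs1;      /* bits 19:15 */
    uint8_t rs2;      /* bits 24:20 */
    uint8_t funct7;   /* bits 31:25 */
} rv32_fields_t;

rv32_fields_t rv32_decode_fields(uint32_t instr)
{
    rv32_fields_t f;
    f.opcode = (uint8_t)( instr        & 0x7F);
    f.rd     = (uint8_t)((instr >> 7)  & 0x1F);
    f.funct3 = (uint8_t)((instr >> 12) & 0x07);
    f.rs1    = (uint8_t)((instr >> 15) & 0x1F);
    f.rs2    = (uint8_t)((instr >> 20) & 0x1F);
    f.funct7 = (uint8_t)((instr >> 25) & 0x7F);
    return f;
}

In hardware, the same slicing is performed combinationally on the fetched instruction word.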
Issue Stage: This stage checks the operands extracted by the decode stage to verify whether all source operands of the instruction are ready. If they are and no data hazards exist, the operand values are read from the banked register file; otherwise, the issue stage picks the next available instruction and parks the stalled instruction in a reservation station until all of its operands become available.
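As a simplified software model of that readiness check (the per-register busy-bit array and the function names are assumptions for this sketch, not the thesis scoreboard RTL):

#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the issue check, assuming a scoreboard that
 * keeps one "busy" bit per architectural register (a register is busy
 * while an in-flight instruction will still write it). */
#define NUM_REGS 32

typedef struct {
    uint8_t rs1, rs2, rd;
} issue_slot_t;

bool can_issue(const issue_slot_t *in, const bool busy[NUM_REGS])
{
    /* RAW hazard: a source is still being produced.            */
    /* WAW hazard: the destination already has a pending write.  */
    return !busy[in->rs1] && !busy[in->rs2] && !busy[in->rd];
}

void on_issue(const issue_slot_t *in, bool busy[NUM_REGS])
{
    busy[in->rd] = true;    /* reserve the destination until writeback */
}

void on_writeback(uint8_t rd, bool busy[NUM_REGS])
{
    busy[rd] = false;       /* writeback clears the scoreboard entry */
}

When can_issue fails, the instruction is parked in a reservation station, as described above, and retried once a writeback clears the blocking entry.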
Execute Stage: This stage contains different processing elements to execute sev-
eral types of operations that are determined at the decode stage:
1. ALU Unit: This unit is responsible for executing arithmetic and logical operations, as well as handling the different types of branch instructions.
2. FPU Unit: This unit is responsible for executing floating-point operations such as addition, multiplication, and square root.
3. LSU Unit: This unit is responsible for handling read and write requests to the external memory.
4. CSR Unit: This unit is responsible for reading and writing the system registers that hold status and configuration information, such as the number of threads, warps, and cores.
5. Conv Unit: This unit is responsible for accelerating the convolution operation; a software reference model of this operation is sketched after this list.
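The following C reference model illustrates the zero-padded 3x3 convolution that the unit accelerates; it is a behavioural sketch for clarity, not the RTL datapath.

#include <stddef.h>

/* Reference model of a 3x3 filter slid over an h*w input with zero
 * padding at the borders. */
void conv3x3_reference(const float *in, const float *filt, float *out,
                       size_t h, size_t w)
{
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int fy = -1; fy <= 1; fy++) {
                for (int fx = -1; fx <= 1; fx++) {
                    long iy = (long)y + fy, ix = (long)x + fx;
                    if (iy < 0 || ix < 0 || iy >= (long)h || ix >= (long)w)
                        continue;             /* zero padding */
                    acc += in[iy * w + ix] * filt[(fy + 1) * 3 + (fx + 1)];
                }
            }
            out[y * w + x] = acc;
        }
    }
}

Each output element is independent, which is what makes the operation attractive both for a dedicated unit and for mapping one output per thread, as in the 16x16-input, 3x3-filter benchmark configuration.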
Commit Stage: This stage is responsible for updating the register file once the current instruction finishes execution.
Chapter 3 describes the structure of the AXI adapter that connects the processing
cores with the external memory.
Chapter 4 describes the software stack of the GPGPU. The GPGPU implemented in this thesis supports OpenCL as its main programming language: OpenCL code is compiled with the POCL compiler to produce RISC-V compatible binaries, which the host processor then loads into a global memory visible to both architectures. The chapter also describes the GPGPU RTL driver, which abstracts the microarchitecture of the GPGPU from the host processor, so that a programmer can write code targeting the GPGPU without detailed knowledge of its architecture.
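As an illustration of what such host code looks like, the heavily abbreviated flow below uses the standard OpenCL C API; the kernel name "example_kernel", the single output buffer, and the omission of error handling are assumptions made for brevity, with POCL and the RTL driver performing the compilation and binary loading behind these calls.

#include <CL/cl.h>
#include <stddef.h>

/* Abbreviated OpenCL host flow (error handling omitted).  The kernel
 * is assumed to take a single float output buffer as argument 0. */
void run_kernel(const char *src, size_t n, cl_mem *out)
{
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* POCL compiles the OpenCL C source into RISC-V binaries here. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "example_kernel", NULL);

    *out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), out);

    size_t global = n;                        /* one work-item per element */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(q);                              /* wait for the GPGPU to drain */
}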
Chapter 5 evaluates the area and the performance of the GPGPU. The GPGPU supports up to 8 cores, and each core can be configured with up to 4 warps of 8 threads each, so the maximum number of supported hardware threads is 256. The GPGPU also supports up to a 16 KB instruction cache and a 16 KB data cache, with an optional level-2 cache. The remainder of the chapter evaluates the performance of the GPGPU with different core configurations using the Rodinia benchmark suite, a well-known collection of programs for evaluating the performance of heterogeneous systems [35].
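Spelled out, the thread limit is simply the product of these configuration parameters:
\[
8\,\text{cores} \times 4\,\frac{\text{warps}}{\text{core}} \times 8\,\frac{\text{threads}}{\text{warp}} = 256\,\text{threads}.
\]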
Figure 6.1: The host flow of an OpenCL application in the design environment.
6.2 Future Work
Rasterization Stage: This stage receives the triangles produced by the Geometry stage and traverses the triangle edges to map them to pixels on the screen.
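One common way to implement this traversal is with edge functions: a pixel is covered when it lies on the same side of all three triangle edges. The C sketch below is a generic illustration of that idea, not the design proposed here.

/* Edge function: signed area of the triangle formed by edge (a,b) and
 * the pixel p; its sign tells which side of the edge p lies on. */
float edge(float ax, float ay, float bx, float by, float px, float py)
{
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax);
}

/* Marks covered pixels in an h*w coverage mask for one triangle. */
void rasterize_triangle(const float v[3][2], unsigned char *mask,
                        int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float px = x + 0.5f, py = y + 0.5f;   /* pixel centre */
            float e0 = edge(v[0][0], v[0][1], v[1][0], v[1][1], px, py);
            float e1 = edge(v[1][0], v[1][1], v[2][0], v[2][1], px, py);
            float e2 = edge(v[2][0], v[2][1], v[0][0], v[0][1], px, py);
            int inside = (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                         (e0 <= 0 && e1 <= 0 && e2 <= 0);
            mask[y * w + x] = (unsigned char)inside;
        }
    }
}

A practical rasterizer would restrict the loops to the triangle's bounding box and interpolate vertex attributes from the same edge values, but the coverage test above is the core of the stage.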
Texture Stage: This stage is responsible for adding surface details to 3D objects.
Appendix A
GPGPU Instruction Set Architecture (ISA)
Figure A.12: Control Status Register (CSR) Instructions supported by the GPGPU.
Bibliography
[1] Vijay Sharma, Manisha Pattanaik, and Balwinder Raj. ONOFIC approach: low power high speed nanoscale VLSI circuits design. Mar. 2013. doi: 10.1080/00207217.2013.769186.
[2] John L. Hennessy and David A. Patterson. Computer Architecture, Sixth Edition: A
Quantitative Approach. 6th. San Francisco, CA, USA: Morgan Kaufmann Publishers
Inc., 2017. isbn: 0128119055.
[3] Fares Elsabbagh et al. “Vortex: OpenCL Compatible RISC-V GPGPU”. In: arXiv
preprint arXiv:2002.12151 (2020).
[4] url: https://circuitdigest.com/tutorial/what-is-fpga-introduction-and-programming-tools.
[5] Graham M. Holland. “Abstracting OpenCL for Multi-Application Workloads on CPU-FPGA Clusters”. In: 2019. url: https://api.semanticscholar.org/CorpusID:215864989.
[6] Muhammed Al Kadi et al. “General-purpose computing with soft GPUs on FPGAs”.
In: ACM Transactions on Reconfigurable Technology and Systems 11 (Jan. 2018),
pp. 1–22. doi: 10.1145/3173548.
[7] Giovanni Todaro et al. “Enhanced Soft GPU Architecture for FPGAs”. In: 2023
18th Conference on Ph.D Research in Microelectronics and Electronics (PRIME).
2023, pp. 177–180. doi: 10.1109/PRIME58259.2023.10161749.
[8] Xu Liu et al. “A hybrid GPU-FPGA based design methodology for enhancing ma-
chine learning applications performance”. In: Journal of Ambient Intelligence and
Humanized Computing 11 (2020), pp. 2309–2323.
[9] Zeke Wang et al. “FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs”.
In: 2022 USENIX Annual Technical Conference (USENIX ATC 22). Carlsbad,
CA: USENIX Association, July 2022, pp. 967–986. isbn: 978-1-939133-29-25. url:
https://www.usenix.org/conference/atc22/presentation/wang-zeke.
[10] Jeff Bush et al. “Nyami: A synthesizable GPU architectural model for general-
purpose and graphics-specific workloads”. In: ISPASS 2015 - IEEE International
Symposium on Performance Analysis of Systems and Software (Apr. 2015), pp. 173–
182. doi: 10.1109/ISPASS.2015.7095803.
[11] Ruobing Han et al. CuPBoP: CUDA for Parallelized and Broad-range Processors.
June 2022. doi: 10.48550/arXiv.2206.07896.
[12] Ruobing Han et al. COX: CUDA on X86 by Exposing Warp-Level Functions to
CPUs. Dec. 2021.
[13] Blaise Tine et al. Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics Research. Oct. 2021.
[14] Hyeonguk Jang et al. “Developing a Multicore Platform Utilizing Open RISC-V
Cores”. In: IEEE Access PP (Aug. 2021), pp. 1–1. doi: 10.1109/ACCESS.2021.
3108475.
[15] Matheus Cavalcante et al. “Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V
Vector Processor With Multiprecision Floating-Point Support in 22-nm FD-SOI”.
In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems PP (Dec.
2019), pp. 1–14. doi: 10.1109/TVLSI.2019.2950087.
[16] Graham M. Holland. “Abstracting OpenCL for Multi-Application Workloads on CPU-FPGA Clusters”. In: 2019. url: https://api.semanticscholar.org/CorpusID:215864989.
[17] Topi Leppänen et al. “Efficient OpenCL system integration of non-blocking FPGA
accelerators”. In: Microprocessors and Microsystems 97 (Jan. 2023), p. 104772. doi:
10.1016/j.micpro.2023.104772.
[18] Pekka Jääskeläinen et al. “pocl: A Performance-Portable OpenCL Implementation”. In: International Journal of Parallel Programming (Aug. 2014). doi: 10.1007/s10766-014-0320-y.
[19] Topi Leppänen et al. “Unified OpenCL Integration Methodology for FPGA De-
signs”. In: Oct. 2021. doi: 10.1109/NorCAS53631.2021.9599861.
[20] url: https://opae.github.io/0.13.0/docs/fpga_api/quick_start/readme.html.
[21] May 2019. url: https://educationecosystem.com/blog/enabling-fpgas-software-developers/.
[22] url: https://www.allaboutcircuits.com/technical-articles/multiplication-examples-using-the-fixed-point-representation/.
[23] Shadrokh Samavi. “Basic Multiplication Schemes”. In: Jan. 2014.
[24] Hassan Albalawi, Yuanning Li, and Xin Li. “Computer-Aided Design of Machine
Learning Algorithm: Training Fixed-Point Classifier for On-Chip Low-Power Imple-
mentation”. In: June 2014. doi: 10.1145/2593069.2593110.
[25] “IEEE Standard for Floating-Point Arithmetic”. In: IEEE Std 754-2019 (Revision
of IEEE 754-2008) (2019), pp. 1–84. doi: 10.1109/IEEESTD.2019.8766229.
[26] D.A. Patterson and J.L. Hennessy. Computer Organization and Design MIPS Edi-
tion: The Hardware/Software Interface. ISSN. Elsevier Science, 2013. isbn: 9780124078864.
url: https://books.google.com.eg/books?id=EVhgAAAAQBAJ.
[27] Alvaro Vazquez and Florent Dinechin. “Efficient implementation of parallel BCD
multiplication in LUT-6 FPGAs”. In: Dec. 2010, pp. 126–133. doi: 10.1109/FPT.
2010.5681767.
[28] url: https://www.telesens.co/2018/07/30/systolic-architectures/.
[29] url: https://semiengineering.com/dram-scaling-challenges-grow/.
[30] Saba Ahmadian et al. “ETICA: Efficient Two-Level I/O Caching Architecture for
Virtualized Platforms”. In: IEEE Transactions on Parallel and Distributed Systems
PP (Mar. 2021), pp. 1–1. doi: 10.1109/TPDS.2021.3066308.
[31] Onur Mutlu et al. “A modern primer on processing in memory”. In: Emerging
Computing: From Devices to Systems: Looking Beyond Moore and Von Neumann.
Springer, 2022, pp. 171–243.
[32] Mikhail Asiatici and Paolo Ienne. “Stop Crying over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs”. In: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2019, pp. 310–319.
[33] Blaise Tine et al. “Bringing OpenCL to Commodity RISC-V CPUs”. In: 2021 Workshop on RISC-V for Computer Architecture Research (CARRV). 2021.
[34] Haocong Luo et al. Ramulator 2.0: A Modern, Modular, and Extensible DRAM
Simulator. Aug. 2023.
[35] Shuai Che et al. “Rodinia: A benchmark suite for heterogeneous computing”. In:
Oct. 2009, pp. 44–54. doi: 10.1109/IISWC.2009.5306797.