
German International University

Faculty of Electrical Engineering

Implementing a RISC-V based GPU from scratch on FPGA

Bachelor Thesis

Author: Amr Khalid Salah Attia


Supervisor: Dr. Moheb Saad Zaghloul Mekhail
Submission Date: 18/01/2024
This is to certify that:

(i) The thesis comprises only my original work toward the Bachelor’s Degree

(ii) Due acknowledgment has been made in the text to all other material used

Amr Khalid Salah Attia


18/01/2024
Contents

1 Introduction and Literature Review 5


1.1 General Introduction and Overview of the Topic . . . . . . . . . . . . . . . 5
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 FPGA Internal Components . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 GPU Architecture 17
2.1 GPU Extension Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.1 Extending RISC-V ISA . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.2 GPU Hardware Extension . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 RTL implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Decode Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Issue Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Data Hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Dynamic Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.3 Scoreboard Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Register File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Execute Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Arithmetic Logic Unit (ALU) . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Floating Point Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.3 Load Store Unit (LSU) . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.4.4 Control Status Register (CSR) Unit . . . . . . . . . . . . . . . . . . 64
2.4.5 Convolution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.5 Commit Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.5.1 Writeback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.5.2 Popcount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.6 Cache Sub-System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.6.2 Cache Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

3 AXI-Controller 111
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.2 AXI Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.2.1 Read Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.2.2 Write Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

4 Software Testing Tools Flow 115
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.2 Ramulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.3 Cycle-accurate simulation using Verilator . . . . . . . . . . . . . . . . . . 116
4.4 OpenCL Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.5 Models of the OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.1 Platform Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.2 Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.3 Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.6 The Flow of OpenCL Host Application . . . . . . . . . . . . . . . . . . . . 119
4.7 POCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.8 GPU RTL Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.9 Testing Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

5 Results and Evaluation 127


5.1 RTL Design Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2.2 Sort Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2.3 Convolution Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 135
5.2.4 Sgemm Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.2.5 Saxpy Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.2.6 NearestNeighbor Benchmark . . . . . . . . . . . . . . . . . . . . . . 142
5.2.7 Gaussian Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.2.8 Comparison with other architectures . . . . . . . . . . . . . . . . . 145

6 Conclusion and Future Work 149


6.1 Summary of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.2.1 Cache Subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.2.2 Graphics Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

A GPGPU Instruction Set Architecture (ISA) 153


List of Figures

1.1 Increase of the static power consumption with decreasing the channel length [1]. 5
1.2 The frequency and power trends over the years [2]. . . . . . . . . . . . . . . . 6
1.3 SIMD architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 The different instruction formats of RISC-V [3] . . . . . . . . . . . . . . . . . 10
1.5 FPGA components [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 FPGA CLB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 The process of compiling OpenCL kernels and dumping the resulting binaries to
the target FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8 Abstracting and managing FPGAs [21]. . . . . . . . . . . . . . . . . . . . . . 15

2.1 High-level overview of the GPGPU architecture. . . . . . . . . . . . . . . . . 17


2.2 R-type instruction encoding. . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 R4-type instruction encoding. . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Source registers extension via Inputs Merging. . . . . . . . . . . . . . . . . . . 19
2.5 Overloading of function bits to extend the number of source operands. . . . . . 20
2.6 Example of an intrinsic function. . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Fetch stage architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 The process of selecting the next warp depending on different masks. . . . . . . 22
2.9 Divergent Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10 Taken and Not Taken Masks. . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.11 Predicate Mask. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.12 Thread Mask. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.13 Architecture of GPU extension unit. . . . . . . . . . . . . . . . . . . . . . . . 26
2.14 Output signals when the input instruction is TMC. . . . . . . . . . . . . . . . 26
2.15 Output signals when the input instruction is Wspawn. . . . . . . . . . . . . . 27
2.16 Output signals when the input instruction is Split. . . . . . . . . . . . . . . . 27
2.17 Fetch-decode handshake. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.18 Pipeline data hazard. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.19 In-order vs. Out-of-order Dispatch. . . . . . . . . . . . . . . . . . . . . . . . 31
2.20 Scoreboard unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.21 Implementation of the Register File using BRAMs. . . . . . . . . . . . . . . 33
2.22 Overview of ALU structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.23 Flowchart of ALU control unit. . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.24 Shift and Add multiplication steps [22]. . . . . . . . . . . . . . . . . . . . . . 36
2.25 Multiplier Circuit [23]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.26 Flowchart of restoring division algorithm. . . . . . . . . . . . . . . . . . . . . 37
2.27 The states in a 2-bit prediction scheme. . . . . . . . . . . . . . . . . . . . . . 38
2.28 The multiplication operation on the ALU with 4 active threads. . . . . . . . . . 38


2.29 The division operation on the ALU with 4 active threads. . . . . . . . . . . . . 39


2.30 The remainder operation on the ALU with 4 active threads. . . . . . . . . . . . 39
2.31 The Branch Greater than or Equal (BGE) instruction operation where all threads within
the warp are active. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.32 The anatomy of the fixed point number [24]. . . . . . . . . . . . . . . . . . . . 40
2.33 The IEEE 754 representation [26]. . . . . . . . . . . . . . . . . . . . . . . . . 42
2.34 Flowchart of floating point multiplication. . . . . . . . . . . . . . . . . . . . . 46
2.35 Flowchart of floating point addition. . . . . . . . . . . . . . . . . . . . . . . . 47
2.36 The high-level overview of FPU. . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.37 RTL design of the Floating point multiplier. . . . . . . . . . . . . . . . . . . . 51
2.38 FMA module without pipelining . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.39 FMA module with pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.40 Python script to generate random numbers in IEEE-754 floating point format . 54
2.41 RTL simulation of multiplication of two floating point vectors . . . . . . . . . . 54
2.42 The output produced from the FPU vs the actual output . . . . . . . . . . . . 55
2.43 The error percentage produced from the multiplication . . . . . . . . . . . . . 55
2.44 RTL simulation of Addition of two floating point vectors . . . . . . . . . . . . 56
2.45 The output produced from the FPU vs the actual output . . . . . . . . . . . . 56
2.46 The error percentage produced from the addition . . . . . . . . . . . . . . . . 57
2.47 RTL simulation of Subtraction of two floating point vectors . . . . . . . . . . . 57
2.48 The output produced from the FPU vs the actual output . . . . . . . . . . . . 58
2.49 The error percentage produced from the subtraction . . . . . . . . . . . . . . . 58
2.50 RTL simulation of converting IEEE-754 numbers to the nearest integer . . . . . 59
2.51 Bypass of non-cacheable addresses to the main memory . . . . . . . . . . . . . . 60
2.52 The Load/Store Unit Architecture. . . . . . . . . . . . . . . . . . . . . . . . 61
2.53 Memory Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.54 CSR extended instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.55 2D convolution code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.56 A high-level overview of a traditional 3x3 systolic array. . . . . . . . . . . . . . 67
2.57 Poor utilization of systolic array resources [28]. . . . . . . . . . . . . . . . . . 68
2.58 Overview of convolution unit in the GPU. . . . . . . . . . . . . . . . . . . . . 70
2.59 Kogge–Stone adder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.60 Reducing the number of commits relative to the normal architectures. . . . . . 71
2.61 The resource utilization of convolution unit with different threads configuration. 72
2.62 Result of the execution of convolution on the GPU. . . . . . . . . . . . . . . . 72
2.63 Writeback module architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.64 Structure of SRAM cell. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.65 Single memory cell and array [29]. . . . . . . . . . . . . . . . . . . . . . . . . 75
2.66 Comparison of different storage devices in terms of cost per GB and performance
(IOPS) [30]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.67 Modern Memory Hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.68 Pseudo-code for matrix multiplication. . . . . . . . . . . . . . . . . . . . . . 78
2.69 The memory access pattern of the different array elements: white means not yet
touched, light means older accesses, and dark means newer accesses. . . . . . . . 78
2.70 Cache Blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.71 Different logical cache organizations. . . . . . . . . . . . . . . . . . . . . . . . 80
2.72 2-way set associative cache: Blocks with the same index can map to 2 locations. 80
2.73 4-way set associative cache: Blocks with the same index can map to 4 locations. 81

2.74 Fully associative cache: Any block can map to any location in the cache. . . . . 81
2.75 Handling write requests in Write-through cache. . . . . . . . . . . . . . . . . 83
2.76 Handling write requests in Write-back cache. . . . . . . . . . . . . . . . . . . 84
2.77 The Cache Coherence Problem. . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.78 The Cache Coherence Problem. . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.79 The FSM of snoop protocol from the CPU side. . . . . . . . . . . . . . . . . . 87
2.80 The FSM of snoop protocol from the Bus side. . . . . . . . . . . . . . . . . . 87
2.81 Shared memory among different cores. . . . . . . . . . . . . . . . . . . . . . . 89
2.82 The Cache Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.83 Anatomy of the memory address. . . . . . . . . . . . . . . . . . . . . . . . . 91
2.84 The problem of bank conflict. . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.85 Pseudo code of the virtual port assignment. . . . . . . . . . . . . . . . . . . . 92
2.86 RTL simulation of bank selector core, when the number of requests is equal to
the available banks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
2.87 RTL simulation of bank selector core, when the number of requests is higher
than the available banks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
2.88 RTL simulation of bank selector core, when two requests are sent to the cache
without enabling multi-porting. . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.89 RTL simulation of bank selector core, when two requests are sent to the cache
with the enabling of multi-porting. . . . . . . . . . . . . . . . . . . . . . . . . 94
2.90 Dual-port RAM structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
2.91 Single-port RAM structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
2.92 RTL simulation of dual-port RAM when initialization is enabled. . . . . . . . . 98
2.93 RTL simulation of dual-port RAM when BYTEEN parameter is equal to 4. . . 98
2.94 Hit under miss scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2.95 Miss under miss scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2.96 Integration of the miss reservations with the cache memory. . . . . . . . . . . . 100
2.97 The Lookup operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.98 Comparator Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
2.99 Access of Four-way interleaved cache banks using block addressing. . . . . . . . 103
2.100 The stages of accessing the cache bank. . . . . . . . . . . . . . . . . . . . . . 103
2.101 High-level overview of the cache bank structure. . . . . . . . . . . . . . . . . . 104
2.102 RTL simulation of cache bank 1. . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.103 RTL simulation of cache bank 2. . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.104 Find First Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
2.105 Pseudo-code of response merging algorithm. . . . . . . . . . . . . . . . . . . . 107
2.106 Cache Arbiter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.107 Logical structure of round-robin arbiter. . . . . . . . . . . . . . . . . . . . . . 109
2.108 RTL simulation of Round-Robin scheduler. . . . . . . . . . . . . . . . . . . . 109

3.1 Overview of the AXI-Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 112


3.2 The read process in AXI protocol . . . . . . . . . . . . . . . . . . . . . . . . 113
3.3 The write process in AXI protocol . . . . . . . . . . . . . . . . . . . . . . . . 114

4.1 Example of generic code used by Ramulator. . . . . . . . . . . . . . . . . . . . 116


4.2 OpenCL Platform Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.3 The arrangement of kernels NDRange space. . . . . . . . . . . . . . . . . . . . 118
4.4 The hierarchical structure of OpenCL memory. . . . . . . . . . . . . . . . . . 119
4.5 The program flow of OpenCL application. . . . . . . . . . . . . . . . . . . . . 121

4.6 The process of compiling OpenCL applications to binaries that target RISC-V
system using POCL compiler. . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.7 The overview of the configuration used to perform cycle-accurate simulation. . . 123
4.8 OpenCL kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.9 Code for initializing the OpenCL applications. . . . . . . . . . . . . . . . . . . 125

5.1 Timing Constraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127


5.2 Power consumption of one core. . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.3 Power consumption of eight cores. . . . . . . . . . . . . . . . . . . . . . . . . 128
5.4 Layout of GPGPU with 8 cores. . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.5 Hardware utilization of each module. . . . . . . . . . . . . . . . . . . . . . . . 129
5.6 IPC results of merge sort algorithm for different GPU configurations with a
constant workload of 32 elements. . . . . . . . . . . . . . . . . . . . . . . . . 131
5.7 IPC results of merge sort algorithm for different workload sizes. . . . . . . . . . 131
5.8 The count of clock cycles needed to sort various arrays with diverse workloads. . 132
5.9 The execution of psort algorithm on the GPGPU with 1 core and workload of
12 elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.10 IPC results of p-sort algorithm for different GPU configurations with a constant
workload of 32 elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.11 IPC results of p-sort algorithm for different workload sizes. . . . . . . . . . . . 133
5.12 The count of clock cycles needed to sort various floating-point arrays with diverse
workloads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.13 IPC results of Convolution algorithm for different GPU configurations with a
constant workload of 16x16 input Matrix and 3x3 filter. . . . . . . . . . . . . . 135
5.14 Execution time in microseconds of Convolution algorithm for different GPU con-
figurations with a constant workload of 16x16 input Matrix and 3x3 filter. . . . 135
5.15 IPC results of Convolution algorithm for different core counts configurations
with different workload sizes and constant 3x3 filter. . . . . . . . . . . . . . . . 136
5.16 Execution time in milliseconds of Convolution algorithm for different core counts
configurations with different workload sizes and constant 3x3 filter. . . . . . . . 136
5.17 Sgemm benchmark code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.18 IPC results of Sgemm benchmark for different GPU configurations with a con-
stant workload of floating-point 64x64 matrix . . . . . . . . . . . . . . . . . . 137
5.19 Execution time in milliseconds of Sgemm benchmark for different GPU configu-
rations with a constant workload of floating-point 64x64 matrix . . . . . . . . . 138
5.20 IPC results of Sgemm benchmark for different core counts with different matrix
sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.21 Execution time in milliseconds of Sgemm benchmark for different core counts
with different matrix sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.22 Saxpy benchmark code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.23 IPC results of Saxpy benchmark for different GPU configurations with a constant
workload of floating-point array with 4096 elements. . . . . . . . . . . . . . . . 140
5.24 IPC results of Saxpy benchmark for different core counts with different array sizes. 140
5.25 Execution time in milliseconds of Saxpy benchmark for different core counts with
different array sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.26 The K-NN kernel code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.27 The execution of the K-NN algorithm to find the 4-nearest hurricanes to a loca-
tion of a longitude equal to 60 degrees and latitude of 20 degrees. . . . . . . . . 142

5.28 IPC results of executing K-NN algorithm to find the 5 nearest neighbors for
different cores counts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.29 Execution time in milliseconds of K-NN algorithm to find the 5 nearest neighbors
for different cores counts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.30 The execution of Gaussian elimination kernel on the GPU . . . . . . . . . . . . 144
5.31 IPC results of Gaussian elimination algorithm for different GPU configurations
with a constant workload of 4x4 matrix. . . . . . . . . . . . . . . . . . . . . . 144
5.32 Comparison between the execution time of K-NN algorithm on GPGPU with 8
cores and Intel Xeon E5-1640. . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.33 Comparison between the execution time of convolution on GPGPU with 8 cores,
4 cores, and Intel Xeon E5-1640. . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.34 GPGPU speedup over Intel Xeon E5-1650. . . . . . . . . . . . . . . . . . . . . 146
5.35 Comparison between the execution time of convolution on GPGPU with 8 cores,
4 cores, and TPU v2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.36 GPGPU speedup over TPU v2. . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.37 Comparison between the execution time of convolution on GPGPU with 8 cores,
and GPU Tesla-4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6.1 The host flow of OpenCL application on the design environment. . . . . . . . . 150

A.1 Instruction types supported by the GPGPU. . . . . . . . . . . . . . . . . . . . 153


A.2 Arithmetic Instructions supported by the GPGPU. . . . . . . . . . . . . . . . 154
A.3 Logical Instructions supported by the GPGPU. . . . . . . . . . . . . . . . . . 154
A.4 Shift Instructions supported by the GPGPU. . . . . . . . . . . . . . . . . . . 154
A.5 Conditional Branch Instructions supported by the GPGPU. . . . . . . . . . . . 154
A.6 Unconditional Branch Instructions supported by the GPGPU. . . . . . . . . . 154
A.7 Data transfer Instructions supported by the GPGPU. . . . . . . . . . . . . . . 155
A.8 Floating-point arithmetic Instructions supported by the GPGPU. . . . . . . . . 155
A.9 Floating-point comparison Instructions supported by the GPGPU. . . . . . . . 155
A.10 Floating-point conversion Instructions supported by the GPGPU. . . . . . . . . 155
A.11 Floating-point data transfer Instructions supported by the GPGPU. . . . . . . 155
A.12 Control Status Register (CSR) Instructions supported by the GPGPU. . . . . . 156
A.13 Extended instructions to support multi-threading. . . . . . . . . . . . . . . . . 156
List of Tables

2.1 Some GPU instructions that require more than 3 operands. . . . . . . . . . 18


2.2 Different processing units. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 Rounding mode encoding according to RISC-V ISA. . . . . . . . . . . . . . 42
2.4 Comparison between different binary interchange formats. . . . . . . . . . 44
2.5 Exception flags encoding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6 Special cases for binary single-precision floating-point numbers. . . . . . . 45
2.7 Rounding Rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.8 Rounding Decisions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.9 Instructions supported by FPU. . . . . . . . . . . . . . . . . . . . . . . . . 50
2.10 Rounding of different floating point numbers to the nearest integer . . . . . 59
2.11 Control and Status Registers Map . . . . . . . . . . . . . . . . . . . . . . 65
2.12 Definition of different states present in the FSM of snooping protocol. . . . 88
2.13 Requests sent by different threads to the cache. . . . . . . . . . . . . . . . 92
2.14 Two requests sent to the cache with the same addresses. . . . . . . . . . . 92
2.15 Parameters description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.16 Requests sent to the cache bank. . . . . . . . . . . . . . . . . . . . . . . . 105
2.17 Responses from the main memory. . . . . . . . . . . . . . . . . . . . . . . . 105

3.1 Caption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.1 FPGA resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127


5.2 Hardware synthesis results for different numbers of cores. . . . . . . . . . . . . 127

Acknowledgements

First and foremost, I would like to express my heartfelt appreciation to my family,


who have supported me throughout this journey. Their love, care, and encouragement
have motivated me to pursue my dreams, and I am truly grateful to have them in my life.

Secondly, I would like to express my deep appreciation to my supervisor, Dr. Moheb


Mekhail, for his consistent and unwavering support during my bachelor’s thesis. I am
sincerely grateful for his exceptional expertise, invaluable guidance, continuous encour-
agement, and significant contributions, all of which have had a profound impact on the
progress of this thesis. The constructive feedback and valuable suggestions he has pro-
vided have played a crucial role in elevating the overall quality of my work.

Thirdly, I would like to acknowledge all my lecturers, teaching assistants, and col-
leagues who have contributed to my knowledge in the field of engineering throughout the
years. I am indebted to the management of GIU for creating an environment that fosters
learning, inspiration, and personal growth, thereby enriching my academic experience.

To those who are reading this, I express my gratitude for your time. I hope that you
find the information you seek and gain valuable insights. Your interest and attention to
my research topic are greatly appreciated, and I welcome any comments or questions you
may have.

Abstract

As we enter the post-Moore’s Law era, the utilization of graphics processing units
(GPUs) extends beyond their traditional role in graphics-intensive applications like games.
Nowadays, GPUs are widely employed in sophisticated domains such as big data servers,
machine learning, and medical applications. The rationale behind using GPUs in these
contexts is twofold. Firstly, as we approach the limits of Moore’s Law, where the fre-
quency cannot be easily doubled every 18 months, we encounter performance limitations
over time. Secondly, in the post-Moore era, utilizing multi-processor systems presents
challenges due to the complexity of interconnects between processors and the significant
area they occupy, which can reach up to 40 percent of the total silicon die area. Conse-
quently, GPUs have emerged as a solution for data-intensive applications that demand
high throughput. Despite the significance and popularity of GPUs, open-source GPU
options are scarce in the market, which is predominantly dominated by Nvidia, whose
GPUs are not suitable for low-power embedded devices requiring computational power for
image processing tasks such as convolution, pattern recognition, and face detection.

This project aims to develop a general-purpose GPU based on the RISC-V architec-
ture. The selection of the RISC-V architecture is motivated by several factors. Firstly,
its simple design makes it an excellent fit for this project, as the GPU will be designed
for small devices where area and power considerations are critical. Secondly, the RISC-V
instruction set enjoys widespread popularity worldwide and is supported by numerous
open-source tools. Lastly, designing the GPU based on RISC-V ensures independence
from vendor licensing, providing greater flexibility and adaptability.

Chapter 1

Introduction and Literature Review

1.1 General Introduction and Overview of the Topic


For many years, the development of the semiconductor industry was driven by
Moore's law, which predicted that the number of transistors on a chip would double
roughly every 18 months as the MOSFET channel length shrank with every new technology
node. This scaling enabled the frequency of processors on the market to roughly double
every 1.5 years. The trend continued until the early 21st century.

Figure 1.1: Increase of the static power consumption with decreasing the channel length [1].

As the scaling down of the MOSFET channel length continued, the submicron era was
reached. In this regime, some of the assumptions about MOSFET properties that made
them dominate other transistor types in digital circuits, such as BJTs, are no longer valid.
One of the most important assumptions that makes MOSFETs suitable for low-power
applications is that they draw essentially zero gate current. Engineers and physicists based
this assumption on the thick silicon-oxide layer at the MOSFET gate, which acts as an
insulator. However, due to the aggressive scaling of the MOSFET, the thickness of the gate
oxide has been reduced dramatically, until it has reached the diameter of only a few silicon
atoms. Therefore, the assumption that the MOSFET gate acts as an insulator is no longer
valid. The thin gate oxide leads to the phenomenon of quantum tunneling, a quantum-
mechanical effect that appears at the nanoscale. It has no observable impact on long-channel
transistors; however, it is very serious in today's short-channel transistors. A thin gate
oxide increases the probability that electrons tunnel through it, which reduces the degree
to which the gate terminal can control channel formation between source and drain. This
makes it increasingly difficult to achieve a proper and reliable off-state in MOSFETs, as the
leakage current prevents complete cutoff. For many years, the static power of CMOS logic
was assumed to be negligible compared to the dynamic power consumption. This is no
longer the case in sub-micron CMOS technology. In fact, many researchers have
demonstrated that, because of sub-threshold leakage current caused by the shrinking of the
transistor's dimensions, static power dissipation increases to the point where it can surpass
the dynamic power consumption. This trend is illustrated in Figure 1.1.

There is always a direct correlation between frequency and power: increasing the
frequency of a digital circuit requires more power to be consumed. In sub-micron
technologies, increasing the frequency not only increases the dynamic power consumption,
it also increases the static power consumed due to leakage current. This leads to what is
known as the frequency wall, where performance cannot be increased by simply raising the
frequency, because the dynamic power is proportional to the switching frequency according
to Eq. (1):
P = (1/2) · C · V² · F        (1)
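For illustration only (the capacitance and voltage values below are hypothetical and are not
taken from the design in this thesis), Eq. (1) shows how quickly dynamic power grows with
frequency and supply voltage:

    P = 0.5 · (1 nF) · (1.0 V)² · (1 GHz) = 0.5 W
    P = 0.5 · (1 nF) · (1.2 V)² · (2 GHz) = 1.44 W

Doubling the frequency alone doubles the dynamic power, and the higher supply voltage that
is usually required to sustain the higher frequency multiplies it further by the square of the
voltage increase.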

Figure 1.2: The frequency and power trends over the years [2].

Figure 1.2 shows that processor frequencies have saturated over the years due to the
power wall; moreover, increasing the frequency makes it very difficult to remove the heat
generated in the processors as a result of the power consumed. According to Eq. (1), power
grows with the switching frequency and with the square of the supply voltage, which makes
it very hard for mobile devices that depend on batteries for their power supply to reach high
frequencies. For these reasons, the trend of increasing frequency to obtain higher
performance is no longer viable, especially for mobile devices. Therefore, processor designers
are focusing on other techniques to increase performance. One of these techniques is to
divide the work between several processors, which increases the total throughput without
increasing the frequency. The trend of using
multi-processor chips has its drawbacks. One of the main drawbacks is the complexity of
scaling the number of processors in one chip due to the limitations of power consumption
and the increasing complexity of designing interconnects between different processors in
one chip and the huge area required by these interconnects. The second major draw-
back addressed by [2] is the high amount of latency needed to perform communication
between different processors, which can reach up to 35-50 clock cycles between the pro-
cessors within the same chip, or up to 100 cycles between off-chip processors. The most
difficult challenge in designing multi-core systems is the design of the memory hierarchy.
The complexity of the memory system in a multi-core chip lies in maintaining memory
consistency when a large number of processors can access the same memory resources.
Shared memory allows the different processors to communicate with each other and
exchange information by writing to and reading from the same memory locations. However,
there is a critical problem: each processor has its own local cache memory, which it writes to
first. The value in the cache is not written back to the shared memory until a miss occurs or
the block is evicted because the cache is full. In the meantime, the processor can update the
variable in its local copy, leaving an older version of the variable's value in the shared
memory. This can cause calculation errors if other processors depend on this data. A
possible solution to this problem is to use a cache coherence protocol, which ensures that all
processors observe a consistent copy of the data in their local memories. This is achieved by
the processors communicating with each other and invalidating or updating cached copies of
data when necessary. However, implementing a cache coherence protocol can be expensive
in both hardware and development cost, especially when the number of processors on one
chip is high.
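To make the hazard concrete, the following C sketch is purely illustrative: it does not model
any specific coherence protocol, and on real coherent hardware the scenario described in the
comments is handled transparently. It shows a sharing pattern in which one core publishes
data that another core polls for; on a hypothetical multi-core system without cache
coherence, core 1 could keep spinning on a stale cached copy of flag, or observe flag = 1
while still reading the old value of data.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared variables: each core keeps its own cached copy. */
    volatile int data = 0;
    volatile int flag = 0;

    /* Runs on core 0: the writes land in core 0's private cache first;
     * main memory (and core 1's cache) may still hold the old values. */
    static void *producer(void *arg) {
        data = 42;
        flag = 1;
        return NULL;
    }

    /* Runs on core 1: without coherence it may spin forever on a stale
     * copy of 'flag', or see 'flag' = 1 but still read the old 'data'. */
    static void *consumer(void *arg) {
        while (flag == 0)
            ;
        printf("consumer read data = %d\n", data);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }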

Today there are many applications, such as graph analytics, machine learning
acceleration, and games, that are both performance- and memory-intensive. There is
therefore a great demand for high-performance architectures that can execute these kinds of
applications with the lowest possible power consumption, and engineers and designers favor
architectures that deliver maximum performance without consuming a great amount of
power. Traditional pipelined in-order processors cannot deliver high performance without
hitting the frequency wall. Multi-processor systems, on the other hand, can produce high
performance without excessive frequency, but they suffer from the design complexity and
the high power consumption associated with scaling the number of processors on one chip.
SIMD (Single Instruction Multiple Data) architectures, shown in Figure 1.3, can provide a
much larger performance gain than conventional processors without the high complexity of
multi-core systems. In contrast to conventional processors, which contain only one execution
unit and one load-store unit and therefore lose a lot of performance to stalls caused by the
long latency of memory accesses, SIMD architectures have multiple lanes, each with its own
execution unit, a heavily pipelined load-store unit (LSU) to hide the latency of memory
access, and multiple general-purpose register (GPR) files so that each lane can be handled
individually. SIMD architecture is a good candidate for executing multimedia or machine
learning applications, where the same operations are performed on a large batch of
independent data. To hide the latency resulting from cache misses, SIMD processors use
multiple lanes to keep the pipeline busy; this is not the case in conventional processors,
which may waste hundreds or even thousands of cycles waiting for the data to arrive from
the cache. To amortize the cost of accessing memory, SIMD processors usually use heavily
pipelined load-store units, which means the program pays the latency of a vector load or
store operation only once, as the remaining latency is hidden behind the pipelined access of
the first request. In conventional processors, a lot of power is consumed by the overhead of
accessing the instruction cache; this power can be reduced in SIMD processors by having
one fetch unit for all lanes within the processor. If the processor has four lanes, each lane
executes the same instruction on different data, so only one instruction is fetched from the
instruction cache instead of four cache accesses in a traditional processor, which reduces the
fetch and decode overhead.

Figure 1.3: SIMD architecture.
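As a purely illustrative sketch (generic C, not code from this thesis), the element-wise
addition below contrasts the two execution models: a conventional processor fetches and
executes one add per element, whereas a 4-lane SIMD core fetches the add once per group of
four elements and lets every lane apply it to its own slice of the data.

    /* Scalar version: one instruction fetch and one add per element. */
    void add_scalar(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* What a 4-lane SIMD core effectively does: the add is fetched and
     * decoded once per group of four elements, and all four lanes apply
     * it to different data (n is assumed to be a multiple of 4 here).  */
    void add_simd4(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i += 4)
            for (int lane = 0; lane < 4; lane++)  /* lanes run concurrently in hardware */
                c[i + lane] = a[i + lane] + b[i + lane];
    }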

Graphics Processing Units (GPUs) mainly consist of multiple SIMD cores, together with
a warp scheduler that assigns each warp to a certain SIMD core. GPUs depend primarily on
multi-threading to hide long memory latencies. GPUs have been used for a long time to
execute graphics applications, mainly those requiring real-time performance and minimal
latency. However, in recent years, with the need for high-performance architectures, there
has been a shift toward using GPUs as a general-purpose platform to execute real-time and
high-performance applications, as they offer high throughput with lower energy consumption
compared to other architectures. The advancement of portable languages that have a
C-based syntax and can run on heterogeneous systems has made it easier to integrate GPUs
with host processors in many systems without worrying about compatibility between them.
In addition, languages such as CUDA and OpenCL make it easier for the average
programmer to write code that targets a GPU platform without requiring in-depth
knowledge of the underlying hardware. The use of these languages also facilitates what is
known as heterogeneous computing systems. These systems comprise a combination of
different architectures, such as a conventional processor for traditional applications and a
GPU to accelerate multimedia applications. The development of such languages enables
programmers to use a single language to write programs for these systems, increasing
efficiency and reducing time to market. Recently, GPUs have not only been used in personal
computers and data centers but are also employed in embedded-systems applications, where
microcontrollers are now utilized to perform demanding workloads, including complex
image-processing tasks such as face identification and pattern recognition. Many of these
devices operate on low-power supplies, in contrast to personal computers or data centers
that use high-power supplies, which makes it more challenging to design a GPU that can be
integrated with embedded devices, especially in the IoT industry, where some applications
rely on batteryless devices. The aim of this project is to develop a low-power general-purpose
graphics processing unit suitable for embedded-systems applications, based on the RISC-V
architecture.

1.2 Background
Despite the fact that GPUs are widely utilized as accelerators in different applications
in personal computers and data centers, there is a very limited number of open-source GPU
architectures on the market, which limits their usage in many potential applications. The
fact that most of the GPU market share belongs to Nvidia makes it very hard to integrate
other GPUs into existing projects, because most of the tools and languages, such as CUDA,
are designed specifically for Nvidia GPUs. On the other hand, Nvidia is known for
manufacturing large GPUs that consume relatively high power and occupy a significant
amount of die area. These GPUs are not suitable for low-power embedded applications that
need an accelerator to execute computationally intensive tasks like matrix multiplication,
Gaussian elimination, and image processing. Another important issue is the potential
licensing problems due to political conflicts between countries, which could impact platform
production. For the previous reasons, there is a demand for open-source GPUs that are not
vendor-dependent and can easily be integrated with other systems to be used as
accelerators. However, designing an open-source GPU with the previous specifications is not
a trivial task, for several reasons:

• To develop programs effectively on open-source GPUs, a set of tools including a
compiler, assembler, and debugger compatible with the new architecture is necessary.
However, there are already widely recognized tools in the market that support other
architectures like x86 or ARM. These tools have been in the market for decades,
and many software stacks rely on them. Consequently, developing a new architecture
that does not rely on these established tools is extremely difficult. Vendors
cannot simply replace their mature software stack with a different one. Moreover,
developing new software tools that support the new architecture would take several
years, and the resulting tools may not be as efficient as those already existing in the
market.

• The huge time and very high initial cost of the ASIC (Application-Specific Integrated
Circuit) design flow, which is not practical for an open-source design. For this reason,
many designers tend to simulate their architectures using intermediate-language
simulations. Although these simulations are easy and fast, they do not model some
important aspects of the underlying hardware, such as clock domain crossing problems
and cache-system latency.

To solve the critical problem of tool support, the GPU implemented in the current
project is built mainly on top of the RISC-V (Reduced Instruction Set Computer)
architecture [3]. RISC-V is an open-source ISA that was developed at the University of
California, Berkeley, starting around 2010, building on the original Berkeley RISC projects
of the 1980s. RISC-V is widely known for its simple but efficient architecture, which yields
significantly lower power consumption than other architectures such as x86. The most
important reason for utilizing RISC-V in this project is the availability of numerous tools in
the market that support the RISC-V architecture, such as the riscv-gnu-toolchain. In
addition, the RISC-V ISA supports a wide range of instruction formats, such as I-type,
R-type, J-type, and S-type, and provides several free opcodes that can be used to create
user-defined instructions, which are exploited in the process of extending RISC-V into a
general-purpose GPU.

Figure 1.4: The different instruction formats of RISC-V [3].
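For reference, the field positions of the base 32-bit encoding shown in Figure 1.4 can be
extracted as follows. This helper is illustrative only (standard RV32 bit layout) and is not
part of the thesis code base.

    #include <stdint.h>

    /* Common fields of a 32-bit RISC-V instruction word. R-type uses all
     * of them; I-, S- and J-type reuse the same positions for opcode,
     * rd/rs1/rs2 and funct3 and pack immediates into the remaining bits. */
    typedef struct {
        uint32_t opcode;  /* bits  6:0  */
        uint32_t rd;      /* bits 11:7  */
        uint32_t funct3;  /* bits 14:12 */
        uint32_t rs1;     /* bits 19:15 */
        uint32_t rs2;     /* bits 24:20 */
        uint32_t funct7;  /* bits 31:25 */
    } rv32_fields_t;

    static rv32_fields_t rv32_decode(uint32_t insn) {
        rv32_fields_t f;
        f.opcode = insn         & 0x7F;
        f.rd     = (insn >> 7)  & 0x1F;
        f.funct3 = (insn >> 12) & 0x07;
        f.rs1    = (insn >> 15) & 0x1F;
        f.rs2    = (insn >> 20) & 0x1F;
        f.funct7 = (insn >> 25) & 0x7F;
        return f;
    }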



1.2.1 FPGA Internal Components

FPGAs (Field Programmable Gate Arrays) are silicon chips that contain an array of
configurable logic blocks (CLBs), as shown in Figure 1.5. They are popular among digital
designers because these CLBs can be configured and interconnected to construct many
different types of circuits.

Figure 1.5: FPGA components [4]

1.2.1.1 Configurable Logic Blocks (CLBs)

The CLB is the basic building block of the FPGA; using it, the user can implement
nearly any logic function in hardware. CLBs also provide memory and synchronization
functions when connected through the FPGA's routing resources. A CLB consists of the
following components: lookup tables (LUTs), flip-flops (FFs), and multiplexers (MUXs), as
shown in Figure 1.6.

Figure 1.6: FPGA CLB.
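As a small illustration of how a LUT realizes arbitrary logic (a generic example, not tied to
any specific FPGA family), an n-input LUT is simply a 2^n-entry truth table addressed by
the input bits. The snippet below emulates a 4-input LUT configured as a 4-input XOR.

    #include <stdint.h>

    /* A 4-input LUT is a 16-bit truth table; the inputs form the index. */
    static int lut4(uint16_t truth_table, int a, int b, int c, int d) {
        int index = (a << 3) | (b << 2) | (c << 1) | d;
        return (truth_table >> index) & 1;
    }

    /* Example configuration: truth table of a 4-input XOR (bit i is set
     * when index i contains an odd number of 1s), i.e. 0x6996.
     * For instance, lut4(0x6996, 1, 0, 1, 1) returns 1.                 */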



1.3 Aim
The purpose of this project is to design a full-stack general-purpose graphics processing
unit (GPGPU), covering both the hardware design and the supporting software stack. The microarchitecture
of the GPU is based on the RISC-V processor but with some changes to its instruction
set architecture (ISA) to support multi-threading. The GPU also has a multi-banked
non-blocking cache that provides the high bandwidth needed to run multiple threads on
the GPU. The software platform for this project is OpenCL, which is one of the most
popular programming languages for developing applications on different GPU platforms.
To provide a runtime environment for OpenCL applications, a compiler called Portable
Computing Language (POCL) is used to convert OpenCL kernels to executable binaries
that can run on the RISC-V processor [5]. Then, this binary is transferred to the FPGA
via shared memory that can be accessed by both the host processor and the FPGA as
shown in Figure 1.7.

Figure 1.7: The process of compiling OpenCL kernels and dumping the resulting binaries to the
target FPGA.

1.4 Literature Review


A flexible and configurable architecture for a soft general-purpose GPU is presented,
which supports the execution of both fixed-point and floating-point operations. This soft
GPU has an advantage over GPUs implemented directly in silicon: the ability to adapt to
various applications and requirements. The presented architecture outperforms the
MicroBlaze processor provided by Xilinx in terms of performance and energy efficiency. On
average, the soft processor performs IEEE-754 floating-point computations 2.9x faster than
the MicroBlaze processor with 11.2x lower energy consumption. It also shows an average
speedup of 4x over the ARM Cortex-A9 in executing floating-point operations [6].

OpenCL is a renowned portable language that can be run on heterogeneous systems.


In order to execute OpenCL kernels on the FPGA, two major changes to the software
stack have to be made:
the software stack have to be made:

1. Design and add local memory space to the memory hierarchy.

2. Provide a mechanism for synchronizing different threads on the GPU.

These two modifications increase resource usage without affecting the critical timing path
too much [7].

In order to train the machine learning models properly and take advantage of the dense
computing capabilities of the FPGA, the workload is divided into two parts:

1. The GPU side will be used for training purposes.

2. The FPGA side is used for model inference.

In order to provide a smooth transition between the two sides, a model converter is
deployed for this task. TensorFlow is used to train the models on the GPU and make the
best use of the GPU resources, and the CUDA framework is then used to convert the
TensorFlow code into executables that can run on the FPGA side. In other words, the
model converter acts as a mapping function that facilitates the translation from the training
framework to the inference framework [8].

In today's modern era, there is an increasing demand for high-bandwidth networks that
are beyond the handling capabilities of the host CPUs in the servers of internet service
providers (ISPs); this results in a performance gap between network bandwidth and
computing power. One answer is a heterogeneous system that couples an FPGA with a host
processor so that the FPGA acts as an accelerator for handling network packets. One such
system is FpgaNIC, an FPGA-based, GPU-centric SmartNIC. FpgaNIC provides virtual
ports to perform direct PCIe peer-to-peer communication with local GPUs, in addition to
providing a reliable 100 Gb hardware network in order to communicate easily with remote
GPUs. The FpgaNIC design has a low footprint, as it utilizes only 20% of the target FPGA
resources [9].

One of the earliest synthesizable open-source GPU designs in the market is the Nyami
GPU. The Nyami GPU is compatible with different rendering applications, through a
graphics pipeline that contains programmable vertex shaders. The Nyami GPU is also
equipped with a flexible cache hierarchy that allows the adjustment of different cache
properties such as cache associativity and cache levels. Furthermore, Nyami GPU inte-
grates an on-chip debugger to enable real-time performance monitoring [10].

CUDA is the most famous framework for programming GPU platforms; it is used by
the most popular deep learning frameworks, such as TensorFlow and PyTorch, as a backend
to train neural network models. Despite these facts, CUDA is compatible only with Nvidia
GPU architectures, which makes it very hard to program heterogeneous systems that
include GPUs from other vendors. Engineers and researchers work around this issue by
using source-to-source translators to convert CUDA code into a portable language
representation that can be executed on different architectures. A major limitation of this
approach is that it needs a lot of manual modification, causing applications targeting
non-Nvidia GPUs to have a long time to market. A framework called CuPBoP enables the
execution of CUDA kernels on non-Nvidia devices without the need for manual code
modifications, with a coverage rate of 69.6% on the Rodinia benchmark and a 56.6%
performance gain over the average framework on the market [11].

One of the frameworks designed to enable the execution of CUDA kernels on CPUs is
the COX (CUDA on X86) framework. The mapping of CUDA kernels is performed using
compiler IR transformations and a runtime system that supports warp-level programming
functionalities like warp shuffle, warp vote, and cooperative groups. The COX framework
uses a technique called hierarchical collapsing, which enables the execution of the warp-level
programming model by generating two types of loops: inter-warp loops and intra-warp
loops. COX uses barriers to identify parallel regions, providing synchronization and
preventing data hazards [12].

One of the scalable and energy-efficient vector processors built on the RISC-V
architecture is the Ara processor. This processor is designed to work with Ariane, a scalar
core. Ara utilizes RISC-V to support mixed-precision floating-point operations and extends
it further to support the SIMT (Single Instruction Multiple Thread) execution paradigm.
The microarchitecture of Ara consists of a main sequencer, a slide unit, a vector load/store
unit, and a variable number of identical lanes. Each lane is provided with a local register
file, operand queues, and execution units [13][14][15].

The integration of the FPGAs in the data center with the GPUs is a challenging task,
due to the need for manual assignment of tasks between FPGAs and CPUs. In addition,
the synchronization between multiple tasks executed on both sides needs a higher-level
supervisor system. CFUSE (Cluster Front-end USEr framework) is one of the frameworks
that try to solve this problem by employing different techniques to abstract the resources
within a heterogeneous system:
1. Using High-Level Synthesis (HLS) tools to translate high-level programming lan-
guages, such as OpenCL, into hardware description languages. This approach makes
the programming of FPGAs much easier for the average programmer.

2. Using Dynamic task allocation and migration strategies to divide the workload
between FPGAs and host CPUs.

3. Using efficient data-queue implementations to optimize the communication process
between the host CPU and the FPGA [16].

Heterogeneous systems are composed of different architectures such as CPUs, GPUs,


FPGAs, and fixed-point hardware. One of the languages that can be used to program
these types of architecture is OpenCL, which provides low-level APIs for programming
these diverse architectures. However, the problem is that each vendor has its specific
implementation of OpenCL, which slows down the process of developing projects that
use heterogeneous systems as a target. One of the frameworks that helps in solving this
issue is the Portable Computing Language (POCL), which is an open-source portable
OpenCL implementation. POCL enables the development of highly efficient executables
that can be run collaboratively across various devices from different vendors, such as
CPUs, GPUs, FPGAs, and specialized accelerators. By utilizing the POCL compiler, an
OpenCL program can be compiled on the common host processor, and the resulting
binaries are then transferred to the target architectures using Direct Memory Access (DMA)
engines, which ensures that the data is transferred in a non-blocking manner [17][18][19].
which will ensure that the data is transferred in a non-blocking manner[17][18][19].

The process of developing applications that target an FPGA is not trivial; it requires
in-depth knowledge of the underlying FPGA hardware. To facilitate the development of
such applications, an abstraction layer over the FPGA resources and operating-system (OS)
details is needed. One of the open-source libraries that provides this abstraction is OPAE
(Open Programmable Acceleration Engine). OPAE is a lightweight open-source C library
that offers a flexible computing environment and is built on top of the Intel FPGA software
stack. OPAE abstracts the details of the FPGA hardware and operating-system specifics, so
the programmer can develop software applications without being concerned with low-level
details. The library enables software programs to access FPGA device features, such as
pre-configured acceleration logic and functions for managing and reconfiguring the device.
By using OPAE, user applications can use FPGA-based acceleration transparently, as shown
in Figure 1.8. Intel also provides users with a simulation environment for the Accelerator
Functional Unit (AFU), which enables the user to access Intel's hardware resources and to
perform cycle-level simulations [20].

Figure 1.8: Abstracting and managing FPGAs [21].


Chapter 2

GPU Architecture

Figure 2.1 shows an abstract view of the GPGPU architecture. It consists mainly of
multiple cores with a shared L2 cache; each core implements a five-stage pipeline with fetch,
decode, issue, execute, and write-back stages. The GPGPU also comes with an extended
Convolution Unit, designed to accelerate the convolution operation. The upcoming sections
describe each subsystem of the core in detail.

Figure 2.1: High-level overview of the GPGPU architecture.


2.1 GPU Extension Unit


2.1.1 Extending RISC-V ISA
The GPGPU implemented in this project primarily relies on the RISC-V ISA. However,
extending the RISC-V architecture to support the GPU's single instruction multiple data
(SIMD) execution paradigm is non-trivial. The RISC-V ISA comes with a rich R-type
instruction format that supports several integer operations; however, as shown in Figure 2.2,
the R-type format can encode only 3 operands: 1 destination register and 2 source registers.
This poses a challenge when designing the ISA for the GPU based on RISC-V, because
many instructions require more than three registers. For example, the fused multiply-add
instruction F = a*b + c needs four operands: three source registers and one destination
register. Therefore, the R-type instruction format alone cannot encode these instructions
effectively. A potential solution to this issue is to utilize the R4-type instruction format, as
illustrated in Figure 2.3. This format introduces an additional source register, rs3, which
allows for greater flexibility in handling instructions with multiple operands. However, it is
important to note that adopting the R4-type format comes at the cost of reducing the total
number of instructions that can be encoded per opcode: specifically, from 1024 instructions
down to only 32.

Instruction   Operands
FMAdd         a, b, c
IMAdd         a, b, c, shift
MatMul        a, b, c
Interp        dx, dy, a, b, c
Tex           u, v, lod

Table 2.1: Some GPU instructions that require more than 3 operands.

Figure 2.2: R-type instruction encoding.

Figure 2.3: R4-type instruction encoding.

However, relying solely on the R4 instruction format to support GPU operations is insufficient. As indicated in Table 2.1, there are instructions that require more than four source operands, such as the immediate-multiply-add and interpolation instructions. To accommodate these complex instructions with more than three source operands, one potential approach is to leverage the available RISC-V custom opcodes to encode them. To address the need for a larger number of source registers, the new instruction encoding uses one of the four user-defined opcodes, 0x0B, 0x2B, 0x5B, and 0x7B, alongside the R4 format, which makes the maximum number of source registers available for these instructions. However, this approach still presents a challenge in passing the remaining source registers to the GPU. This problem can be solved using one of the following approaches:

• Input merging: The idea of the input-merging approach, as shown in Figure 2.4, is to combine two or more inputs into one source register, which reduces the overall number of registers needed to encode all the inputs. This can be done with a helper function that acts as a wrapper around the original instruction and merges its input arguments. This approach is useful for instructions whose operands are guaranteed to lie within a small range, so no data is lost when the arguments are merged together; a small sketch of such a wrapper is given after Figure 2.4.

• Function bits overloading: In some cases the input arguments are too large to be merged. An alternative to input merging is to overload the function bits so that they store additional source registers besides their original function, as shown in Figure 2.5. With this approach, two additional registers can be encoded in the R-type format and one additional register in the R4-type format. However, there are two important drawbacks to extending the number of operands this way. First, it reduces the total number of instructions that can be allocated to the given custom opcode, since bits must be reserved for storing the additional registers, which shrinks the space available for encoding instructions. Second, increasing the number of source operands introduces complexity to the backend of the pipeline, because reading an additional register requires modifications to the existing register file and adds complexity to the backend processing stages.

• Control Status Registers: The last option for extending the number of source operands is to split the execution of the instruction into two parts and pass the additional operands via CSRs (Control and Status Registers) before invoking the instruction. However, this solution is considered the least efficient for two main reasons. First, executing CSR instructions results in pipeline stalls, which can significantly impact performance: the pipeline has to pause and wait for the CSR instructions to complete before proceeding with the next instructions. Second, this approach adds an extra instruction to the pipeline for each invocation of the extension, which increases the instruction count and can negatively affect the instructions per cycle (IPC) metric, a measure of the efficiency of instruction execution.

Figure 2.4: Source registers extension via Inputs Merging.
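To make the input-merging idea concrete, the following C sketch (a hypothetical illustration, not the thesis's actual intrinsic API) shows a wrapper that packs two operands known to fit in 16 bits each into a single 32-bit source register, together with the matching unpacking that the hardware side would perform:

#include <stdint.h>
#include <stdio.h>

/* Input merging: two operands that are guaranteed to fit in 16 bits each
 * are packed into one 32-bit source register, so the custom instruction
 * still fits an R-type-style encoding with fewer source registers.       */
static inline uint32_t pack_uv(uint16_t u, uint16_t v) {
    return ((uint32_t)v << 16) | u;        /* v in the upper half, u in the lower */
}

/* Stand-in for the hardware side: the extension unit unpacks the merged
 * register back into the two original operands.                          */
static void unpack_uv(uint32_t uv, uint16_t *u, uint16_t *v) {
    *u = (uint16_t)(uv & 0xFFFFu);
    *v = (uint16_t)(uv >> 16);
}

int main(void) {
    uint32_t merged = pack_uv(100, 200);   /* one register instead of two */
    uint16_t u, v;
    unpack_uv(merged, &u, &v);
    printf("merged=0x%08x u=%u v=%u\n", merged, u, v);
    return 0;
}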



Figure 2.5: Overloading of function bits to extend the number of source operands.

Figure 2.6: Example of an intrinsic function.

2.1.2 GPU Hardware Extension


In the previous section, the software extensions necessary to extend the RISC-V processor into a GPU were discussed. However, these extensions are not enough to fully support the multi-threading execution paradigm. In order to exploit the benefits of the SIMT (Single Instruction, Multiple Threads) execution model, new instructions have to be added to the RISC-V ISA, along with hardware extensions to support them. The SIMT execution model exploits the data-level parallelism in a program: in many machine learning and scientific computing programs, the same code is repeated many times, with only the data it operates on changing. For example, performing convolution on an image involves multiply-and-accumulate (MAC) operations on different portions of the image. The SIMT model therefore provides a computing structure known as a warp, called a wavefront in other programming models' terminology. A warp, or wavefront, consists of a group of threads that share the same program counter (PC) and follow the same execution path, while each thread within the warp has its own set of general-purpose registers (GPRs). The width of the execution unit has been modified so that it has a number of operating lanes equal to the number of threads within a warp. The main advantage of the SIMT architecture is that operations like fetching, decoding, and issuing instructions are done once per warp, which avoids repeating this overhead for every thread and the associated performance and power loss.

2.1.2.1 Warp Scheduler


One of the necessary hardware extensions to support multi-threading is the warp scheduler. The warp scheduler resides in the fetch stage, as shown in Figure 2.7, and it is responsible for selecting the next warp to be scheduled and allocating the required hardware resources for it. The warp selection process uses two main structures:

• A set of warp masks that contain the status of each warp, from which the warp scheduler can determine the next ready warp.

• A wavefront table that contains private data about each warp, such as the thread mask or the next PC to be fetched from the instruction cache.

The warp masks are bit vectors that contain important information about the state of each warp; based on this information, the warp scheduler decides which ready warp to schedule next, as shown in Figure 2.8. The three main masks are as follows:

• Active mask: contains status bits that indicate whether a certain warp is active or not.

• Stalled mask: contains bits that mark the warps that cannot be scheduled right now because they are waiting for certain events, such as memory requests.

• Barrier mask: contains bits that indicate whether a certain warp is waiting for a barrier to be released or not.

Figure 2.7: Fetch stage architecture.



Figure 2.8: The process of selecting the next warp depending on different masks.
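The selection logic sketched in Figure 2.8 can be summarized by the following minimal C model, assuming one status bit per warp in each mask and a simple fixed-priority pick of the lowest-numbered ready warp (the priority policy is an assumption; the actual scheduler may rotate priorities):

#include <stdint.h>
#include <stdio.h>

/* One bit per warp in each mask (bit i corresponds to warp i). */
typedef struct {
    uint32_t active;    /* warp is currently active                */
    uint32_t stalled;   /* warp waits for an event (e.g. memory)   */
    uint32_t barrier;   /* warp waits at a warp barrier            */
} warp_masks_t;

/* Returns the id of the next warp to schedule, or -1 if none is ready. */
static int select_next_warp(const warp_masks_t *m, int num_warps) {
    uint32_t ready = m->active & ~m->stalled & ~m->barrier;
    for (int w = 0; w < num_warps; w++)      /* fixed-priority pick */
        if (ready & (1u << w))
            return w;
    return -1;
}

int main(void) {
    warp_masks_t m = { .active = 0xF, .stalled = 0x1, .barrier = 0x4 };
    printf("next warp: %d\n", select_next_warp(&m, 4));  /* prints 1 */
    return 0;
}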

2.1.2.2 Thread Masks and IPDOM Stack


The SIMT execution model assumes that threads within a warp take the same path and execute the same instructions, only on different data; however, this assumption does not hold for control-flow instructions. In Figure 2.9, half of the threads perform the addition instruction, while the other half perform the subtraction and multiplication instructions. This problem is known as control divergence, which occurs when threads within the same warp take different execution paths. To handle this problem, two structures are used: the thread mask register and the IPDOM stack. The thread mask is a hardware register containing a vector of bits that indicate whether a certain thread within a warp is active or not. If a bit in this vector is zero, the corresponding thread is deactivated: no further alterations are made to the register file associated with this thread, and no modifications are made to the cache memory based on the actions of that thread. To handle the control divergence problem, the IPDOM stack shown in Figure 2.7 is used. The IPDOM stack is controlled via Split and Join instructions. The split instruction has the form split Rs1, where Rs1 indicates, for the executing threads, whether the branch is taken or not. When a branch instruction is executed by a warp, the GPU unit runs a split instruction that performs the following steps:
• Determining whether the branch is taken or not for each thread, as shown in Figure 2.10.

• In cases where all the threads agree on one path, the split instruction does not affect the state of the current warp; in other words, it acts as a nop instruction.

• When both the taken and not-taken masks are non-zero, the following sequence occurs to handle the branch divergence:

  – The current thread mask is pushed alongside the original PC onto the IPDOM stack.
  – The alternate thread mask (the threads that take the other path) is pushed alongside the split PC onto the IPDOM stack.

• When a Join instruction is executed, an entry is popped from the IPDOM stack with two possible scenarios:

  – The predicate value for this entry is not taken, so the thread mask is the same as the original one and the warp continues executing from the original PC.
  – The predicate value for this entry is taken, so the thread mask is equal to the alternate thread mask and the warp continues executing from the split PC.

Figure 2.9: Divergent Code.

Figure 2.10: Taken and Not Taken Masks.

Figure 2.11: Predicate Mask.



Figure 2.12: Thread Mask.
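The split/join handling described above can be modelled in software as follows. The sketch below assumes one warp with a thread-mask word and a small stack of (mask, PC) entries, and it arbitrarily lets the taken side execute first; the real RTL structure and ordering may differ:

#include <stdint.h>
#include <stdio.h>

#define STACK_DEPTH 32

typedef struct { uint32_t tmask; uint32_t pc; } ipdom_entry_t;

typedef struct {
    ipdom_entry_t stack[STACK_DEPTH];
    int           top;      /* number of valid stack entries   */
    uint32_t      tmask;    /* current thread mask             */
    uint32_t      pc;       /* current (next) program counter  */
} warp_state_t;

/* SPLIT: 'taken' holds, per thread, whether the branch condition is true. */
static void do_split(warp_state_t *w, uint32_t taken, uint32_t split_pc) {
    uint32_t taken_mask     = w->tmask & taken;
    uint32_t not_taken_mask = w->tmask & ~taken;
    if (taken_mask == 0 || not_taken_mask == 0)
        return;                              /* no divergence: acts as a nop */
    /* reconvergence entry: original mask with the original PC */
    w->stack[w->top++] = (ipdom_entry_t){ w->tmask, w->pc };
    /* alternate-path entry: threads taking the other direction */
    w->stack[w->top++] = (ipdom_entry_t){ not_taken_mask, split_pc };
    w->tmask = taken_mask;                   /* continue with the taken side */
}

/* JOIN: pop one entry and restore its thread mask and PC. */
static void do_join(warp_state_t *w) {
    if (w->top == 0) return;
    ipdom_entry_t e = w->stack[--w->top];
    w->tmask = e.tmask;
    w->pc    = e.pc;
}

int main(void) {
    warp_state_t w = { .top = 0, .tmask = 0xF, .pc = 0x100 };
    do_split(&w, 0x3 /* threads 0,1 take the branch */, 0x200);
    printf("after split: tmask=0x%x\n", w.tmask);                 /* 0x3         */
    do_join(&w);                                                  /* other path  */
    printf("after join : tmask=0x%x pc=0x%x\n", w.tmask, w.pc);   /* 0xC, 0x200  */
    do_join(&w);                                                  /* reconverge  */
    printf("after join : tmask=0x%x pc=0x%x\n", w.tmask, w.pc);   /* 0xF, 0x100  */
    return 0;
}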

2.1.2.3 Warp Barriers


In the SIMT execution model, the warps run in parallel most of the time and largely do not depend on each other. However, to ensure correct execution and coordination between parallel tasks, some form of synchronization is needed. Here are some reasons why synchronization is needed:

• In GPU programs, data dependencies between different warps may exist, whether because they operate on shared data resources or because their results depend on each other.

• GPUs typically use a memory hierarchy with multiple cache levels; synchronization is therefore needed to ensure memory consistency, meaning that the results of operations from different warps are committed in the correct order, preventing inconsistent or unexpected results.

• As discussed before, threads in the same warp may diverge onto different paths. Synchronization points are required at which the diverged threads wait for each other, are collected, and continue executing together.

• Synchronization is useful for managing shared resources and avoiding warp conflicts.

In this project, synchronization is provided by what are known as warp barriers. A warp barrier is a synchronization point placed at a certain location in a program; when any warp passes through this point, it is halted until the required number of warps reach the same point. A data structure is kept alongside the warp scheduler, where each entry of this table contains private information about a barrier, such as:

• Validity bit: indicates whether a certain barrier is valid or not.

• Barrier ID: each barrier has a unique ID; the MSB of the ID indicates whether the scope of the barrier is local or global.

• Number of left warps: the number of remaining warps that need to reach the barrier before it is released.

• Stalled warps mask: a bit vector indicating which warps are currently waiting for this barrier to be released.

When a barrier instruction is scheduled, the underlying microarchitecture checks the count of warps that still need to pass through the same barrier ID before it can be released. If this number is not equal to one, the warp executing the barrier is stalled until the required number of warps is reached. The release mask keeps track of all halted warps; when a sufficient number of warps is reached, all paused warps are released and continue executing.
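The barrier bookkeeping can be illustrated with the following behavioural C sketch of a single barrier-table entry; the field names follow the list above, while the exact counting policy is an assumption consistent with the text:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* One entry of the barrier table kept next to the warp scheduler. */
typedef struct {
    bool     valid;         /* entry in use                               */
    uint32_t barrier_id;    /* MSB selects local vs. global scope         */
    uint32_t warps_left;    /* warps that still have to reach the barrier */
    uint32_t stalled_mask;  /* warps currently waiting on this barrier    */
} barrier_entry_t;

/* Called when warp 'wid' executes  bar Rs1, Rs2  (Rs2 = warps required).
 * Returns a mask of warps to release (0 means nothing is released yet).  */
static uint32_t barrier_arrive(barrier_entry_t *b, uint32_t wid,
                               uint32_t id, uint32_t warps_required) {
    if (!b->valid) {                        /* first warp to arrive       */
        b->valid        = true;
        b->barrier_id   = id;
        b->warps_left   = warps_required;
        b->stalled_mask = 0;
    }
    if (--b->warps_left == 0) {             /* last warp: release everyone */
        uint32_t release = b->stalled_mask | (1u << wid);
        b->valid = false;
        return release;
    }
    b->stalled_mask |= (1u << wid);         /* otherwise stall this warp   */
    return 0;
}

int main(void) {
    barrier_entry_t b = {0};
    printf("release=0x%x\n", barrier_arrive(&b, 0, 1, 3)); /* 0x0 */
    printf("release=0x%x\n", barrier_arrive(&b, 1, 1, 3)); /* 0x0 */
    printf("release=0x%x\n", barrier_arrive(&b, 2, 1, 3)); /* 0x7 */
    return 0;
}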

2.1.2.4 New Instructions added to RISC-V ISA


After discussing the required hardware extensions to enable multi-threading, here is
the summary of the new instructions that these hardware extensions support:

• Wspawn Instruction: This instruction is used mainly for warp control. It activates a certain number of warps at a given PC, making multiple instances of a certain program execute independently. This is useful for distributing the workload of a program among multiple warps, which increases the instructions per cycle (IPC) count. It takes the form of:

wspawn Rs1 Rs2

where Rs1 contains the number of warps to spawn, while Rs2 contains the program counter at which the warps are spawned.

• TMC Instruction: This instruction primarily serves as a means of thread control. Its purpose is to activate or deactivate threads within the warp by utilizing the thread mask register, which can also be accessed through the control status registers (CSRs). By employing this instruction, threads that are not engaged in productive work can be disabled. This capability is particularly advantageous for low-power embedded devices, as it conserves power by only utilizing the threads that are necessary for carrying out meaningful work. It takes the form of

tmc Rs1

where Rs1 contains the new thread mask to be applied. The GPU unit simply sets the thread mask to the data in Rs1.

• Split and Join Instructions: These instructions are used to handle the control divergence problem. SPLIT controls which threads to activate when a branch occurs, using a “taken” mask and a “not taken” mask. It also generates a “diverged” signal that indicates whether the control flow diverged, and then it pushes the thread mask and branch predicate results onto a special hardware structure called the immediate post-dominator (IPDOM) stack. It takes the form of

split Rs1

where Rs1 indicates whether the branch was taken or not. The Join instruction, on the other hand, pops the previously saved information from the IPDOM stack during re-convergence.

• Barrier Instruction: This instruction is used to ensure synchronization between different warps. It inserts a barrier at a certain program location where synchronization between warps is required; the condition to release the barrier is that a certain number of warps reach the barrier location. It takes the form of

bar Rs1 Rs2

where Rs1 contains the ID of the barrier, while Rs2 contains the number of warps that need to reach the barrier before it is released.

2.1.3 RTL implementation


Figure 2.13 shows a high-level overview of the GPU extension unit. The input port is designed similarly to the AXI protocol interface, so it provides a handshake mechanism that guarantees a smooth and proper flow of data. The GPU extension unit contains smaller sub-units; each sub-unit handles a specific hardware task, and the enable signal for each sub-unit depends on the type of instruction to be handled.

Figure 2.13: Architecture of GPU extension unit.

Figure 2.14: Output signals when the input instruction is TMC.



Figure 2.15: Output signals when the input instruction is Wspawn.

Figure 2.16: Output signals when the input instruction is Split.



2.2 Decode Stage


The decode stage comes after the fetch stage in the pipeline. The main role of this stage is to take as input the instruction fetched from the instruction cache using the PC generated by the fetch stage and to extract certain parameters from it. The decode stage takes a 32-bit instruction and determines whether it is valid; if it is, it raises a valid flag and extracts the different fields of the instruction as follows:

• Valid: this signal determines whether the instruction is valid to be executed or not.

• Program Counter (PC): the program counter of the current instruction.

• Next Program Counter (Next PC): the next program counter to be fetched after executing the current instruction.

• EX Unit: the type of processing unit used to execute the current instruction, as shown in Table 2.2.

• Rs1: the first source register operand.

• Rs2: the second source register operand.

• Rs3: the third source register operand, used when decoding instructions in the R4 instruction format, such as the floating-point multiply-add instruction.

• Rd: the destination register in the register file used to store the final result.

• Use Imm: use the immediate as the second source operand.

• Use PC: use the PC as the first source operand.
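The decoded fields listed above can be thought of as one record handed from decode to issue; the following C struct is an illustrative model of that output (the field widths are assumptions, not the exact RTL signal widths):

#include <stdint.h>
#include <stdbool.h>

/* Execution-unit selector, matching Table 2.2. */
typedef enum { EX_NOP = 0, EX_ALU, EX_LSU, EX_CSR, EX_FPU, EX_GPU, EX_CONV } ex_unit_t;

/* Illustrative model of what the decode stage hands to the issue stage. */
typedef struct {
    bool      valid;      /* instruction is valid for execution       */
    uint32_t  pc;         /* PC of the current instruction            */
    uint32_t  next_pc;    /* PC to fetch after this instruction       */
    ex_unit_t ex_unit;    /* which functional unit executes it        */
    uint8_t   rs1, rs2;   /* source register operands                 */
    uint8_t   rs3;        /* third source, for R4-format instructions */
    uint8_t   rd;         /* destination register                     */
    uint32_t  imm;        /* decoded immediate value (if any)         */
    bool      use_imm;    /* use the immediate as the second operand  */
    bool      use_pc;     /* use the PC as the first operand          */
} decoded_inst_t;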

Figure 2.17: Fetch-decode handshake.



Value Execution Unit


0 Nop
1 ALU
2 LSU
3 CSR
4 FPU
5 GPU
6 Conv Unit

Table 2.2: Different processing units.

2.3 Issue Stage


After decoding the instructions and extracting the data operands, the data signals do not proceed directly to the execution unit. Instead, there is an intermediate stage known as the "issue stage" between the decode and execute stages. Instructions in the pipeline are not always independent of each other; in many cases, two or more consecutive instructions depend on one another. To prevent significant pipeline delays, a special unit is required to manage data dependencies and enable out-of-order instruction execution. The issue stage serves the purpose of reordering instructions, scheduling the first ready instruction, and forwarding the operands to their respective execution units. The following section explains the different data hazards that may affect pipeline execution.

2.3.1 Data Hazards


Data hazards often occur when multiple instructions that have dependencies on each
other try to modify data operands at different pipeline stages. The inability to detect these
hazards could lead to potential race conditions. Figure 2.18 shows the three potential data
hazards:

• Read after write (RAW): The read-after-write hazard, also referred to as a true data dependency, happens when an instruction tries to read a source register while there is an ongoing write operation to the same register. The reading instruction would obtain the wrong (stale) data.

• Write after read (WAR): The write-after-read hazard, also referred to as an anti-dependency, happens when an instruction reads a register and a later instruction in the pipeline writes to the same register. This kind of dependency can cause problems in concurrent systems like GPUs when the two instructions are executed at the same time.

• Write after write (WAW): The write-after-write hazard, also referred to as an output dependency, happens when an instruction writes to a register and a later instruction in the pipeline writes to the same register.

Figure 2.18: Pipeline data hazard.

2.3.2 Dynamic Scheduling


A main limitation of a simple static pipeline is that it fetches instructions and executes them in order, which means that if a single instruction stalls, the instructions behind it cannot proceed in the pipeline. In our case, if there are dependencies between instructions in the pipeline, such as the data hazards discussed in the previous section, they will cause a stall; this affects a system like a GPU critically, as the stall leaves many processing units idle and reduces the system's efficiency. Suppose instruction y depends on an instruction x currently running in the pipeline. In that case, all instructions behind y have to stall even if they do not depend on x. For example, consider the code below:

mul r4, r2, r3
add r6, r4, r5
sub r9, r7, r8
Although the sub instruction has no dependence on the previous instructions, it cannot execute, because the pipeline has to stall due to the data dependency between the add and mul instructions. This limitation can be mitigated by extending the issue stage to support dynamic scheduling. Dynamic scheduling is an execution technique in which instructions are not executed based on their order of appearance; they are instead executed based on the availability of their data operands, taking into consideration the availability of resources. So instead of waiting for dependencies to be resolved, the processing core can schedule another instruction whose operands are ready. Figure 2.19 shows the difference between in-order and out-of-order dispatch: the ADD instruction has a dependency on the IMUL instruction, and in the case of out-of-order dispatch the first ready instruction is scheduled, so the pipeline does not need to stall until the data operands are available. This saves many cycles and increases the overall efficiency of the GPU.

Figure 2.19: In-order vs. Out-of-order Dispatch.

2.3.3 Scoreboard Unit


To make the processing cores able to schedule instructions in an out-of-order fashion, a special unit needs to be integrated with the issue stage. In the scoreboard unit, every instruction is tracked using a Universally Unique Identifier (UUID). An instruction is released only when the scoreboard unit has fully checked that there is no conflict between it and any other instruction in the pipeline; if the instruction has any kind of dependency, another valid instruction is scheduled instead, and the operands of the conflicting instruction are tracked until they are all ready, at which point the instruction is finally released. To maintain the flow of instruction execution, the scoreboard unit uses three important tables:
• Instruction Stage Table: this table keeps track of the position of each instruction in the pipeline.

• Instruction Metadata: this table stores metadata associated with each instruction, such as:

  – Operation: the type of operation this instruction performs on its source operands (ADD, SUB, MUL, ...).
  – Functional Unit: the type of unit that will execute the instruction (ALU, FPU, LSU, ...).
  – Program Counter (PC): the PC of the issued instruction.
  – Thread Mask: the status of the active threads.

• Register Status: the status of each register, i.e., whether an instruction is currently using it or not.
The architecture of the scoreboard unit, as depicted in Figure 2.20, utilizes multidimensional arrays to track the status of the different registers for each thread. The scoreboard unit is equipped with reservation buffers to store instructions awaiting their operands. When an instruction conflicts with another in the pipeline, the scoreboard unit pushes the instruction into the reservation station along with its metadata. To release an instruction from the reservation buffer, none of its operands may still be in use by other instructions. If this condition is met and there are sufficient resources, the ready signal's tag is broadcast to extract the associated instruction's metadata from the reservation buffer. The operands are then forwarded to the dispatch unit to be directed to the appropriate functional element within the execution stage of the pipeline.

Figure 2.20: Scoreboard unit.
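A simplified software model of the register-status tracking is shown below, assuming one busy bit per architectural register per warp: an incoming instruction conflicts if any of its sources or its destination is still marked busy. This is only a sketch of the idea, not the RTL scoreboard:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_WARPS 4
#define NUM_REGS  64   /* 32 integer + 32 floating-point registers */

/* Register-status table: inuse[w][r] is true while an in-flight
 * instruction of warp w still has to write back register r.      */
static bool inuse[NUM_WARPS][NUM_REGS];

typedef struct { uint8_t rs1, rs2, rs3, rd; } issue_req_t;

/* True if the instruction conflicts with an in-flight one (RAW or WAW). */
static bool has_conflict(uint32_t wid, const issue_req_t *r) {
    return inuse[wid][r->rs1] || inuse[wid][r->rs2] ||
           inuse[wid][r->rs3] || inuse[wid][r->rd];
}

static void reserve(uint32_t wid, const issue_req_t *r) { inuse[wid][r->rd] = true; }  /* on dispatch  */
static void release(uint32_t wid, uint8_t rd)           { inuse[wid][rd] = false; }    /* on writeback */

int main(void) {
    issue_req_t mul = { .rs1 = 2, .rs2 = 3, .rs3 = 0, .rd = 4 };  /* mul r4, r2, r3 */
    issue_req_t add = { .rs1 = 4, .rs2 = 5, .rs3 = 0, .rd = 6 };  /* add r6, r4, r5 */
    reserve(0, &mul);
    printf("add conflicts: %d\n", has_conflict(0, &add));  /* 1: RAW on r4  */
    release(0, mul.rd);
    printf("add conflicts: %d\n", has_conflict(0, &add));  /* 0: free to go */
    return 0;
}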



2.3.4 Register File


The register file is a very important structure in the issue unit, as it stores the data values of instruction operands. Designing a register file for a SIMD architecture is not a trivial task, because the warp scheduler fetches instructions from different warps, and each warp has its own set of parallel threads; this means that the register file has to provide each thread within a warp with a certain number of registers so that the thread can execute properly. The RISC-V ISA uses a total of 64 registers (32 for integer operations and 32 for floating-point operations), each 32 bits wide. This means that for a SIMD core that supports the execution of 4 warps and 4 threads, and that has to issue data for each of the three source operands (Rs1, Rs2, Rs3) in parallel, the total number of bits consumed by the register file is:

3 × (Warps_NUM × Registers_NUM × Threads_NUM × 32)

which results in a total of 3 × (4 × 64 × 4 × 32) = 96K bits. This huge amount of storage would consume a large number of the LUTs and flip-flops (FFs) present in the FPGA, which cannot be afforded, as the LUTs and FFs are needed to implement the other processing elements of the GPU. An alternative approach is to implement the register file using on-chip block RAMs (BRAMs) instead of LUTs; a single BRAM36E1 block can hold up to 1024 registers. Figure 2.21 shows how the BRAMs can be utilized to build the register file: it is divided into several banks, each bank serving a certain warp and containing several BRAMs to provide each thread with the required data values in parallel.

Figure 2.21: Implementing of Register File using BRAMs.



2.4 Execute Stage


2.4.1 Arithmetic Logic Unit (ALU)
In the world of digital design and microprocessors, the Arithmetic Logic Unit (ALU) is a fundamental unit in every central processing unit (CPU) due to its important role in executing arithmetic and logic operations on binary data, which nearly every program needs. Typically, ALUs implement a set of fundamental operations including addition, subtraction, bitwise logical operations (such as AND, OR, and XOR), and comparison operations (such as greater than, less than, and equal to); some advanced ALUs also support multiplication and division. Modern GPUs that execute applications with heavy workloads, such as gaming, scientific simulations, and data analysis, need a huge number of advanced ALUs with multiple lanes to be able to perform the fast calculations required by these programs.

2.4.1.1 ALU Operation


Figure 2.22 shows the main components of the ALU unit as blocks, with the data flow between them represented by arrows:

• Request Interface: this module is mainly an elastic buffer that receives the ALU input signals, such as the clock, reset, PC, and data operand signals, and then forwards these signals to the control unit so it can take the appropriate actions.

• Control Unit: this module has the crucial role of interpreting the input signals and determining the operations to be executed on the operand data. The flowchart depicted in Figure 2.23 illustrates the process of decoding these signals. The control unit first determines whether the operation is a branch operation or an arithmetic operation. If it is an arithmetic operation, the control unit further determines whether it is a signed or unsigned operation. If the operation is a branch, the control unit decides whether it is a static branch, such as the JAL or JALR instructions, which are always taken, or a conditional branch that may or may not be taken. These decisions guide the subsequent execution of the instructions and ensure the correct operation of the module.

• Arithmetic Unit: this unit performs the actual arithmetic and logical operations of the ALU based on the input signals. It also contains a multiplier and a divider unit to be able to perform complex tasks, such as image scaling or rendering applications that require many multiplication and division operations.

• Branch Unit: this unit is responsible for resolving the different static and dynamic branches and determines whether these branches are taken or not. It also contains a dynamic branch prediction module to reduce the number of clock cycles the processing unit has to wait before the branch destination is decided.

• Commit Unit: this module is responsible for writing the output signals of the ALU back to the register file and forwarding the new PC to the fetch engine in the case of a valid branch instruction.

Figure 2.22: Overview of ALU structure.

Figure 2.23: Flowchart of ALU control unit.

2.4.1.2 Hardware Multiplication

A fast and efficient hardware multiplier is crucial in the design of a high-performance GPU because it is used in pixel manipulation and digital signal processing (DSP) algorithms. The multiplier circuit can be designed using various techniques, such as shift-and-add, an array multiplier, or Booth's algorithm. In addition to performing basic multiplication operations, multipliers in GPUs can also be designed to support various advanced features, such as pipelining, parallel processing, and floating-point arithmetic. These features allow the GPU to perform complex mathematical calculations efficiently and quickly, enabling it to process large amounts of data in real time.
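As a reference model of the shift-and-add technique illustrated in Figure 2.24, the following C function walks over the multiplier one bit at a time and accumulates a correspondingly shifted copy of the multiplicand:

#include <stdint.h>
#include <stdio.h>

/* Shift-and-add multiplication: examine the multiplier one bit at a time
 * and accumulate the correspondingly shifted multiplicand.               */
static uint64_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t addend  = multiplicand;     /* shifted left each iteration */
    for (int i = 0; i < 32; i++) {
        if (multiplier & 1u)
            product += addend;
        addend     <<= 1;
        multiplier >>= 1;
    }
    return product;
}

int main(void) {
    printf("%llu\n", (unsigned long long)shift_add_mul(20, 30)); /* 600 */
    return 0;
}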

Figure 2.24: Shift and Add multiplication steps [22].

Figure 2.25: Multiplier circuit [23].

2.4.1.3 Hardware Division

A divider circuit is commonly used in the design of a GPU (Graphics Processing Unit) to perform certain operations on graphics data, such as scaling or texture mapping. For example, in scaling, the divider circuit can be used to divide the coordinates of the pixels in an image by a scaling factor to increase or decrease the size of the image. This can be useful for rendering images on displays with different resolutions or sizes. In general, divider circuits are useful in GPU design because they allow efficient and accurate mathematical operations to be performed on graphics data, which is essential for creating high-quality, realistic graphics in video games, virtual reality applications, and other visual computing applications. In this project, an algorithm known as the restoring division algorithm is used to implement the divider circuit. The restoring algorithm is one of the best-known algorithms used to perform division in computer architecture. It involves repeated subtraction of the divisor from the dividend, restoring the partial remainder whenever the result becomes negative. To keep track of the sign of the partial remainder, the algorithm uses an extra sign bit that is adjusted every clock cycle. The restoring algorithm calculates the quotient bit by bit, which means that a division of two 32-bit numbers takes 32 clock cycles to finish.

Figure 2.26: Flowchart of restoring division algorithm.
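A behavioural C model of the restoring division flow in Figure 2.26 is given below for unsigned 32-bit operands: in each of the 32 iterations the partial remainder is shifted left by one bit, the divisor is trial-subtracted, and the quotient bit is 1 only when the subtraction does not go negative (otherwise the remainder is kept unchanged, i.e. restored):

#include <stdint.h>
#include <stdio.h>

/* Restoring division for unsigned 32-bit operands (divisor != 0).
 * Produces one quotient bit per iteration, MSB first.              */
static void restoring_div(uint32_t dividend, uint32_t divisor,
                          uint32_t *quotient, uint32_t *remainder) {
    uint64_t rem = 0;                 /* partial remainder */
    uint32_t quo = 0;
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((dividend >> i) & 1u);   /* bring down next bit      */
        if (rem >= divisor) {                        /* trial subtraction fits   */
            rem -= divisor;
            quo |= (1u << i);                        /* quotient bit = 1         */
        }                                            /* else: keep rem (restore) */
    }
    *quotient  = quo;
    *remainder = (uint32_t)rem;
}

int main(void) {
    uint32_t q, r;
    restoring_div(100, 7, &q, &r);
    printf("q=%u r=%u\n", q, r);   /* q=14 r=2 */
    return 0;
}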

2.4.1.4 Branch Control Unit

This module within the ALU unit has the task of handling and evaluating the various branches executed by the processing unit. One of the major challenges in executing branch instructions lies in dealing with conditional instructions that are not always taken. As a result, the pipeline faces a stall: it is temporarily halted until a decision is made on whether to fetch the next instruction at the current PC plus 4 or to jump to the new destination generated by the branch instruction. This decision-making process is crucial for maintaining the correct flow of instructions and ensuring the smooth operation of the module. In a complex system like a GPU, where more than one instruction is fetched every clock cycle and the pipeline has grown deeper over the years, the cost of halting the pipeline is very high, and branch stalls cannot be tolerated. One way to address this problem is to predict the destination of a branch before it is finally evaluated. This scheme has two approaches: predicting the behavior of the branch statically or dynamically. A static branch predictor depends on the execution history of the program known at compile time; this scheme is very effective for branches with a bimodal distribution, which can be found in scientific computing programs that have many large loops, where the loop branch shows the same behavior for 99% of the execution time. The major limitation of static branch prediction is that its accuracy depends on the nature of the program and the frequency of its branches; for this reason, modern processors always deploy dynamic branch predictors. The most basic form of dynamic branch prediction uses a branch history buffer. This buffer is a small direct-mapped cache memory, indexed by the lower portion of the address of the branch instruction; each entry contains a bit that indicates whether the branch was taken the previous time it was executed or not. With this buffer, it is not possible to determine with certainty that the prediction belongs to the current branch, since another branch with the same lower-order address bits could have placed the prediction there. If the prediction hint turns out to be wrong, the bit in this position is inverted. The 1-bit branch predictor has a main drawback: it always makes a wrong prediction twice per loop, even if the branch is almost always taken. To overcome this limitation, a 2-bit branch predictor is used; in this scheme a prediction must miss twice before the predicted direction is changed. Figure 2.27 shows the finite state machine (FSM) of the 2-bit branch prediction scheme.

Figure 2.27: The states in a 2-bit prediction scheme.
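The 2-bit scheme of Figure 2.27 can be modelled with a saturating counter per branch-history-buffer entry, as in the C sketch below; the table size and PC indexing are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BHT_ENTRIES 256   /* direct-mapped branch history table (assumed size) */

/* 2-bit saturating counters: 0,1 -> predict not taken; 2,3 -> predict taken. */
static uint8_t bht[BHT_ENTRIES];

static bool predict(uint32_t branch_pc) {
    return bht[(branch_pc >> 2) % BHT_ENTRIES] >= 2;
}

/* Update after the branch outcome is known. */
static void train(uint32_t branch_pc, bool taken) {
    uint8_t *c = &bht[(branch_pc >> 2) % BHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    uint32_t pc = 0x80000010;
    for (int i = 0; i < 3; i++) {                 /* loop branch: repeatedly taken */
        printf("predict=%d actual=1\n", predict(pc));
        train(pc, true);
    }
    printf("predict=%d actual=0\n", predict(pc)); /* a single not-taken outcome    */
    train(pc, false);
    printf("predict=%d\n", predict(pc));          /* still predicts taken          */
    return 0;
}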

2.4.1.5 RTL verification


In this section, the ALU is tested by running different instructions on it and comparing the produced output with the expected results. The ALU is configured to support the execution of 4 parallel threads.

Figure 2.28: The multiplication operation on the ALU with 4 active threads.

Figure 2.29: The division operation on the ALU with 4 active threads.

Figure 2.30: The remainder operation on the ALU with 4 active threads.

Figure 2.31: The Branch Greater than or Equal (BGE) instruction operation, where all threads within the warp are active.

2.4.2 Floating Point Unit


To perform accurate computations and model natural phenomena, we must use floating-point numbers. Integer numbers are not accurate enough for many applications, especially in complex fields such as rendering, computational chemistry, graph analytics, and machine learning. To design a GPU that can be used to solve real-world problems, it must have a dedicated hardware core that can handle non-integer numbers with reasonable accuracy and perform different kinds of floating-point operations such as multiplication, division, addition, subtraction, and rounding. However, implementing floating-point operations in hardware is not trivial. Designers face several challenges, including:

1. Rounding errors: In the real world, floating-point numbers would theoretically require infinite storage to represent their full precision. In practice, only a finite number of bits is available to represent the precision of floating-point numbers. This can lead to rounding errors, which can accumulate over time, especially in applications that require a high level of accuracy, such as scientific computation programs.

2. Speed: Floating-point operations are typically slower than integer operations because they require additional steps to handle the sign, exponent, and significand of a floating-point number. This can affect the timing of the entire design, especially if the floating-point core lies on its critical path.

3. Area and power consumption: Floating-point units (FPUs) are typically large and consume a lot of area and power. This is why there was little support for floating-point extensions in low-power embedded devices in the past. However, advances in transistor scaling and manufacturing have made it possible to design and integrate FPUs with low-power architectures, which has opened up new possibilities for using floating-point computing in a wider range of applications.

2.4.2.1 Floating-Point vs Fixed-Point Representation


One of the important tradeoffs that a designer faces in the process of designing a
floating-point core is whether to use fixed-point arithmetic or floating-point arithmetic.
In this section, we will introduce the fixed-point number representation and the floating-
point representation will be explained in the next section. Fixed-point representation is
a way to represent real numbers with a finite number of bits. It is often used in systems
where power consumption and latency are more important than accuracy. A fixed-point
number consists of two parts: one for the integer part and one for the fractional part. A
point of a certain base (such as 10 for decimal or 2 for binary) separates the two parts, as shown in Figure 2.32.

Figure 2.32: The anatomy of the fixed point number [24].


Fixed-point numbers can be used to perform arithmetic operations such as addition, subtraction, multiplication, and division. However, there is a drawback to using fixed-point numbers: they can introduce rounding errors that can accumulate over successive operations.
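For illustration, the following C snippet uses an assumed Q16.16 fixed-point format (16 integer bits, 16 fractional bits): addition is plain integer addition, multiplication needs a corrective shift, and small constants such as 0.1 are only approximated, which is the source of the accumulating error mentioned above:

#include <stdint.h>
#include <stdio.h>

/* Q16.16 fixed point: value = raw / 2^16 */
typedef int32_t q16_16_t;

#define Q_FRAC_BITS  16
#define TO_Q(x)      ((q16_16_t)((x) * (1 << Q_FRAC_BITS)))
#define TO_DOUBLE(q) ((double)(q) / (1 << Q_FRAC_BITS))

/* Addition is plain integer addition; multiplication needs a re-scale. */
static q16_16_t q_mul(q16_16_t a, q16_16_t b) {
    return (q16_16_t)(((int64_t)a * b) >> Q_FRAC_BITS);  /* truncation = rounding error */
}

int main(void) {
    q16_16_t a = TO_Q(3.25), b = TO_Q(0.1);   /* 0.1 is not exactly representable */
    q16_16_t sum  = a + b;
    q16_16_t prod = q_mul(a, b);
    printf("sum  = %f\n", TO_DOUBLE(sum));    /* close to 3.35                      */
    printf("prod = %f\n", TO_DOUBLE(prod));   /* close to 0.325, with truncation error */
    return 0;
}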
The main advantage of floating-point representation is its wider dynamic range, which
is the range of numbers that can be represented. Floating-point numbers achieve this at
the expense of slightly less precision and slightly more storage space. This is because
floating-point numbers encode the position of the decimal point, which allows them to
represent a much wider range of values than fixed-point numbers. Fixed-point numbers
have limited or no flexibility, while floating-point numbers have greater flexibility. This
means that floating-point numbers can be used to represent a wider range of values and to
perform a wider range of operations. Floating-point hardware is also more accurate than
fixed-point hardware. This means that floating-point calculations are less likely to produce
rounding errors. Another factor that affects the choice between fixed and floating-point
formats is the precision – the size of the gaps between numbers. A Digital signal processor
(DSP) has to round every new number that it produces by a mathematical calculation
to the closest value that can be stored by the format that it uses. This rounding and/or
truncating of numbers causes quantization error or ‘noise’ - the difference between the
actual analog values and the quantized digital values. Since fixed-point processing has
larger gaps between adjacent numbers than floating-point processing, the round-off error
can be more significant. Therefore, floating-point processing offers much higher precision
than fixed-point processing, making floating-point processors the preferred DSP when
computational accuracy is essential.

The floating-point format has different applications depending on the data set require-
ments of video and audio applications. Floating Point units are useful for high-speed ob-
ject recognition systems, high-performance computer systems, embedded systems, and
mobile applications. For example, in medical image recognition, higher accuracy al-
lows the processing of various types of signal input from light, x-rays, ultrasound, and
other sources to produce output images with valuable diagnostic information. On the
other hand, floating-point devices are more suitable for the huge communications mar-
ket. FPUs can perform specialized trigonometric calculations that are widely used in
real-time applications such as motor control, power management, and communications
data management. The graphics processing units (GPUs) today mostly use IEEE 754-
compatible single-precision 32-bit floating-point operations for arithmetic operations in
the programmable processor cores, but some newer GPUs like the Tesla T10P also support
IEEE 754 64-bit double-precision operations in hardware.
The current floating-point core in this project supports IEEE 754 32-bit single precision
format. It also supports four rounding modes:

• Round to the nearest even (RNE)

• Round towards zero (RTZ)

• Round down towards negative infinity

• Round up towards positive infinity

These rounding modes are based on the RISC-V floating-point extension. The floating-point core also supports the following operations:

• Multiplication

• Division

• Subtraction

• Comparison

Encoding  Mnemonic  Meaning

000       RNE       Round to Nearest, ties to Even
001       RTZ       Round towards Zero
010       RDN       Round Down towards negative infinity
011       RUP       Round Up towards positive infinity

Table 2.3: Rounding mode encoding according to RISC-V ISA.

2.4.2.2 IEEE-754 Floating Point Representation

This section will explore the IEEE-754 floating point format in more detail, following
the comparison between floating point and fixed point representations in the previous
section. The IEEE 754 standard is a way of representing real numbers using a sequence
of digits. It does this by mapping the infinite range of real numbers to a finite subset with
limited precision. A floating point number can be characterized by the following[25]:

• Sign: a single bit that determines the polarity of the number, whether positive (+) or negative (−).

• Exponent: represents the range of possible powers of the radix, spanning from a minimum to a maximum value.

• Significand: also known as the precision or mantissa, it represents a fixed number of significant digits expressed in the radix.

• Radix: the base used for scaling, typically either two (binary) or ten (decimal).

When designing a floating-point representation, there is always a compromise between the fraction and exponent sizes, because adding a bit to one means taking a bit from the other. This is a tradeoff between precision and range: a larger fraction gives more accuracy, while a larger exponent allows a wider range of numbers to be represented. In general, all floating-point numbers can be represented in the IEEE-754 standard in the following format:

(−1)^sign × significand × radix^exponent

Figure 2.33: The IEEE 754 representation [26].



2.4.2.2.1 Binary Interchange Format Encoding Each binary interchange format represents a floating-point number with a unique encoding. To ensure uniqueness, the encoding process involves maximizing the value of the significand (denoted as "m") by reducing the exponent (denoted as "e") until either the exponent reaches its minimum value (denoted as "emin") or the significand becomes greater than or equal to 1. If, after this process, the exponent is equal to emin and the significand is between 0 and 1, the floating-point number is considered subnormal. Subnormal numbers, as well as zero, have a special encoding with a reserved biased exponent value. The binary interchange formats shown in Figure 2.33 represent floating-point data using three fields, ordered as follows:

• A 1-bit sign (S).

• A w-bit biased exponent (E), which is obtained by adding a bias to the actual exponent (e).

• A (t = p − 1)-bit trailing significand field digit string (T), denoted as d1 d2 ... dp−1, where the leading bit of the significand (d0) is implicitly encoded in the biased exponent (E).

To convert a real number to the binary interchange representation, the following steps are performed. For example, converting 85.125 to the binary interchange format proceeds as follows:

• Separate the whole part and the fractional part: in this case whole = 85 and fraction = 0.125.

• Convert the whole part to its equivalent binary representation: in this case the binary representation is 1010101.

• Convert the fractional part to its equivalent binary representation: in this case the binary representation is 001.

• Combine the two parts to form the final value: 1010101.001.

• Convert the resulting value to base-2 scientific notation: the binary point is moved to the left until the leading digit is 1 (normalization):

1010101.001 → 1.010101001 × 2^6

• Extract the different fields of the number: Exponent = 6, Mantissa = 010101001, and Sign = 0.

The sizes of the exponent and significand bits enable RISC-V to represent a wide range
of numbers, from very small to very large. This allows the GPU to perform real-world
computations, such as those used in scientific computing, engineering, and graphics.
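The binary32 field layout can be inspected directly on a host machine by reinterpreting a float's bit pattern, as in the C sketch below, which reproduces the 85.125 example (sign 0, biased exponent 6 + 127 = 133, mantissa 010101001 followed by zeros):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decompose an IEEE-754 binary32 value into its three fields. */
static void decompose(float f, uint32_t *sign, uint32_t *exp, uint32_t *mant) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* reinterpret the bit pattern  */
    *sign = bits >> 31;
    *exp  = (bits >> 23) & 0xFF;           /* biased exponent (bias = 127) */
    *mant = bits & 0x7FFFFF;               /* 23-bit trailing significand  */
}

int main(void) {
    uint32_t s, e, m;
    decompose(85.125f, &s, &e, &m);
    printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06x\n",
           s, e, (int)e - 127, m);
    return 0;
}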

Parameter                          Binary16  Binary32  Binary64  Binary128

Storage width in bits (k)          16        32        64        128
Precision in bits (p)              11        24        53        113
Maximum exponent                   15        127       1023      16383
Exponent bias                      15        127       1023      16383
Sign bit                           1         1         1         1
Exponent width in bits (w)         5         8         11        15
Trailing significand width (t)     10        23        52        112

Table 2.4: Comparison between different binary interchange formats.

2.4.2.2.2 Limitations of Binary Interchange Format Encoding While binary-based computers are prevalent, the significance of decimal computations cannot be overlooked. The decimal numeration system plays a vital role in various applications. A
survey was conducted on 51 commercial and financial organizations’ databases, which
encompassed financial processes like banking, billing, inventory control, financial analy-
sis, taxes, and retail sales. More than 456,420 columns containing numerical data were
examined to extract statistical information. The findings revealed that 55 percent of the
data were in decimal format, while an additional 43.7 percent were integers that could
have been stored as decimal numbers. Accuracy and appropriate rounding of these appli-
cations’ results are crucial for complying with human calculations and legal requirements.
In binary-based computers, decimal numbers need to be converted to or from binary rep-
resentation. However, due to the binary system’s limited precision and finite hardware
capabilities, exact conversions of decimal numbers may not be possible. Most fractional
numbers cannot be accurately represented in binary format. For instance, consider a dec-
imal number X; converting it to binary would require an infinite number of bits for precise
representation, which is not feasible. Hence, an approximation is made, and only an approximate value of X is stored. Consequently, any operations performed using this approximation will
yield inaccurate results, even if the arithmetic itself is correct. The conversion between
decimal and binary is typically carried out using software programs, which can introduce
delays. In addition to accuracy concerns, binary arithmetic introduces another problem,
namely the removal of trailing fraction zeroes. For instance, in the binary system, it is not
possible to differentiate between 1.5 and 1.50 due to the normalization nature of binary
representation. However, these trailing fraction zeroes are crucial for certain calculations,
particularly in fields like physics. For example, if the mass of an object is reported as
10.7 kg versus 10.700 kg, the two measurements are not equivalent. The former is accu-
rate to 0.1 kg, while the latter is accurate to 0.001 kg. Consequently, binary arithmetic
units are not directly suitable for financial applications and decimal arithmetic opera-
tions, as they produce results that are incompatible with legal requirements and human
expectations[27].

2.4.2.2.3 Special Cases and Exceptions Although IEEE 754 gives us the ability
to represent a wide range of numbers, some results of floating-point operations may be
too large or too small to be represented in the exponent field. The RISC-V ISA uses a
floating-point control and status register (FCSR) to deal with this problem. The least
significant five bits of the FCSR are used as exception flags, which are triggered when one
of the cases in Table 2.5 occurs.

Flag Mnemonic  Flag Meaning

NV             Invalid Operation
DZ             Divide by Zero
OF             Overflow
UF             Underflow
NX             Inexact

Table 2.5: Exception flags encoding.

Single-Format Bit Pattern                Value

(exp != 0) and (exp != 0xFF)             Normal
(exp == 0) and (significand == 0)        Zero
(exp == 0) and (significand != 0)        Subnormal
(exp == 0xFF) and (significand == 0)     Infinity
(exp == 0xFF) and (significand != 0)     NaN
is NaN and (significand[22] == 0)        Signaling NaN
is NaN and not Signaling NaN             Quiet NaN

Table 2.6: Special cases for binary single-precision floating-point numbers.

2.4.2.3 Floating Point Multiplication


To multiply two numbers in floating-point format, the following steps are needed:

• Add the exponents of the two numbers and then subtract the bias, which is equal to 127 in the case of the single-precision format.

• Multiply the significands of the two numbers.

• XOR the sign bits of the two numbers to determine the sign of the overall product.

• Convert the resulting product to a normalized floating-point number.

Suppose there are two numbers A and B in the normalized floating-point format

Z = (−1)^S × 1.M × 2^(E−bias)

Assume also, for simplicity, that the mantissa is only 4 bits wide while the hidden bit is still present: A = 0 10000100 0100 = 40 and B = 1 10000001 1110 = −7.5. The multiplication algorithm proceeds as follows:
1. The significands of the two numbers M1 and M2 will be multiplied

1.0100 × 1.1110 = 1001011000

2. Decimal point placement


10.01011000

3. Addition of the exponents E1 and E2

10000100 + 10000001 = 100000101



Figure 2.34: Flowchart of floating point multiplication.

4. The produced exponent is not the true one, because E1 = e1 + bias and E2 = e2 + bias, so the sum equals Et = e1 + e2 + 2·bias. To avoid adding the bias twice, the bias is subtracted from the produced exponent:

100000101 − 01111111 = 10000110

5. The sign is obtained as follows:

S = S1 xor S2 = 1

6. Collect the different fields of the product together; since the significand 10.01011000 is shifted right by one position during normalization, the exponent is incremented by one:

1 10000110 10.01011000 (before normalization)

1 10000111 1.001011000 (after normalization)

7. The mantissa is wider than 4 bits, so rounding by truncation is performed and the hidden bit is removed:

1 10000111 0010
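The worked example can be cross-checked with a small C model that multiplies two binary32 values field by field, following the same steps: XOR the signs, add the biased exponents and subtract the bias, multiply the significands with the hidden bit restored, normalize, and truncate the mantissa. The model handles only normal numbers and uses truncation rounding, so it is a sketch of the happy path rather than a complete multiplier:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t f2bits(float f) { uint32_t b; memcpy(&b, &f, sizeof b); return b; }
static float bits2f(uint32_t b) { float f; memcpy(&f, &b, sizeof f); return f; }

/* Field-by-field binary32 multiply (normal numbers only, truncation rounding). */
static float fp_mul(float fa, float fb) {
    uint32_t a = f2bits(fa), b = f2bits(fb);
    uint32_t sign = (a ^ b) & 0x80000000u;                 /* XOR of the sign bits   */
    int32_t  exp  = (int32_t)((a >> 23) & 0xFF)
                  + (int32_t)((b >> 23) & 0xFF) - 127;     /* remove the double bias */
    uint64_t ma   = (a & 0x7FFFFF) | 0x800000;             /* restore the hidden bit */
    uint64_t mb   = (b & 0x7FFFFF) | 0x800000;
    uint64_t prod = ma * mb;                               /* up to 48-bit product   */
    if (prod & (1ull << 47)) {                             /* 1x.xxxx: normalize     */
        prod >>= 1;
        exp   += 1;
    }
    uint32_t mant = (uint32_t)((prod >> 23) & 0x7FFFFF);   /* drop hidden bit, truncate */
    return bits2f(sign | ((uint32_t)exp << 23) | mant);
}

int main(void) {
    printf("%f\n", fp_mul(40.0f, -7.5f));   /* -300.000000 */
    return 0;
}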

2.4.2.4 Floating Point Addition


To perform the addition operation, the two operands must have the same exponent. To achieve this, significand alignment is carried out: the exponents of the operands are compared to determine which is larger, the difference between the exponents is calculated, and the mantissa of the operand with the smaller exponent is shifted right by a number of positions equal to that difference. The exponent of the final result is set to the value of the larger exponent. Next, the mantissas of the two operands are added using binary addition. If this addition produces a carry, the exponent is increased by one and the mantissa is set to the carry followed by the most significant 23 bits of the resulting mantissa. The details of the algorithm are shown in Figure 2.35.

Figure 2.35: Flowchart of floating point addition.

2.4.2.5 Floating-Point Rounding


A common problem arises when performing floating-point operations such as addition, subtraction, and multiplication: the resulting output is sometimes a real number that cannot be represented exactly in a finite number of bits. In such cases, the IEEE-754 standard provides guidelines for rounding these numbers to the nearest representable value. The IEEE-754 rounding modes supported by the floating-point core of the GPGPU are shown in Table 2.3.

2.4.2.5.1 Round Towards Zero One of the simplest rounding modes that IEEE-754 supports is rounding towards zero. This mode is easy to implement in hardware, but it has some special cases that require careful handling. For example, IEEE-754 defines a special case called NaN, which stands for Not a Number. A floating-point number is a NaN if its exponent bits are all ones and its mantissa is not zero, as shown in Table 2.6. A problem that occurs when rounding NaN values is that the mantissa of the NaN may have non-zero bits only at the least significant positions. Truncating these bits may change the NaN into a different representation, in which the exponent bits are all ones and the mantissa is zero; this representation corresponds to infinity according to the IEEE-754 standard. The IEEE-754 standard addresses this problem by differentiating between two types of NaN values, signaling NaN and quiet NaN, as shown in Table 2.6. A signaling NaN is a NaN in which the most significant bit of the mantissa is equal to zero; when a signaling NaN is detected, an exception is raised to prevent future errors that may occur due to truncation of the least significant bits. Rounding towards zero is very useful when dealing with subnormal numbers. A subnormal number is a number so small that the exponent bits are not enough to represent it, so if the last bits of the mantissa of a subnormal number are truncated, its value may become zero. This is acceptable because the non-zero part of the number is too tiny to fit in the finite number of bits used to represent floating-point numbers.

2.4.2.5.2 Round Towards Nearest Rounding towards zero is a simple and efficient
way to implement rounding in hardware. However, this mode can cause large errors in
computations, because it just discards the least significant bits. For example, if we round
the decimal number 7.9 to zero, we get 7 instead of 8, which is a big difference. Moreover,
the errors in floating-point computations can accumulate over time. Another mode is to
round the number to the closest value, either the next higher or lower value. The rule
for rounding to the closest value depends on the sign of the number. For positive num-
bers, rounding up means rounding towards positive infinity, while for negative numbers,
rounding up means rounding towards zero. The rules of rounding are shown in Table 2.7.

Rounding Mode Positive Number Negative Number


Round up Rounding towards +∞ Rounding towards zero
Round down Rounding towards zero Rounding towards −∞

Table 2.7: Rounding Rules.


However, there is a special case when the number lies exactly midway between the nearest larger and nearest smaller representable values; in this case, IEEE-754 breaks the tie by rounding the number to the nearest even value. The process of rounding floating-point numbers to the nearest even value involves the use of some special bits known as the sticky and guard bits. The steps for rounding a normalized floating-point number to the nearest even value are as follows:

1. Determining the rounding bit: The rounding bit is the bit position before the least significant bit (LSB); this bit will be retained after rounding. This position depends on the precision of the floating-point format being used.

2. Obtaining the guard bit: The guard bit is the bit located just before the round bit. It is set to 1 if there are any non-zero bits beyond the rounding point, which would cause a loss of data during rounding; otherwise, it is set to 0.

3. Determining the sticky bit: The sticky bit indicates whether there are any non-zero bits after the rounding point. It “sticks” to the result and affects the rounding decision when the number is exactly in the middle of two representable values.

4. The rounding decision is made according to the values of the round and sticky bits, as shown in Table 2.8.

Round bit  Sticky bit  Decision

0          0           Round down
0          1           Round down
1          0           Tie (round to even)
1          1           Round up

Table 2.8: Rounding Decisions.
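The decision table can be exercised with the following C sketch of round-to-nearest-even, using the common convention that the round bit is the first discarded bit and the sticky bit is the OR of all remaining discarded bits; on an exact tie the result is rounded towards the value whose LSB is even (zero):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Round 'value' to nearest even after discarding its 'extra_bits' LSBs. */
static uint64_t round_nearest_even(uint64_t value, unsigned extra_bits) {
    uint64_t kept   = value >> extra_bits;
    bool round_bit  = (value >> (extra_bits - 1)) & 1u;                 /* first discarded bit      */
    bool sticky_bit = (value & ((1ull << (extra_bits - 1)) - 1)) != 0;  /* OR of remaining bits     */
    bool lsb        = kept & 1u;                                        /* parity of the kept value */

    if (round_bit && (sticky_bit || lsb))  /* above half, or tie with odd LSB */
        kept += 1;                         /* round up (towards even on ties) */
    return kept;                           /* otherwise truncate (round down) */
}

int main(void) {
    /* Keep 4 of 8 significand bits in each example below.                      */
    printf("%llu\n", (unsigned long long)round_nearest_even(0xBB, 4)); /* 1011.1011 -> 12 (above half) */
    printf("%llu\n", (unsigned long long)round_nearest_even(0xB8, 4)); /* 1011.1000 -> 12 (tie, odd)   */
    printf("%llu\n", (unsigned long long)round_nearest_even(0xA8, 4)); /* 1010.1000 -> 10 (tie, even)  */
    return 0;
}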

2.4.2.6 RTL Design and Verification

Figure 2.36: The high-level overview of FPU.


The high-level overview of the floating-point core (FPU) is depicted in Figure 2.36. The FPU core comprises three main modules alongside an index buffer. The purpose of the index buffer is to store the metadata associated with each request (such as UUID, WID, TMask, PC, RD, and WB) in first-in, first-out (FIFO) order. This arrangement enables the FPU to handle requests in the same order they are received, ensuring proper synchronization between the operation results and the program being executed on the GPU. The FPU has three main modules (described in later sections) for the different operations; which module is enabled depends on the input opcode.

Both the input and output ports of the FPU have a handshake interface to synchronize the flow of data into and out of the FPU. This handshake protocol is similar to the one used in common bus protocols such as AXI:

ˆ The valid signal indicates that the data on the corresponding interface is valid and
stable. Once valid is asserted, it must not be deasserted until the handshake is
complete.

Instruction   Opcode   Operation
FPU ADD       0        Addition (op1 + op2)
FPU SUB       4        Subtraction (op1 - op2)
FPU MUL       8        Multiplication (op1 * op2)
FPU MADD      3        Fused multiply-add (op1 * op2) + op3
FPU NMADD     15       Negated fused multiply-add -(op1 * op2) - op3
FPU MSUB      7        Fused multiply-sub (op1 * op2) - op3
FPU NMSUB     11       Negated fused multiply-sub -(op1 * op2) + op3
FPU COMP      10       Comparison operation
FPU CVTWS     1        Signed integer to Float conversion
FPU CVTUS     5        Unsigned integer to Float conversion

Table 2.9: Instructions supported by FPU.

• The ready signal indicates that the interface is capable of processing data on the
following rising clock edge. Once valid and ready are asserted during a rising clock
edge, the transaction is complete.

• After a completed transaction, valid may remain asserted to provide new data for
the next transfer.

• The protocol direction is top-down. This means that ready may depend on valid,
but valid must not depend on ready.
One of the most important pieces of metadata is the operation tag field. Operation tags are
additional information attached to an operation; they can be used to track where
the results of the operation came from. Operation tags move through the FPU unchanged
and are always associated with the operation they were created with.
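A purely illustrative software reference model for the arithmetic operations of Table 2.9 is sketched below; the opcode values are the ones listed in the table, but the function itself is only a verification aid, not the RTL, and the conversion and comparison operations are omitted for brevity.

def fpu_ref(opcode, op1, op2, op3=0.0):
    # Software reference for the arithmetic operations in Table 2.9.
    if opcode == 0:   return op1 + op2             # FPU ADD
    if opcode == 4:   return op1 - op2             # FPU SUB
    if opcode == 8:   return op1 * op2             # FPU MUL
    if opcode == 3:   return op1 * op2 + op3       # FPU MADD
    if opcode == 15:  return -(op1 * op2) - op3    # FPU NMADD
    if opcode == 7:   return op1 * op2 - op3       # FPU MSUB
    if opcode == 11:  return -(op1 * op2) + op3    # FPU NMSUB
    raise ValueError("opcode not modelled here")

assert fpu_ref(3, 2.0, 3.0, 1.0) == 7.0            # fused multiply-add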

2.4.2.6.1 Fused Multiply Add Unit The fused multiply-add (FMA) unit is one
of the building blocks of the FPU core. This design choice of using a single module to
perform addition and multiplication operations has several advantages:
• FMA modules can perform multiplication and addition operations in a single in-
struction, which can significantly reduce rounding errors. This can result in im-
proved numerical accuracy, especially for iterative computations or when dealing
with large numbers.

• Many applications, such as neural networks, rely on the multiply-and-accumulate
operation.

• Combining multiplication and addition into a single step can reduce the instruction
count, which can improve performance by reducing latency and cache miss rates.

• FMA instructions can reduce power consumption compared to separate multiply
and add operations since they can perform both operations in a single step. This is
particularly beneficial in energy-constrained environments or mobile devices where
power efficiency is crucial.

The RTL of the FMA without pipelining is shown in Figure 2.38. This design has a long
critical path, which limits the maximum frequency at which the FMA can operate. The
solution to this problem is to pipeline the critical path (Figure 2.39), which increases the
maximum frequency of the system from 66 MHz to 160 MHz at the cost of additional latency.
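The accuracy benefit of performing the multiply and add with a single rounding can be reproduced with a small NumPy experiment. The fused behavior is emulated here by evaluating the product and sum in double precision and rounding once at the end, so this is only an illustration of the effect, not the FMA hardware itself.

import numpy as np

a = np.float32(1 + 2.0**-12)          # exactly representable in binary32
c = np.float32(-(1 + 2.0**-11))

# Separate multiply then add: the product is rounded to binary32 first,
# which drops its 2^-24 term, so the subsequent addition cancels to zero.
separate = np.float32(a * a) + c

# Emulated fused multiply-add: compute exactly in binary64, round once.
fused_like = np.float32(np.float64(a) * np.float64(a) + np.float64(c))

print(separate)     # 0.0
print(fused_like)   # ~5.96e-08 (2^-24), the information lost by double rounding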

Figure 2.37: RTL design of the Floating point multiplier.



Figure 2.38: FMA module without pipelining



Figure 2.39: FMA module with pipelining



2.4.2.6.2 Verification The steps of testing the FPU core are as follows:

1. Use a Python script to generate random floating-point vectors, as shown in Figure 2.40

2. Feed the generated floating-point numbers to the FPU

3. Perform different operations on the floating-point numbers

4. Extract the output and compare it with the Python reference output

Figure 2.40: Python script to generate random numbers in IEEE-754 floating point format
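A minimal script in the spirit of Figure 2.40 is sketched below; it is not the exact script used for the tests, and the file name fpu_inputs.hex is only a placeholder.

import random
import struct

def float_to_ieee754_hex(x):
    # Pack a Python float into IEEE-754 single precision and return the
    # 32-bit pattern as a hex string ready to be dumped to the testbench.
    bits, = struct.unpack('>I', struct.pack('>f', x))
    return f'{bits:08x}'

random.seed(0)
vec_a = [random.uniform(-1000, 1000) for _ in range(32)]
vec_b = [random.uniform(-1000, 1000) for _ in range(32)]

with open('fpu_inputs.hex', 'w') as f:
    for a, b in zip(vec_a, vec_b):
        f.write(f'{float_to_ieee754_hex(a)} {float_to_ieee754_hex(b)}\n')

# Golden reference for the multiplication test, compared against the FPU output.
golden = [a * b for a, b in zip(vec_a, vec_b)]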

Floating-point operations testing The multiplication operation is tested by multiplying
two floating-point vectors and comparing the output to the result produced by the same
operation in Python. The addition and subtraction operations are tested in the same way.

Figure 2.41: RTL simulation of multiplication of two floating point vectors



Figure 2.42: The output produced from the FPU vs the actual output

Figure 2.43: The error percentage produced from the multiplication



Figure 2.44: RTL simulation of Addition of two floating point vectors

Figure 2.45: The output produced from the FPU vs the actual output

Figure 2.46: The error percentage produced from the addition

Figure 2.47: RTL simulation of Subtraction of two floating point vectors



Figure 2.48: The output produced from the FPU vs the actual output

Figure 2.49: The error percentage produced from the subtraction



Figure 2.50: RTL simulation of converting IEEE-754 numbers to the nearest integer

Input Approximation
75.11204945036411 75
300.54744821509166 301
766.264255313085 766
878.5626430788918 879
-461.02451392541013 -461
-935.433141987377 -935
-187.70898012719522 -188
780.8973092710785 781
-209.06777986403586 -209
-483.19200494111806 -483
475.7076863444606 476
-157.2947256582371 -157
-815.9343442137628 -816
-274.09916577479737 -274
39.90017099455781 40
-720.6216768138403 -721
-869.4262660133307 -869
-108.0196437235594 -108
658.7183789078322 659
810.2743781094275 810
-420.84335455008386 -421
349.998328741402 350
-417.7595467500603 -418
-644.9503193988262 -645
-60.75731384969754 -61
103.42021755982046 103
-121.12395568020906 -121
-836.2065186216659 -836
-288.78110431056166 -289
998.6286837046068 999
826.4287099445658 826
483.594587151362 484

Table 2.10: Rounding of different floating-point numbers to the nearest integer.

2.4.3 Load Store Unit (LSU)


The Load/Store unit (LSU) is responsible for managing memory requests between the
core and the memory unit. Its primary function is to generate the various types of requests
to the data memory. The LSU utilizes two queues, namely the Load Queue (LDQ)
and the Store Queue (STQ), to handle load and store operations as shown in Figure 2.52.
The main controller in the LSU receives inputs from two interfaces. One interface
is dedicated to receiving signals from the data cache; these signals convey information
about the readiness of the data cache to accept requests or whether there is a valid response
from it. The other interface receives the requests generated by the issue stage for the different
memory operations; the request signals convey the memory address that the request needs
to access, the data to be written to memory in the case of store instructions, and the ID
of the thread that sent the request. The load/store queues store the different instructions
alongside their metadata until the data cache is ready to receive them; once an instruction
is committed it is dequeued to free space for other instructions.

2.4.3.1 Address Generation Unit


This unit is used to generate addresses that can be used to access the memory unit.
It deals with the problem of misaligned memory addresses, which occurs when the
request address is not divisible by the memory word size; for example, if the memory
word consists of 4 bytes, then the address must be divisible by 4 or the access can lead
to an error. The address generation unit solves this problem by issuing separate
word-aligned accesses. It can also determine whether the request address belongs to a certain
range of non-cacheable addresses and insert a flag in the address so that the
cache bypasses it directly to the main memory. Non-cacheable addresses are utilized
to maintain data consistency among the different cores within the GPU. By designating an
address as non-cacheable, it is ensured that a core does not keep a local copy of the data
referred to by this address; consequently, the core cannot modify the content without
the knowledge of the other cores. This approach helps prevent the inconsistencies and conflicts
that might arise from multiple cores attempting to modify the same data simultaneously.
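The behavior described above can be summarized by the following Python sketch, which splits a possibly misaligned request into word-aligned accesses and flags those that fall into a non-cacheable window; the window base and size used here are made-up placeholder values, since the real range is a configuration parameter of the GPU.

WORD_BYTES = 4
NC_BASE, NC_SIZE = 0xFFFF0000, 0x10000      # hypothetical non-cacheable window

def split_request(addr, nbytes):
    # Break a request into word-aligned accesses and tag non-cacheable ones.
    accesses = []
    end = addr + nbytes
    a = addr - (addr % WORD_BYTES)           # align down to a word boundary
    while a < end:
        non_cacheable = NC_BASE <= a < NC_BASE + NC_SIZE
        accesses.append((a, non_cacheable))
        a += WORD_BYTES
    return accesses

# A 4-byte access at 0x1002 crosses a word boundary:
# it is split into two aligned accesses at 0x1000 and 0x1004.
print(split_request(0x1002, 4))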

Figure 2.51: Bypass of non-cacheable addresses to the main memory.



Figure 2.52: The Load/Store Unit Architecture.

2.4.3.2 Communication between LSU and Memory Unit


To facilitate seamless data transfer and prevent errors, the communication between
the Load/Store unit (LSU) and the memory unit employs a handshake mechanism. The
communication process shown in Figure 2.53 proceeds as follows:

• If the memory unit is ready to receive a new request and there is a valid memory
request in the LSU, a valid request flag is raised.

• After receiving a grant signal from the memory and provided the instruction queues
LDQ and STQ are not full, the LSU controller pulls rsp ready high.

• As soon as req valid is asserted, the LSU controller sends the address to be accessed
to the main memory, alongside the request data to be written in the case of a store
instruction.

• The memory controller responds by pulling the rsp valid signal high for one
clock cycle and sends the required data to the core.

Figure 2.53: Memory Transactions

2.4.3.3 Cache Prefetching


Prefetching is a computer architecture technique used to overcome the long latency of
memory accesses; the idea is to fetch instructions or data from their original location in
the main memory into the local cache before they are used. If the prefetch operation is done
accurately and early enough, it can greatly reduce or even eliminate the long latency
needed to access main memory and get rid of compulsory cache misses.
Prefetching can be done in two ways:

• Hardware Prefetch: The prefetch is typically done using dedicated hardware
integrated with the core, which analyzes the data stream and determines the
memory blocks that may be needed in the future.

• Software Prefetch: The prefetch is typically done by a compiler that stati-
cally analyzes the code before execution and inserts additional prefetch instruc-
tions at certain program locations.

The LSU implemented in this project supports only software prefetching: if a compiler
or profiler can analyze the code and determine suitable locations for prefetch instructions,
it can make use of the GPU ISA support for prefetch instructions. Adding a hardware
prefetch unit to the GPU cores will be considered in future work.

2.4.4 Control Status Register (CSR) Unit


To use the GPU in real-world applications and be able to support operating system
(OS) kernels, the GPU needs to use a special kind of register known as Control Status
Registers (CSRs). CSRs are additional registers used to store various information
about the processing core. The RISC-V architecture assigns a dedicated address space of
4096 registers to the CSRs; however, it does not use the whole address space and leaves
some registers unused so that users can create custom CSR instructions. Both privileged
and unprivileged programs can use the CSR registers. Privileged programs use these
registers to access control information, such as the addresses of interrupt service routines
(ISRs), or use them as safe storage for information when the OS switches from privileged
mode to user mode. The CSR registers are also utilized by unprivileged programs to access
performance counters used in debugging, or to obtain status related to floating-point
operations, such as error flags and the rounding mode employed in the various operations.
Table 2.11 shows the CSR registers that are supported by the GPU. To read or modify the
contents of a CSR register, the RISC-V ISA is extended with special instructions for
that purpose. Figure 2.54 shows the anatomy of the CSR instructions: they primarily
comprise a 12-bit identifier spanning bits 31 to 20, responsible for selecting the CSR
register, and additionally a 5-bit immediate value encoded within the uimm field.
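Assuming the standard RV32 CSR instruction encoding, the field extraction can be illustrated with the small Python sketch below; the example instruction word is a hand-assembled csrrs that reads MHARTID from Table 2.11 and is given only to exercise the decoder.

def decode_csr_instr(instr):
    # Field layout of a CSR instruction (Figure 2.54): CSR address in
    # bits 31..20, rs1/uimm in bits 19..15, funct3 in 14..12, rd in 11..7.
    return {
        'csr':    (instr >> 20) & 0xFFF,
        'uimm':   (instr >> 15) & 0x1F,
        'funct3': (instr >> 12) & 0x7,
        'rd':     (instr >> 7)  & 0x1F,
        'opcode': instr & 0x7F,
    }

# csrrs x5, mhartid, x0  ->  csr = 0xF14 (MHARTID in Table 2.11)
fields = decode_csr_instr(0xF14022F3)
assert fields['csr'] == 0xF14 and fields['rd'] == 5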

Figure 2.54: CSR extended instructions.



Number   Privilege   Name   Description


Machine Information Registers
0xF11 MRO MVENDORID Core ID
0xF12 MRO MARCHID Architecture ID
0xF13 MRO MIMPID Implementation ID
0xF14 MRO MHARTID Thread ID
Machine Trap Setup
0x300 MRW MSTATUS Machine status register
0x301 MRW MISA ISA and extension
0x302 MRW MEDELEG Machine exception delegation register
0x303 MRW MIDELEG Machine interrupt delegation register
0x304 MRW MIE Machine interrupt enable register
0x305 MRW MTVEC Machine trap-handler base address
User Floating-Point CSRs
0x001 MRO FFLAGS Floating-point operations status flags
0x002 MRO FRM Floating-point rounding mode
Machine SIMT CSRs
0xFC0 MRO NT Number of threads
0xFC1 MRO NW Number of warps
0xFC2 MRO NC Number of cores
Machine Performance-monitoring counters
0xB00 MRW MCYCLE Machine cycle counter
0xB02 MRW MINSTRET Machine instruction retired counter

Table 2.11: Control and Status Registers Map



2.4.5 Convolution Unit


2.4.5.1 Background
2.4.5.1.1 Convolution Neural Networks Convolutional neural networks (CNNs)
are among the most commonly employed neural networks for running deep learning ap-
plications. Enhancing the speed of CNN execution has been a leading research area ever
since a CNN won the prestigious ImageNet challenge in 2012. Since then, CNNs have become
the benchmark for various applications, including medical pattern recognition, traffic fore-
casting, sentiment analysis, and speech recognition. Traditionally, the main structure of
a CNN is as follows:

• Convolution layers

• Pooling layers

• Fully connected layers

The convolution layer plays a vital role in extracting image features using small kernels
with adjustable parameters. The behavior of these kernels is governed by various factors
such as padding, stride, and kernel size. To meet the computational demands, multiple ker-
nels are utilized within each convolution layer. Figure 2.55 illustrates the 2D convolution
algorithm, revealing that it involves four nested loops. This emphasizes that convolution
is a computationally intensive task, and the traditional approach may not efficiently han-
dle larger image sizes. The convolution layer is the main bottleneck in accelerating the
execution of CNNs, as it consumes about 90% of the execution time.
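For reference, a naive Python version of the four nested loops of Figure 2.55 (no padding, stride 1) is given below; it is an illustrative restatement of the algorithm, not the code executed on the GPU.

def conv2d(image, kernel):
    # Naive 2D convolution: the four nested loops referred to in Figure 2.55.
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            for i in range(kh):
                for j in range(kw):
                    out[r][c] += image[r + i][c + j] * kernel[i][j]
    return out

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k = [[1, 0], [0, 1]]
print(conv2d(img, k))   # [[6.0, 8.0], [12.0, 14.0]]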

Figure 2.55: 2D convolution code.

2.4.5.1.2 Systolic Arrays One of the proposed architectures to accelerate the ex-
ecution of the convolution operation is the Systolic Array. The systolic array is a 2D grid
of interconnected blocks known as processing elements (PEs), as shown in Figure 2.56,
and each PE is capable of performing simple operations such as a multiply-and-accumulate
(MAC) operation. The data in the systolic array flows in a pipelined fashion between the
different PEs of the 2D grid. This way of moving data solves the problem of the high
bandwidth that the memory system would otherwise need to handle the huge number of
load/store instructions required for the intermediate results, as the only cells that communi-
cate with the outside world are the boundary ones. At first glance a systolic array looks
like a pipeline system, but the key difference is that the systolic array can be configured
as a 2D array in which the data can flow at different speeds and in different directions;
moreover, both the inputs and the results can flow through different cells of the systolic
array, in contrast to a pipeline system in which the only flowing signals are the results.

Figure 2.56: A high-level overview of a traditional 3x3 systolic array.


Despite the simplicity of the systolic array hardware and the high operating
frequency it can reach, this kind of architecture has several limitations that prevent it
from being used directly as an accelerator for the convolution operation:

• Systolic arrays are designed mainly to accelerate matrix multiplication, so their
architecture may not be optimal for executing convolution workloads. To support
convolution on systolic arrays, further modifications are needed to convert the
convolution into a matrix multiplication, and these modifications may slow the
development of deep learning applications that depend mainly on convolution.

• Special FIFO buffers are needed to store the image and kernel parameters in order
to save memory bandwidth. These buffers require a dedicated circuit to control the
flow of data in and out of them, which consumes a large die area, even bigger than
that needed by the computing structures.

• Large kernels, whose size exceeds the capacity and bandwidth of the systolic
array, increase the rate of memory accesses, which causes many communication
bottlenecks.

• Another significant problem arises when the dimension of the weight matrix is not
evenly divisible by the MXU tile size. In such cases, the matrix needs to be divided
into multiple tiles; however, when the last row and column of tiles are reached,
not all the processing elements (PEs) are utilized. This results in inefficient
utilization of the systolic array resources. Figure 2.57 illustrates this problem.

Figure 2.57: Poor utilization of systolic array resources [28].

2.4.5.2 New Architecture Proposal


2.4.5.2.1 Overview The GPU implemented in this work comes with a new, dedicated
architecture for performing convolution workloads. The idea is to introduce an
architecture that reduces the number of load/store instructions while offering more flexibility
than a systolic array. The new convolution unit topology makes use of the newly added tmc
instruction, which extends the RISC-V architecture to support the SIMT execution model,
and of the banked register file, which can act as a buffer for the kernel and image values.

2.4.5.2.2 Convolution Unit Architecture The architecture of the convolution unit
integrated with the GPU to accelerate convolution operations is depicted in Figure 2.58.
This unit takes advantage of the parallel nature of the GPU, particularly its ability to
fetch multiple data using the same instruction. The convolution unit leverages the TMC
instruction, discussed earlier, to activate threads based on the kernel size (as long as the
kernel size is within the maximum number of threads). For example, if the kernel size is
three, the user can execute the tmc 7 instruction to activate three parallel threads, with
each thread fetching a value from the register file. To eliminate the need for complex
control circuits and buffers that consume significant die area, the unit utilizes a banked
register file. Each thread has its own set of registers, allowing independent data fetching
without the need for additional buffers. The data operands fetched from the register
file are then processed by a group of high-performance multipliers, which calculate the
product of each pair of data. Another module performs an addition reduction operation on
the product vector to obtain the final convolution value. The reduction module employs
the Kogge-Stone style prefix tree algorithm to compute the parallel prefix sum of the
input vector. This algorithm breaks down the input values into smaller subgroups and
performs parallel sum operations on each subgroup. The results of each subgroup serve
as inputs for the sum operations on the next level of subgroups, continuing until the
final result is computed, as shown in Figure 2.59. The Kogge-Stone algorithm utilizes a
tree-like structure, with each level corresponding to a different level of subgroups. Logic
gates and shift registers are employed to generate the final output. The main advantages
of this architecture include a reduction in the number of instructions and clock cycles
required to execute convolutions. Each 1D convolution can be performed in a single
instruction, and values are accumulated until the total number of kernel rows is reached.
Furthermore, the unit minimizes data commits. For example, a 3x3 convolution can be
divided into three 1D convolutions, with each convolution’s output accumulated until the
accumulator counter reaches three. The final data is then committed to the register files.
This approach contrasts with a typical architecture, where the result of each step’s product
is committed before fetching the data again to obtain the sum. Figure 2.60 illustrates that
the convolution unit effectively overlaps the latency needed to commit data, resulting in
a significant reduction in power consumption.
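The reduction step can be illustrated with a short Python model of a Kogge-Stone style prefix sum; it mirrors the level-by-level structure of Figure 2.59 in software and is only a sketch of the algorithm, not the RTL of the reduction module.

def prefix_sum_kogge_stone(values):
    # At step d, element i adds the element 2^d positions to its left
    # (when it exists); after log2(n) steps each entry holds the inclusive
    # prefix sum, and the last entry is the full reduction.
    x = list(values)
    d = 1
    while d < len(x):
        x = [x[i] + (x[i - d] if i >= d else 0) for i in range(len(x))]
        d *= 2
    return x

products = [3, 1, 4, 1, 5, 9, 2, 6]           # e.g. per-thread products
print(prefix_sum_kogge_stone(products))        # [3, 4, 8, 9, 14, 23, 25, 31]
print(prefix_sum_kogge_stone(products)[-1])    # 31, the 1D convolution result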

Figure 2.58: Overview of convolution unit in the GPU.



Figure 2.59: Kogge–Stone adder.

Figure 2.60: Reducing the number of commits relative to the normal architectures.

2.4.5.3 Results
Figure 2.61 compares the hardware resources utilized by the convolution unit as the
number of threads within the core varies.

Figure 2.61: The resource utilization of convolution unit with different threads configuration.

Figure 2.62: Result of the execution of convolution on the GPU.



2.5 Commit Stage


The commit stage comes immediately after the execute stage. Its main
responsibility is to take the outputs of the execution stage and write them back to the
appropriate location in the register file, which makes it a critical component, as it updates
the state of the processor and makes the results of executed instructions available for use by
subsequent instructions. The commit stage consists mainly of two modules: writeback and
popcount.

2.5.1 Writeback
The writeback module receives requests from the various functional units; as with other
transactions in the GPU, requests are received using a handshake of valid and ready signals.
After the writeback module receives the different commit requests alongside their data, it
directs them to a stream arbiter to determine which source gets to commit its data.
Although the execute stage issues one operation at a time, an arbiter is needed because
the various operations take different numbers of clock cycles to finish, so at any time the
writeback module may hold data from multiple functional units that needs to be committed.
After a functional unit is granted permission to commit its data, a writeback request is sent
to the issue stage to update the register file.

2.5.2 Popcount
Popcount, short for "population count", is a module used to calculate the number of
1's in a binary vector. In the commit stage, this module is used to determine the number
of valid request signals from the different units; the number of valid requests is then sent
to the CSR unit to be written into a register used to monitor the instruction count.
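A one-to-one software equivalent of the module, shown only for illustration:

def popcount(valid_mask):
    # Count the set bits of a valid-request vector; the count is what gets
    # reported towards the instruction counter in the CSR unit.
    count = 0
    while valid_mask:
        count += valid_mask & 1
        valid_mask >>= 1
    return count

assert popcount(0b1011) == 3    # three functional units asserting valid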

Figure 2.63: Writeback module architecture.



2.6 Cache Sub-System


2.6.1 Background
For many years the performance gap between the speed of the CPU and the speed of
memory was huge, which made CPUs waste significant time waiting for data to be fetched
from memory instead of doing useful work. This problem created the need for high-performance
memory with low access latency, so that memory-bound applications such as matrix
multiplication can be executed without wasting a large portion of the program time waiting
for data. One of the solutions introduced for this problem is the use of what is known as
cache memory. Cache memory is a type of memory organization that depends on SRAM
technology; it has a higher access speed than other memory technologies such as hard disks,
flash memory, or even DRAM. However, cache memory cannot be used as the main memory
because the SRAM cell, shown in Figure 2.64, consists of 6 transistors, which makes it less
dense and more expensive than the DRAM cell shown in Figure 2.65, where each cell
consists only of a transistor and a capacitor. The ultimate goal for memory designers is to
achieve a memory with the following properties: large capacity, high speed, and low cost.
However, as shown in Figure 2.66, there is a trade-off between performance and cost:
the more performance is needed, the higher the cost per gigabyte (GB). Designers
of the modern memory hierarchy try to take advantage of the high-capacity but relatively
slow main memory and the faster but more expensive cache memory by integrating them in
one hierarchy, as shown in Figure 2.67. The combination of cache memory and main
memory results in a memory structure that approaches the speed of the cache memory and the
capacity of the main memory. The cache memory stores the portion of the data most frequently
accessed by the processor, therefore reducing the number of wasted cycles that the
CPU spends waiting for fetched data to arrive, which leads to an increase in the
overall performance of the system. The remaining data is stored in the main memory
and is retrieved in blocks when a cache miss occurs. The memory hierarchy is
a term that refers to the way different memory technologies (in capacity and
speed) are arranged together, with certain sizes, to form the modern memory system. In
Figure 2.67 the top memory element is the fastest and the smallest in capacity; going
from top to bottom, the speed is reduced and the capacity increases. The hierarchy
consists of cache levels above each other, where the last cache just before the main memory
is known as the last-level cache, then comes the main memory, and finally the bottom
and largest memory component, which may be composed of HDDs or SSDs.

Figure 2.64: Structure of SRAM cell.

Figure 2.65: Single memory cell and array [29].



Figure 2.66: Comparison of different storage devices in terms of cost per GB and performance
(IOPS) [30].

Figure 2.67: Modern Memory Hierarchy.



The processor fetches data from the memory hierarchy in a sequential manner. It first
accesses the top element to determine whether the required data is present or not; if the
data is found in the top-level cache, a hit occurs and the processor does not have to search
the lower-level memories. On the other hand, if the data is not found in the top-level
memory, a miss occurs and the processor searches for the required data in the next lower
element, where the access may again result in a hit or a miss; if a miss occurs, the processor
continues searching in the lower components. As the access goes further down the memory
hierarchy, the likelihood that a hit occurs increases, until the bottom component of the
hierarchy is reached, where the access is guaranteed to result in a hit (assuming the
desired data is present in the main memory). Here is an example to demonstrate the effect
of memory size on latency: assume a two-level memory hierarchy that consists of a single
cache and the main memory; the total latency of the system can be mathematically
represented by the following expression
Latency_T = P^hit_cache · Latency_cache + (1 − P^hit_cache) · Latency_{main memory}

where 0 ≤ P^hit_cache ≤ 1, Latency_cache ≪ Latency_{main memory}, and P^hit_cache
represents the hit probability in the cache.
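As a quick illustration with assumed numbers (not measured values): for P^hit_cache = 0.95, Latency_cache = 2 cycles, and Latency_{main memory} = 200 cycles, the effective latency is 0.95 · 2 + 0.05 · 200 = 11.9 cycles, an order of magnitude below the raw main-memory latency.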

From the above equation it is seen that the latency is dominated by the latency of
the main memory, so to decrease the overall latency the cache must have a high hit rate;
in the best-case scenario the hit rate is equal to 1, which makes the effective latency
equal to the latency of the cache memory. The hit rate is determined by
two factors:

• The size of the cache.

• The memory access pattern of the processor.

As the size of the cache increases, more data can be stored and therefore the chance that a
hit occurs is higher, but this comes at the cost of more access latency and more power con-
sumption. The second factor that affects the hit rate is the pattern by which the processor
accesses the memory: if the processor keeps fetching a small set of data over and
over, this data can be stored in the cache, so the subsequent accesses will produce hits,
increasing the cache hit rate; in this case a small cache is sufficient.

Many applications such as Matrix Multiplication and Convolution exploit two
types of locality:

• Temporal locality: It is a program access pattern where the same data is accessed
over and over in a short period. In this pattern, if a memory location is accessed
at the current time, there is a high probability that this location will be accessed in
the future.

• Spatial locality: It is a program access pattern where nearby locations are
accessed one location after another. In this pattern, if a certain memory location is
accessed at the current time, there is a high probability that the next locations will
be accessed in the future.

Figure 2.68: Pseudo code for matrix multiplication.

Figure 2.69: The memory access pattern of the different array elements: white means not yet
touched, light means older accesses, and dark means newer accesses.

2.6.1.1 Design Considerations


In the design process of a cache memory there are important considerations
that should be taken into account [31]:

• Technology Node: To integrate the cache memory with the processor on
the same chip and achieve high communication speed thanks to short interconnects, the
technology used to fabricate the cache memory should be the same as the one
used for the processor. Therefore SRAM technology, which consists of many
6T cells as shown in Figure 2.64, is used to construct the cache memory. SRAM has
two advantages over DRAM cells:

– SRAM can operate at very high speeds, comparable to the processor frequency.
– SRAM cells are based on CMOS logic, so they can be fabricated with the
processor on the same chip, which is not the case for DRAM cells, which cannot
be integrated with the processor on the same chip.

• Cache Size: The capacity of the cache is an important parameter in the design
of the cache memory. Large caches have a higher hit rate; however, this comes
with the drawback of increasing the total cost of the cache due to the increased
transistor count, and of increasing the latency as a result of the extra overhead
introduced by the large transistor count.

• Data Management: The data inside the cache is separated into individual entries
of the same size called cache blocks, as shown in Figure 2.70. Each block contains
an entry to store a unique label that identifies the block; this label is called the
tag. During the process of accessing data, the processor searches the tag entries of
each block to determine whether there is a tag that matches the address; if a matching
tag is found, a hit event occurs as described before. The tag bits are also used to describe
certain properties of the block:

– They can determine whether the block is empty or not.
– They can determine whether the block is modified or not.
– They can also be used to determine how recent the last access to this block was.

Figure 2.70: Cache Blocks.

2.6.1.2 Cache Associativity


The considerations discussed in 2.6.1.1 are not the only issues that designers should
consider when designing the cache organization. Another problem arises because the
system address space is much larger than the size of the cache; this makes it harder to
choose the block where a data chunk will be placed, as there are more data chunks than
available blocks. There are two extreme placement policies: the first one is called Direct
Mapping, in which a data chunk has only one block where it can be placed, and the other
is known as Fully Associative, in which a data chunk can be placed in any available block
of the cache. In addition to these extremes, there is a middle-range scheme called Set
Associative. In set-associative caches a block can be mapped to a fixed number of locations;
in other words, in an n-way associative cache a data chunk can be mapped to n different
blocks within the cache. The cache is divided into several sub-caches, each known as a
Set, and each set contains n blocks. When the cache is accessed, the data chunk is mapped
to a certain set using the index field of the address bits, and it can then be placed in any
block of that set. This placement scheme combines the properties of both direct-mapped
and fully associative caches. The main advantage of using associative caches over
direct-mapped caches is the reduction in the number of cache conflicts. To demonstrate
this, assume a direct-mapped cache composed of 8 entries in which data chunks are placed
according to the following equation

Position = (Block number) mod (Number of cache blocks)



In this case the addresses N and N+8 will always conflict in this type of cache. This
problem can be overcome by increasing the associativity of the cache: instead of having
8 sets with one block each, we can have 4 sets with 2 blocks, 2 sets with 4 blocks, or even
a single set with 8 blocks (a fully associative cache), as shown in Figures 2.72, 2.73, and 2.74.
The conflict miss rate is reduced as the associativity of the cache increases, but this
comes at the cost of more complex hardware, as more comparators, wider
muxes, and larger tag stores are needed.
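The conflict can be reproduced with a couple of lines of Python; the block counts below match the 8-entry example above, and the 4-set/2-way variant is only an illustration of how associativity removes the conflict.

NUM_BLOCKS = 8

def direct_mapped_position(block_number):
    # Placement rule from the equation above: exactly one candidate entry.
    return block_number % NUM_BLOCKS

n = 3
print(direct_mapped_position(n), direct_mapped_position(n + 8))  # both map to entry 3

# With 4 sets of 2 ways, blocks n and n+8 still share a set index but can
# coexist in the set's two ways, so the conflict miss disappears.
NUM_SETS, WAYS = 4, 2
print(n % NUM_SETS, (n + 8) % NUM_SETS)    # same set index, different ways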

Figure 2.71: Different logical cache organizations.

Figure 2.72: 2-way set associative cache: Blocks with the same index can map to 2 locations.

Figure 2.73: 4-way set associative cache: Blocks with the same index can map to 4 locations.

Figure 2.74: Fully associative cache: Any block can map to any location in the cache.

2.6.1.3 Replacement Policy


As discussed before the address space of the system is beyond the maximum capacity
that the cache can handle, this will lead us to a critical situation when a new chunk of
data arrives and there is no space in the cache. This problem is trivial in direct-mapped
caches because each chunk of data can only be mapped to one block, so the occupied
block which has the same index as the new data chunk will be evicted from the cache,
and the new chunk of data will replace it. The problem appears in the set and fully
associative caches, where the new data chunk can be mapped to multiple locations in
the cache, the replacement mechanism will be more challenging, in case all the potential
blocks that the new chunk of data can be mapped to are occupied. A replacement scheme,
that will not affect the hit rate of the cache in the future has to be chosen. One of the
famous and widely used replacement mechanisms is the Least Recently Used (LRU)
scheme. LRU exploits the concept of temporal locality in the cache, the block that has
not been accessed recently has a lower chance of being accessed in the future, therefore,
the victim block that will be replaced is the least recently accessed block in the set in
case of set-associative cache or the least recently accessed block among all cache blocks
in case of fully-associative cache. However, complex hardware is needed to keep track of
the last time each block is accessed which complicates the cache design very much. There
are possible alternative replacement policies that can be used instead of LRU such as:
• Least Frequently Used (LFU): The LFU policy does not track the recency of each
block access; instead it keeps track of the frequency of block accesses. The evicted
block is the block with the lowest access frequency.

• Hierarchical LRU: This policy divides the N-way set into M groups and keeps
track of the most recently used (MRU) group, and within that group it keeps track
of the MRU block.

• Victim-NextVictim: This policy simplifies the process of tracking each block
within the cache; it tracks only the victim block and the next block that has the
potential to be evicted.
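The promised sketch of LRU victim selection for a single set, written as a small Python model for illustration only (the real tracking is done in hardware):

from collections import OrderedDict

class LRUSet:
    # One set of an n-way set-associative cache with LRU replacement:
    # when the set is full, the least recently accessed tag is evicted.
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()            # tag -> data, oldest first

    def access(self, tag, data=None):
        if tag in self.lines:                 # hit: refresh recency
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.ways:      # miss with a full set: evict LRU
            self.lines.popitem(last=False)
        self.lines[tag] = data                # fill the line
        return False

s = LRUSet(ways=2)
s.access(0x1); s.access(0x2); s.access(0x1)   # tag 0x2 is now the LRU line
s.access(0x3)                                 # evicts 0x2
print(list(s.lines))                          # [1, 3]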

2.6.1.4 Handling Write Requests


The way the processor handles write requests can significantly affect the overall performance
of the system. When a processor sends a write request to the cache, the value
can be modified only in the cache, without updating the main memory with the new data.
This approach enhances the performance of the memory system, because the processor
does not have to wait for the huge latency of the main memory, but it causes an
inconsistent data problem due to the difference in data between the main memory and the
cache. In the case of our GPU architecture this problem is amplified, as the inconsistent
data in the main memory is shared among different cores. There are two approaches to
handle this problem:
• Write-Through Caches: A write-through cache forwards the data to the main
memory as soon as a write request occurs. This technique of updating the main
memory and the cache simultaneously is simple and reliable, and can be used in
applications where write requests are infrequent. The write-through approach
ensures that the data in the main memory is up to date, which helps in recovering
data in case of a power cut or system failure. However, this approach introduces a
high latency, as the processor has to wait until the request is written to the main
memory, which has a much higher access penalty than the cache memory. This
violates the primary reason for having a cache in the system, which is to reduce
latency as much as possible.

• Write-Back Caches: In contrast to write-through caches, the data is updated
only in the cache, without writing the modified data to the main memory. The data
is written to the main memory only when a block is evicted using one of the
previously discussed replacement techniques such as LRU or LFU. Write-back caches
enable the processor to write new data to the cache and continue executing normally
without waiting for a main memory access; they can also combine multiple writes to
the same block before eviction, which helps in reducing the bandwidth requirements
between the cache memory and the main memory, thus improving energy efficiency.
The main disadvantage of write-back caches is the difficulty of restoring the data
from the cache in the event of a power failure, which can be nearly impossible. The
blocks of write-back caches are extended with a special bit called the Dirty Bit;
this bit indicates the status of the cache block, i.e. whether it is modified (dirty)
or unchanged (clean). If the block is clean, there is no need to write the data back
to the main memory, which can save a lot of bandwidth (a minimal sketch of this
behavior follows the list).
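The dirty-bit behavior of a single write-back line, modelled in a few lines of Python purely for illustration (class and method names are invented):

class WriteBackLine:
    # One write-back cache line: writes only mark the line dirty; the data
    # is pushed to main memory when the line is evicted while dirty.
    def __init__(self):
        self.valid, self.dirty, self.data = False, False, None

    def write(self, data):
        self.data, self.valid, self.dirty = data, True, True   # no memory traffic

    def evict(self, write_to_memory):
        if self.valid and self.dirty:
            write_to_memory(self.data)        # single write-back on eviction
        self.valid = self.dirty = False

line = WriteBackLine()
line.write(0xAB)
line.write(0xCD)                              # two writes, still no memory access
line.evict(lambda d: print(f"writing 0x{d:X} back to main memory"))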

Figure 2.75: Handling write requests in Write-through cache.



Figure 2.76: Handling write requests in Write-back cache.

2.6.1.5 Multi-level Cache


The modern memory hierarchy tends to integrate several levels of caches below each
other in an attempt to decrease the miss penalty by fetching data from faster memories
with much lower latency than the main memory. Integrating multiple cache levels
with each other requires the consideration of several issues:

• The inclusion type between the different cache levels.

• How the write requests are handled.

• The types of the different caches.

In the memory hierarchy the main memory always holds a copy of the data present in the
cache; in other words, the relation between the cache and the main memory
is inclusive. However, the relation between multiple caches in the memory
hierarchy cannot always be said to be inclusive. The relation between the different caches
can take three shapes:

1. Inclusive

2. Exclusive

3. Non-inclusive

Inclusive and exclusive relations are two extremes: in the inclusive relation, the data
in the highest level of the cache is guaranteed to also be found in the lower cache levels,
whereas in the exclusive relation the data in the highest cache level is guaranteed
not to be found in any lower-level cache. The middle ground between these two opposite
extremes is the non-inclusive relation, where the data in the highest cache level may or
may not be present in the lower-level caches. The main advantage of the exclusive relation
is that it saves a great amount of cache capacity, because the data in one level is not
stored in any other level. On the other hand, the inclusive property simplifies the search
process, especially in multi-core systems: if certain data is to be searched for in the local
cache levels of a certain processor, only the largest cache in the hierarchy needs to be
searched, because it is guaranteed to store all the data present in the other levels. The
non-inclusive relation is less conservative than the other two types, so when inserting a
new chunk of data into a cache at a certain level there are no strict requirements to evict
this chunk from the other levels (as in the exclusive policy) or to copy it to the remaining
levels (as in the inclusive policy).

The second issue in multi-level cache design is the way write requests are handled. If
a block is written in a cache at a certain level, the block can be modified and contain data
that differs from the other copies in the lower-level caches; this inconsistent-data problem
is the same one discussed in 2.6.1.4 and can be handled using either of the two techniques,
write-back or write-through. The problem occurs in inclusive and non-inclusive caches,
where multiple copies of the data are present in more than one cache level.
The cache can also be divided into two types, the Instruction Cache and the Data
Cache; both are required by the processor to execute any program. Dividing the cache
into two parts has two advantages:

• It prevents one type of data from occupying the cache completely. The processor
needs both types of data to complete its execution; if one type is not in the cache
memory, the processor has to fetch it from the lower-level memories, leading to
extra latency. By dividing the cache into instruction and data caches, all the
information needed by the different parts of the processor is available at the highest
cache level with the minimal possible latency.

• Not every part of the processor works on the same type of data. For example,
the fetch engine is responsible for retrieving instructions, while the execute engine
operates on non-instruction data. By separating the cache into two parts, each cache
can be allocated near the specific part of the processor that requires its services,
which reduces the complexity of the interconnects.

2.6.1.6 Cache Coherence

The GPU designed in this project integrates several cores, and the presence of
multiple cores requires a special cache system that can deal with the problem of inconsistent
data seen by the different cores. To take a deeper look at the data-inconsistency problem,
consider Figure 2.77, which illustrates the memory hierarchy of a multi-core system: the
two processors fetch the value of the variable Var from the main memory into their local
caches. Suppose that CPU1 performs some operations on the fetched variable and modifies
its value, and the new value is then stored in the local cache of CPU1. The problem arises
when CPU2 tries to fetch the value of Var again, as shown in Figure 2.78: in this case
CPU2 sees the old value of the variable, which leads to errors when this variable is used
in further calculations. This problem creates the need for cache coherence, which
ensures that each copy of the data in the local cache of each processor is consistent.

Figure 2.77: The Cache Coherence Problem.

Figure 2.78: The Cache Coherence Problem.


A cache coherence protocol is a technique used to ensure that any change in the
value of a shared resource is propagated to the entire system. This protocol can be
implemented at the software level using one of the following instructions:

• FLUSH-LOCAL Instruction: This instruction is used to invalidate a certain
block in the processor's local cache, so that when there is an attempt to read this block,
the processor does not fetch outdated data but instead retrieves the most recent
data from the main memory or another processor's cache.

• FLUSH-GLOBAL Instruction: This instruction is used to invalidate all the
blocks in the different local caches corresponding to a certain address. This makes
sure that the other processors do not read stale data; instead, they read the data from
the main memory or from the cache of the processor that modified this address.

Implementing the cache coherence protocol in software simplifies the hardware, but it
makes developing applications that target multi-core systems harder and slower, as it
places more burden on the programmer. To make the cache coherence protocol invisible
to the programmer, the hardware is modified to support the cache coherence protocol
(CCP). The CCP is implemented in hardware using a snooping protocol, where the cache
controller of each processor monitors, or snoops, the shared bus to observe write requests
made by the different processors. If a cache controller detects a write request, it verifies
whether it has a local copy of the requested block; if it has one in its local cache, the cache
controller invalidates this copy before the other processor writes the new data to its local
memory. For this reason the snooping protocol is also called the Write Invalidate Protocol.
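A toy Python model of the write-invalidate idea is given below; the class and method names are invented for the illustration and do not correspond to modules of the GPU RTL.

class SnoopingCache:
    # Each cache watches write requests on the shared bus and drops its
    # local copy of any block written by another cache.
    def __init__(self, bus):
        self.lines = {}                        # address -> data
        bus.caches.append(self)

    def snoop_write(self, addr, writer):
        if writer is not self and addr in self.lines:
            del self.lines[addr]               # invalidate the stale local copy

class Bus:
    def __init__(self):
        self.caches = []
    def broadcast_write(self, addr, writer):
        for cache in self.caches:
            cache.snoop_write(addr, writer)

bus = Bus()
c0, c1 = SnoopingCache(bus), SnoopingCache(bus)
c0.lines[0x40] = 7; c1.lines[0x40] = 7         # both hold the block
c0.lines[0x40] = 8; bus.broadcast_write(0x40, c0)
print(0x40 in c1.lines)                        # False: c1's copy was invalidated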

Figure 2.79: The FSM of snoop protocol from the CPU side.

Figure 2.80: The FSM of snoop protocol from the Bus side.

State       Definition
Invalid     A certain block in the cache is invalid, so the processor needs to
            fetch the block again from the main memory or from the local cache
            of another processor.
Shared      The cache contains the most recent copy of the data, and this data
            is shared among the main memory and the caches of the different
            processors within the system.
Exclusive   The value of the data copy in the cache is the same as that of the
            main memory.

Table 2.12: Definition of different states present in the FSM of snooping protocol.

2.6.1.7 Shared Caches


Multi-core systems introduce a new element in the cache hierarchy, the Shared
Cache. A shared cache, shown in Figure 2.81, is a cache that is shared by multiple
processors in a multi-core system. It can be present at different cache levels: for example,
it can be found at the last cache level, located just before the main memory, or at an
intermediate position between the cache closest to the processor and the last-level cache.
A shared cache can be partitioned in different ways, using either static or dynamic
partitions. Shared caches have some advantages over traditional private caches:

• They prevent the waste of resources: when a processor is idle, the shared memory
can be used by the other cores, thus solving the problem of resource under-utilization.

• A data chunk is stored only once in the cache memory, in contrast to a dedicated
cache system, where the data needs to be written in every local cache of each
processor, thus solving the problem of data redundancy.

• They reduce the communication rate between the different cores, so the bus handles
less traffic, leading to a decrease in the complexity of the interconnect design.

Figure 2.81: Shared memory among different cores.

2.6.2 Cache Architecture


To handle the high memory pressure produced by the different cores within the
GPGPU, the GPGPU is equipped with an advanced non-blocking, high-bandwidth cache. This
cache is set-associative and divided into several banks to allow different accesses to be
served concurrently. The cache uses a snoop protocol to enable cache coherence through
efficient flushing of data on CPU accesses. The cache reduces memory pressure by utilizing
multi-banking techniques together with virtual ports. This design approach allows each
bank in the cache system to handle multiple requests independently; moreover, each bank
integrates a structure called a miss status holding register (MSHR) so that it can keep working
when a miss occurs, thus making the GPGPU cache non-blocking. The cache
of the GPGPU comes in a flexible and configurable form, as it allows modification of
many architectural aspects such as the cache size, the number of banks, the bank line size, and
other parameters. The cache is also designed to be usable as an instruction cache, data cache,
L1 cache, L2 cache, or L3 cache, making it suitable for many system requirements. The
cache memory is divided into sub-banks, each of which can work as an independent cache.
Each bank receives the requests sent to it from the processor through a module known as
the Core Request Bank Select; after receiving the requests, the bank serves them in four
pipelined stages:

• Schedule

• Tag Access

• Data Access

• Response Merge

After the bank finishes serving the different requests, a module called the Core Response
Merger combines the different responses and sends them to their corresponding cores.
The architecture of the cache is shown in Figure 2.82. In the upcoming sections, each
module of the cache is discussed.

2.6.2.1 Core Request Bank Select


The GPU supports the SIMT (Single Instruction Multiple Thread) paradigm, which
results in several threads executing in parallel; this causes a problem when

Figure 2.82: The Cache Architecture.

several threads try to access the cache memory: if these accesses are not
coordinated and synchronized, the rate of conflict misses and the delay of cache accesses
increase, causing a large loss in overall performance. To avoid these problems, a module
called the Core Request Bank Select is used to orchestrate the cache memory accesses.
The Core Request Bank Select module receives all the incoming requests from the
different threads and distributes these requests fairly among the available banks; it therefore
ensures that the workload is divided equally among the different banks without overloading
a particular bank, which prevents under-utilization of the cache memory. This module also
provides an additional feature, namely the ability to track the number of clock cycles a
certain processor has to wait before its request is served, which helps to identify potential
bottlenecks in the cache subsystem. When the module receives a request address, it performs
certain checks on it: by extracting the different fields of the address, as shown in Figure 2.83,
it can determine the bank that the request is sent to and, in the case of a set-associative
cache, the set that this address belongs to. A potential problem occurs if two or more
threads share the same access credentials, such as the bank number, request tag, and set
number; this leads to a bank conflict, as shown in Figure 2.84, where more than one thread
is directed to the same bank. The bank selector module solves this problem by the use of
virtual multi-porting, a technique that divides the bank further into multiple ports that can
be accessed simultaneously. The bank selector checks the requests to the cache: if two
requests target the same bank and the same cache line and virtual multi-porting is enabled,
the selector resolves the requests to two different ports; if there are no available ports, the
requests are blocked. Figure 2.85 shows the pseudo code for the virtual port assignment
algorithm. The module is equipped with a built-in counter to keep track of the stalled
requests in each bank, or in each port when virtual porting is enabled.
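A simplified Python sketch of the address split and the virtual-port assignment is shown below; the field widths, bank and port counts are illustrative placeholders, and the logic is not necessarily identical to the pseudo code of Figure 2.85.

WORD_BITS, NUM_BANKS, NUM_PORTS = 2, 4, 2

def bank_of(addr):
    # Bank id taken from the address bits just above the word offset (Figure 2.83).
    return (addr >> WORD_BITS) & (NUM_BANKS - 1)

def assign_ports(requests):
    # Requests hitting the same bank get distinct virtual ports; once a
    # bank's ports are exhausted the remaining requests stall for this cycle.
    used, out = {}, []
    for addr in requests:
        bank = bank_of(addr)
        port = used.get(bank, 0)
        if port < NUM_PORTS:
            out.append((hex(addr), bank, port))
            used[bank] = port + 1
        else:
            out.append((hex(addr), bank, None))   # stalled
    return out

# Two requests to the same bank share it through ports 0 and 1; a third
# request to that bank stalls, while a request to another bank is unaffected.
print(assign_ports([0xdac0, 0xdac0, 0xfe424, 0xdac0]))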

Figure 2.83: Anatomy of the memory address.

Figure 2.84: The problem of bank conflict.



Figure 2.85: Pseudo code of the virtual port assignment.

2.6.2.1.1 RTL verification This section discusses the verification of the bank selec-
tor modules under different scenarios, suppose that there are 4 threads, that send requests
to the cache memory as shown in Table 2.13.

Request Address Data Tag


Request1 0xdac0 0x80000700 0x4
Request2 0xcdf13 0x80000600 0x3
Request3 0xfe422 0x80000500 0x2
Request4 0x12331 0x80000400 0x1

Table 2.13: Requests sent by different threads to the cache.

Figure 2.86 shows the simulation of the bank selector module configured to handle 4
requests and distribute them over 4 banks. The simulation shows the bank number to which
each request is forwarded, with the bank ID extracted from the address associated with
each request; it also shows the different output signals that represent the components of a
request, such as the data and the tag. Figure 2.87 shows the case when only 2 cache banks
are available while 4 requests need to be handled; two of these requests are therefore ignored
and only the first two requests are handled. Figure 2.88 shows the case when two requests
with the same address want to access the cache memory: since the bank selector in this
configuration does not support virtual multi-porting, the two requests conflict and the
second request is not handled. This problem can be solved by enabling virtual multi-porting,
as shown in Figure 2.89: the two requests that target the same bank are each assigned to
one of the available ports.

Request Address Data Tag


Request1 0xdac0 0x80000700 0x4
Request2 0xdac0 0x80000600 0x3

Table 2.14: Two requests sent to the cache with the same addresses.

Figure 2.86: RTL simulation of bank selector core, when the number of requests is equal to the
available banks.

Figure 2.87: RTL simulation of bank selector core, when the number of requests is higher than
the available banks.

Figure 2.88: RTL simulation of bank selector core, when two requests are sent to the cache
without enabling multi-porting.

Figure 2.89: RTL simulation of bank selector core, when two requests are sent to the cache
with the enabling of multi-porting.

2.6.2.2 Data and Tag stores


The data store is used to store the actual data written to the cache, on the other
hand, the tag store is used to store metadata associated with each request to the cache,
therefore the data can be easily retrieved from the cache. In high-traffic scenarios, the
data and tag stores can become potential bottlenecks since handling a large volume of
requests can lead to increased delays, ultimately diminishing the cache system’s overall
performance. Furthermore, scaling the capacity of these stores may be limited, as doing
so could result in a higher consumption of FPGA resources. To be able to implement
fast data and tag stores with high capacity without consuming a great amount of LUT
resources in the FPGA, BRAMs are utilized to build these stores. BRAM which stands
for Block Random Access Memory is a type of memory embedded in the FPGA fabric
to provide on-chip storage for relatively large data without consuming other elements in
FPGA fabric such as LUTs, Flip-Flops, and DSP units. A BRAM block typically holds
18 Kb or 36 Kb of data, and the number of available blocks varies depending on the FPGA
device. These BRAM blocks can also be configured to function as either single-port
or dual-port RAM.

2.6.2.2.1 Dual-port RAM Dual-port RAM is a type of hardware memory topology
that enables simultaneous read and write operations on distinct memory locations. It
utilizes two separate ports, each with its own set of read-and-write address lines, data
input lines, and data output lines. The two ports can read and write data at the same
time without interference or contention, which provides an efficient mechanism for dealing
with multiple data streams or events and enables high-speed processing of large amounts
of data. The applications executed on a GPU, such as image or video frame processing,
involve performing complex operations on large sets of data; these operations require
efficient management of data streams and high parallelism, so dual-port RAMs are used in
the GPU memory design to manage the parallel processing of large amounts of data.
Dual-port RAMs can be used in a GPU design to enable
simultaneous read and write access to the memory, allowing for efficient management of
the data streams. For example, in a GPU-based image processing application, dual-port
RAMs can be used to store the input and output image frames, which can be accessed
simultaneously by multiple processing units for parallel processing. Each processing unit
can access a different portion of the memory, enabling high-speed processing of the data in
parallel. Dual-port RAMs can also be used in a GPU design to implement shared memory,
which allows multiple processing units to access the same memory locations simultane-
ously. This can help to improve the performance and efficiency of the GPU design, as it
allows for efficient sharing of data between different processing units.

Figure 2.90: Dual-port RAM structure.

2.6.2.2.2 Single-port RAM A single-port RAM (Random Access Memory) is a type of memory that allows read and write access to a single memory location at a time. It consists of a single set of read/write address lines, data input lines, and data output lines. Single-port RAMs are commonly used in simple hardware designs that require basic memory functionality, such as storing data or program instructions, and are often found in microcontrollers, digital signal processors, and other embedded systems. In a single-port RAM, read and write operations must be performed sequentially, which can limit the performance and efficiency of the system when multiple data streams or events need to be managed. However, single-port RAMs are simple and easy to implement and are often sufficient for many basic memory applications. This type of memory is used to implement both the tag and data stores.

Figure 2.91: Single-port RAM structure.



2.6.2.2.3 RTL verification Dual-port RAM and single-port RAM can be implemented in Verilog by writing RTL code that instructs the synthesis tool to utilize the BRAM available in the FPGA. The implemented RAM module is flexible and can easily be configured according to user specifications through the parameters listed in Table 2.15.

Parameter        Description
DATAW            Data bus width
SIZE             The depth of the RAM
ADDRW            The width of the address bus
BYTEENW          The byte access behavior (number of byte lanes per word)
OUT_REG          Whether the output is registered or not
INIT_ENABLE      Controls the initialization process of the RAM
INIT_FILE        The path to the text file used to initialize the RAM

Table 2.15: Parameters description.

Figure 2.92 shows the RTL simulation of the dual-port RAM module when the INIT_ENABLE parameter is set; in this case the entries of the RAM are filled with the values read from the file specified by the INIT_FILE parameter. Configured this way, the RAM can act as a ROM that stores constant data, or as a photo memory that holds the input image to be processed later.

The memory can also be configured as a byte-addressable memory, which means that the processor can send requests that modify only some bytes instead of the whole word. The bytes to be modified are selected by the wren signal shown in Figure 2.90, whose width is determined by the BYTEENW parameter, i.e. the number of bytes within one word. Figure 2.93 shows the RTL simulation of the RAM module with BYTEENW equal to 4.
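
As a behavioral illustration of the byte-enable mechanism (not the Verilog module itself), the following short C++ sketch models one word of DATAW = 32 bits with BYTEENW = 4 byte lanes: only the bytes whose wren bit is set are overwritten, which is the behavior observed in Figure 2.93. The function and variable names are chosen for illustration.

    #include <cstdint>
    #include <cstdio>

    // Behavioral model of a byte-enabled write: wren selects which of the
    // four byte lanes of the 32-bit word are updated (BYTEENW = 4).
    uint32_t byteen_write(uint32_t old_word, uint32_t new_word, unsigned wren) {
        uint32_t result = old_word;
        for (unsigned lane = 0; lane < 4; ++lane) {
            if (wren & (1u << lane)) {
                uint32_t mask = 0xFFu << (8 * lane);
                result = (result & ~mask) | (new_word & mask);
            }
        }
        return result;
    }

    int main() {
        // Overwrite only the two lower bytes of the stored word.
        std::printf("0x%08x\n", byteen_write(0xAABBCCDD, 0x11223344, 0b0011));
        return 0;  // prints 0xAABB3344
    }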

Figure 2.92: RTL simulation of dual-port RAM when initialization is enabled.

Figure 2.93: RTL simulation of dual-port RAM when BYTEEN parameter is equal to 4.

2.6.2.3 Miss Reservation Stations

The performance of the cache subsystem is strongly affected by the pattern in which the processor accesses the cache. Applications such as sparse linear algebra and graph analytics have irregular access patterns, which increases the miss rate and thus reduces the overall performance of the cache system [32]. To address frequent cache misses, the cache must be capable of continuing to process requests from the processor while a miss is outstanding, rather than merely waiting for the miss to be resolved. A non-blocking cache solves this problem by applying optimizations that allow the cache to keep serving hits while a miss is being resolved. The first scheme, known as hit under miss and shown in Figure 2.94, reduces the miss penalty by handling hit requests until a second miss occurs, at which point the cache stalls. A more complex and effective optimization is miss under miss, shown in Figure 2.95, in which the cache overlaps the miss penalties of several misses by serving both hit and miss requests until one of the following events occurs:

• All instructions in flight are waiting for their operands and there are no ready instructions to execute.

• The write-back buffer of the cache controller is full.

Figure 2.94: Hit under miss scheme.

Figure 2.95: Miss under miss scheme.

Although non-blocking caches can increase system performance significantly, designing and implementing them is a complex task for two reasons:

• The potential collision between hits and misses. This problem does not arise in blocking caches, because the processor has to wait until all misses are resolved and no further requests are processed. In a non-blocking cache, however, a hit can collide with a miss returning from the lower levels of the memory hierarchy, and in systems that support multiple outstanding misses the misses can also collide with each other. To handle these cases, non-blocking caches give hits a higher priority than misses and may need to order the handling of colliding misses.

• The second challenge is keeping track of the misses that have occurred in the cache. This is trivial in blocking caches because only one miss is being resolved at any time, but it is not the case in non-blocking caches, where multiple outstanding misses are handled concurrently. At first glance it does not look like a hard task, since one might assume that the misses return in the same order in which they were issued, so that a simple queue would suffice to track them. However, this is not true in many scenarios: a miss in the L1 cache may cause either a hit or a miss in the L2 cache, and if the L2 cache is also non-blocking, the order in which requests are committed from the L2 cache to the L1 cache can differ from the original order of the requests. Additionally, in systems with multiple cores or multiprocessors, variations in cache access times further complicate the tracking of outstanding misses.

Modern processors use a set of registers named Miss Status Handling Registers
(MSHRs) to keep track of different cache misses. The number of MSHRs in the cache
corresponds to the maximum number of outstanding misses that the cache can handle
concurrently. Each entry in the MSHR stores some important information about the
cache position in which the miss has occurred, the tag value associated with this miss,
and the instruction that has triggered this miss. Figure 2.96 shows how the MSHR is
integrated with the non-blocking cache to handle cache misses.

Figure 2.96: Integration of the miss reservations with the cache memory.

To handle the cache misses properly, the MSHR supports five main functions as follows:

• Allocation operation: Allocation is the process of reserving a new entry in the MSHR table to handle a new memory miss. Finding a free entry in the MSHR is done using the tree comparator structure shown in Figure 2.98; the tree comparator determines the first valid entry to be allocated when a cache miss occurs. During allocation, certain information is enqueued into the assigned MSHR (Miss Status Handling Register) entry. This information includes data identifying the miss and the tag associated with the memory request. The tag is crucial for the memory system to pinpoint the cache location where the miss occurred, and it is later used to notify the instruction that generated the miss that the miss has been resolved and that its data operands are available for further execution.

• Lookup operation: This operation is used when a miss occurs to search the MSHR entries and identify whether there is already a pending request with the same address as the new miss. If so, there is no need to allocate a new entry in the MSHR; if there is no match, a new entry is allocated for the new cache miss. This operation is summarized in Figure 2.97.

• Replay operation: When the data arrives from the lower memory hierarchy, this operation notifies all MSHR entries whose tag matches the memory response that the data has arrived. The ready bit of the entries with matching tags is set to 1, indicating that these entries are ready to be dequeued from the MSHR.

• Fill operation: This operation occurs when the memory responds with the required data; the entry to fill with this data is determined by the memory ID, which is one of the inputs of the MSHR module.

• Dequeue operation: This operation removes a memory request from the MSHR entries once its corresponding data has arrived from the lower memory hierarchies. The entries considered are those whose valid and ready bits are both set to 1; when both signals are active, the dequeue ready signal is asserted to indicate that there is a ready entry to be removed from the MSHR.

Figure 2.97: The Lookup operation.



Figure 2.98: Comparator Tree.
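
The five MSHR operations described above can be summarized in a small software model. The sketch below is a simplified C++ reference model of an MSHR table, provided for illustration only (the RTL version uses the comparator tree of Figure 2.98 to find the first free entry in a single cycle); the field and method names are hypothetical.

    #include <cstdint>
    #include <optional>
    #include <vector>

    struct MshrEntry {
        bool     valid = false;  // entry holds an outstanding miss
        bool     ready = false;  // data for this entry has arrived (replay)
        uint32_t addr  = 0;      // block address of the miss
        uint32_t tag   = 0;      // request tag used to notify the issuing instruction
        uint32_t data  = 0;      // fill data from the lower memory level
    };

    class Mshr {
    public:
        explicit Mshr(std::size_t entries) : table_(entries) {}

        // Lookup: is there already a pending miss for this address?
        bool lookup(uint32_t addr) const {
            for (const auto& e : table_)
                if (e.valid && e.addr == addr) return true;
            return false;
        }

        // Allocate: reserve the first free entry (a linear scan stands in for the
        // comparator tree) and return its index, used later as the memory ID.
        std::optional<std::size_t> allocate(uint32_t addr, uint32_t tag) {
            for (std::size_t i = 0; i < table_.size(); ++i)
                if (!table_[i].valid) { table_[i] = {true, false, addr, tag, 0}; return i; }
            return std::nullopt;  // MSHR full: the bank must stall new requests
        }

        // Fill + replay: the memory response carries the ID of the allocated entry;
        // every entry with a matching address is marked ready.
        void fill(std::size_t id, uint32_t data) {
            uint32_t addr = table_[id].addr;
            for (auto& e : table_)
                if (e.valid && e.addr == addr) { e.data = data; e.ready = true; }
        }

        // Dequeue: pop the first entry whose valid and ready bits are both set.
        std::optional<MshrEntry> dequeue() {
            for (auto& e : table_)
                if (e.valid && e.ready) { MshrEntry out = e; e = {}; return out; }
            return std::nullopt;
        }

    private:
        std::vector<MshrEntry> table_;
    };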

2.6.2.4 Cache Bank


To increase the bandwidth of the cache and allow multiple accesses per clock cycle, which is required for executing intensive applications, the cache is divided into multiple banks. Each bank works as an independent cache that can serve requests concurrently with the other banks. The requests are mapped to the banks in an interleaved way, as shown in Figure 2.99. A further optimization to increase the number of bank accesses per clock cycle is to pipeline the bank access instead of using a single-cycle access. The stages of accessing a cache bank, shown in Figure 2.100, are as follows:

• Schedule Stage: The incoming core requests may not be the only source that needs to utilize the cache; several other sources may need to access it at a given time, such as:

– Snoop requests issued by the flush controller to invalidate certain cache entries.
– Miss reservation stations that need to dequeue entries whose data operands have been received from the main memory.
– A response arriving from another level of the memory hierarchy that needs to be stored in the cache.
– A request sent by the core that needs to be processed.

A priority arbiter in the schedule stage grants access to the different requests; the order of priority is as follows:

– Flush Request
– MSHR Dequeue Request
– Memory Fill Request
– Core Request

• Tag Access Stage: When a core request is sent to the cache bank, the tag portion of the physical address associated with the request is checked against the tag store; if it is present, a tag-match signal is asserted. In the case of a memory fill request, the tag store is filled with the tag associated with the memory request. A flush request clears the valid bit corresponding to a certain entry in the tag store.

• Data Access Stage: In this stage the actual data is retrieved if the request tag produced a match in the tag store. In the case of fill requests, the data coming from the main memory is stored in the data store entry corresponding to the memory ID associated with the memory request.

• Response Stage: In this stage the responses to the received requests are sent back to the cores through elastic buffers. In case of a miss, the memory requests are also sent to the lower levels of the memory hierarchy.

Figure 2.99: Access of Four-way interleaved cache banks using block addressing.

Figure 2.100: The stages of access the cache bank.



2.6.2.4.1 RTL verification Figure 2.101 shows the structure of the cache bank. The bank uses an elastic buffer to receive the different core requests and memory responses. The main reason for using an elastic buffer is to ensure smooth data transfer when connecting different domains, as the clock frequency may differ between the core side and the cache side, or between the main memory side and the cache side. The elastic buffer decouples the two sides (transmitter and receiver) and uses a valid/ready handshake to transfer data between them: when the sender has data to send it asserts the valid signal, and when the receiver is able to accept data it asserts the ready signal; the data transfer occurs when both signals are high. The bank can enter a deadlock state in two cases:

• A new request is ready to be processed but the MSHR is full.

• An incoming memory response arrives while the memory request queue is full.

To overcome this problem, a counter keeps track of the number of pending requests in the MSHR; if the MSHR is almost full, the bank asserts a signal that blocks new requests from being issued.
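
A minimal C++ sketch of the valid/ready behavior of the elastic buffer is given below (a software analogy for illustration, not the RTL implementation): a transfer happens only when the producer's valid and the consumer's ready are both high, and the counter-based backpressure described above amounts to deasserting ready when the MSHR is almost full. The class and method names are chosen for illustration.

    #include <cstdint>
    #include <deque>

    // Illustrative valid/ready elastic buffer of fixed depth.
    template <typename T>
    class ElasticBuffer {
    public:
        explicit ElasticBuffer(std::size_t depth) : depth_(depth) {}

        bool ready() const { return q_.size() < depth_; }   // can accept data
        bool valid() const { return !q_.empty(); }          // has data to offer

        // Producer side: the transfer occurs only when valid_in && ready().
        void push(bool valid_in, const T& data) {
            if (valid_in && ready()) q_.push_back(data);
        }

        // Consumer side: the transfer occurs only when valid() && ready_out.
        bool pop(bool ready_out, T& data) {
            if (valid() && ready_out) { data = q_.front(); q_.pop_front(); return true; }
            return false;
        }

    private:
        std::size_t   depth_;
        std::deque<T> q_;
    };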

Figure 2.101: High level overview of the cache bank structure.



The three requests in Table 2.16 are sent to the cache bank, where all the addresses associated with the requests cause a miss. Figure 2.102 shows how the bank responds to these requests: for each miss the bank sends a memory request to the lower level of the memory hierarchy, and each request is uniquely identified by an ID that indicates its position in the MSHR. Figure 2.102 also illustrates the main advantage of the non-blocking cache bank, namely that it can keep working during a stall without waiting for the response to arrive from the main memory. Table 2.17 shows the responses from the main memory; each response carries an ID corresponding to the MSHR entry that contains the address of the request waiting for the data. Figure 2.103 shows the bank responses to the core after the misses are resolved. The core response data is the same as the data received from the main memory.

Request Address Data Tag


Request1 0x1 0x0 0x2
Request2 0x2 0x0 0x3
Request3 0x2 0x0 0x1

Table 2.16: Requests sent to the cache bank.

Figure 2.102: RTL simulation of cache bank 1.

ID Data
0x0 0xaabbccdd
0x1 0xccddeeff
0x2 0xbbccddee

Table 2.17: Responses from the main memory.



Figure 2.103: RTL simulation of cache bank 2.

2.6.2.5 Core Response Merger


The response merger module ensures efficient handling of the responses produced by the different cache banks. When a request hits in one of the cache banks, that bank needs to send a response back to the core that issued the request. The core response merger chooses one tag at a time and transmits all the bank responses corresponding to that tag as a coherent batch. The merger selects the next tag to service using the following steps:

• The response merger scans all virtual ports of each bank to find a valid port.

• Valid ports are those that have valid requests that have not been serviced yet.

• The request signals from the valid ports are sent to the Find First Tree, which chooses the tag to be processed, as shown in Figure 2.104.

• A two-dimensional array keeps track of which ports' responses have already been sent and removes them from the Find First Tree.

• This array is cleared once all active ports in a bank have been serviced.

• All the requests from ports whose tag matches the one selected by the Find First Tree are merged, and their responses are sent as one batch.

• The valid signal and the requested data are sent out to the threads that initiated the requests.

• The serviced virtual port is then marked as serviced so that it is removed from the Find First Tree. The summarized steps of this algorithm are shown in Figure 2.105.

Figure 2.104: Find First Tree.

Figure 2.105: Pseudo-code of response merging algorithm.
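
Figure 2.105 gives the pseudo-code of the merging algorithm; the following C++ sketch is a rough, simplified rendering of the same idea, with assumed data-structure names and for illustration only: pick the tag of the first unserviced valid port, merge every port response carrying that tag into one batch, and mark those ports as serviced.

    #include <cstdint>
    #include <vector>

    struct PortResponse { bool valid; uint32_t tag; uint32_t data; };

    // One merging step: service all ports whose tag equals the tag of the first
    // unserviced valid port ("find first"), and mark them as serviced.
    std::vector<PortResponse> merge_step(const std::vector<PortResponse>& ports,
                                         std::vector<bool>& serviced) {
        std::vector<PortResponse> batch;
        // Find-first: the first valid port that has not been serviced yet.
        int first = -1;
        for (std::size_t i = 0; i < ports.size(); ++i)
            if (ports[i].valid && !serviced[i]) { first = static_cast<int>(i); break; }
        if (first < 0) return batch;  // nothing left to service

        uint32_t selected_tag = ports[first].tag;
        for (std::size_t i = 0; i < ports.size(); ++i) {
            if (ports[i].valid && !serviced[i] && ports[i].tag == selected_tag) {
                batch.push_back(ports[i]);  // merged into one coherent batch
                serviced[i] = true;         // removed from the find-first tree
            }
        }
        return batch;
    }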

2.6.2.6 Cache Arbiter


In a crowded system such as a GPU, many requests are sent to the cache from different threads. To ensure smooth and coordinated access to the cache resources, these requests pass through a module called the cache arbiter. This module handles the incoming requests using a stream arbiter: the requests are gathered into groups known as lanes, and the stream arbiter selects one request from each lane, grants it access to the cache resources, then receives the response from the cache and sends it back to the GPU. This resolves conflicts when multiple requests need to use the cache. Initially, the different threads issue requests to the cache whenever it is ready to receive them, with each lane containing a certain number of requests. Each request is associated with metadata that carries information such as the valid signal, the read/write flag, the tag, the address, and the byte enable; these signals are merged and then transferred to the stream arbiter. The stream arbiter grants a request and sends it to the cache, granting the same request index in every lane; for example, when it grants request 2, that request is granted across all lanes. The cache then sends the response back to the cache arbiter. This response contains the valid signal, thread mask, requested data, and tag. The response then goes through a stream demultiplexer, after which there is a response for each requestor in every lane containing the valid signal, thread mask, requested data, and tag.

Figure 2.106: Cache Arbiter.

The stream arbiter uses a round-robin policy as the arbitration algorithm to schedule the incoming requests. The round-robin scheduling policy is chosen for the following reasons:

• The low latency needed to select and forward the requests to the cache.

• The fairness in sharing the resources among the different requestors.

• The relatively low worst-case delay.

• The low utilization of hardware resources.

To ensure that arbitration is performed at a speed close to the system clock speed, the round-robin scheduler checks the requests in parallel to determine the first valid request to be scheduled. This scan is done using a parallel prefix OR operation to determine the position of the first valid request, and a priority encoder is then used to obtain the index of this location. To guarantee that all requests are eventually served, the scheduler gives the chosen request the lowest priority after it has been scheduled, so that later requests get a chance to be scheduled. Figure 2.108 shows the RTL simulation of the round-robin scheduler accepting 5 requests, all of which are valid; every clock cycle the scheduler grants a different request and produces the index of the granted request alongside the one-hot encoded vector.

Figure 2.107: Logical structure of round-robin arbiter.

Figure 2.108: RTL simulation of Round-Robin scheduler.
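
The selection logic can be illustrated with a small C++ reference model, for example as the golden model of a Verilator testbench (an illustrative sketch, not the RTL): starting from the position after the last grant, it picks the first pending request, which is equivalent to the rotate + parallel prefix OR + priority-encoder structure of Figure 2.107.

    #include <cstdint>
    #include <cstdio>

    // Round-robin arbiter model for up to 32 requestors.
    // 'requests' is a bit vector of pending requests, 'last' the last granted index.
    // Returns the granted index, or -1 if no request is pending.
    int round_robin_grant(uint32_t requests, unsigned num, int last) {
        for (unsigned off = 1; off <= num; ++off) {
            unsigned idx = (last + off) % num;   // rotate so 'last' has lowest priority
            if (requests & (1u << idx)) return static_cast<int>(idx);
        }
        return -1;
    }

    int main() {
        uint32_t req = 0b11111;  // five requestors, all valid (as in Figure 2.108)
        int last = -1;
        for (int cycle = 0; cycle < 5; ++cycle) {
            last = round_robin_grant(req, 5, last);
            std::printf("cycle %d -> grant %d (one-hot 0x%02x)\n", cycle, last, 1u << last);
        }
        return 0;
    }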


Chapter 3

AXI-Controller

3.1 Background
To connect the GPU with the main memory, an interface is needed to standardize the transactions between the GPU and the memory and to abstract the resources of the GPU, so that when it is used in a heterogeneous system with a host processor, the processor can communicate easily with the GPU through a shared memory visible to both architectures. The Advanced eXtensible Interface (AXI) is used to connect the GPU to the main memory. The AXI protocol is part of the Advanced Microcontroller Bus Architecture (AMBA), an open-standard, on-chip interconnect specification used for the connection and management of functional units within a system-on-chip (SoC). The usage of AMBA has gone far beyond microcontroller devices; it is nowadays utilized in a wide range of ASIC and SoC applications. AMBA was introduced by ARM in 1996. The first AMBA buses were the Advanced System Bus (ASB) and the Advanced Peripheral Bus (APB). In its second version, AMBA 2.0 (1999), ARM added the AMBA High-performance Bus (AHB), a single clock-edge protocol. In 2003, ARM introduced the third generation, AMBA 3.0, with the Advanced eXtensible Interface (AXI) to achieve even higher-performance interconnects. In 2010, ARM introduced the fourth generation, AMBA 4.0, which includes the second version of AXI, AXI4. AXI4 is used to wrap many IPs because it offers three important properties:

1. Productivity: By standardizing the AXI interface, developers only need to learn a single protocol to program a wide range of IPs without any concern about the underlying hardware.

2. Flexibility: The AXI protocol provides three types of interfaces suitable for different applications:

(a) AXI4: Used for memory-mapped interfaces; it allows bursts of up to 256 data transfers with just one address.
(b) AXI4-Lite: Similar to the AXI4 interface but with lower resource utilization.
(c) AXI4-Stream: Used for high-speed streaming data; it can burst an unlimited amount of data.

3. Availability: Many IP vendors provide support for the AXI interface.


3.2 AXI Adapter


This module is used to interface between two different AXI-based systems or components. The AXI adapter translates the signals and protocols between the two systems, allowing them to communicate and exchange data. For example, it interfaces the cluster of cores with the main memory, as shown in Figure 3.1. The AXI adapter allows the top module that contains the GPU cores to communicate with the main memory via five independent channels:

• Write Address

• Write Data

• Write Response

• Read Address

• Read Data

The AXI protocol uses a two-way handshake mechanism: the information source asserts the VALID signal to indicate that valid address, data, or control information is available on the channel, and the destination asserts the READY signal to indicate that it can accept the information.

Figure 3.1: Overview of the AXI-Interface



Signal             Description
mem_req_valid      Validity of the core request
mem_req_rw         Whether the request is a read or a write
mem_req_byteen     Which bytes within the word carry valid data
mem_req_addr       Memory address to read from / write to
mem_req_tag        Memory tag of the request
mem_req_data       Data to be written to the memory

Table 3.1: Core memory request signals.

3.2.1 Read Operation


Table 3.1 shows the signals generated by the core when it wants to send a request to the memory. When a read request needs to be sent, the following steps are performed:

1. mem_req_addr is set to the address of the block that the core needs to read from, and mem_req_tag is set to the tag ID.

2. mem_req_valid is asserted to indicate that all control signals are available at the input.

3. mem_req_rw is set to 0 to indicate a read.

After the request signals are configured, the AXI adapter converts them into signals compatible with the AXI4 protocol and sends the request to the slave side to fetch the required data, as shown in Figure 3.2.

Figure 3.2: The read process in AXI protocol



3.2.2 Write Operation


The process of sending a write request to the memory through the AXI adapter is as follows:

1. mem_req_addr is set to the address of the block that the core needs to write to, and mem_req_tag is set to the tag ID.

2. mem_req_data is set to the write data.

3. The strobe (indicating which byte lanes are valid) is set with mem_req_byteen.

4. mem_req_valid is asserted to indicate that all control and data signals are available at the input.

5. mem_req_rw is set to 1 to indicate a write.

Similar to the read operation discussed above, the AXI adapter converts the request signals into signals compatible with the AXI4 protocol and then sends them to the memory.

Figure 3.3: The write process in AXI protocol


Chapter 4

Software Testing Tools Flow

4.1 Introduction
The project's main aim is not only to develop a high-performance GPU but also to provide an easy-to-use software framework to test and execute real-world applications, instead of relying only on simulators such as Questa Sim, Vivado, and ModelSim to perform RTL simulations. Such a framework accelerates the testing process and makes it easier to integrate new modules into the device and test them without concern for the underlying hardware. The tools used to implement the software testing framework are discussed in the upcoming sections. The testing environment in this thesis was not built from scratch; it is a modified version of the testing environment used in the Vortex project [33].

4.2 Ramulator
To evaluate the performance of the implemented GPU on the FPGA, an important parameter must be taken into consideration: the latency of the DRAM (Dynamic Random Access Memory). Many projects ignore this aspect and assume that data will always be available in the cache; however, in real-world applications the program binaries are first located in the main memory and the processor then fetches them into the cache. The latency of the main memory is very high compared to the cache and often dominates the execution time of many applications. The main challenge in simulating a DRAM-based main memory is that the technology used to manufacture DRAM cells is not compatible with the CMOS (Complementary Metal Oxide Semiconductor) technology used to manufacture the FPGA logic, so the real latency of the DRAM is very hard to emulate on the FPGA. To overcome this problem, a DRAM simulator known as Ramulator is used. The Ramulator framework, developed by researchers at ETH Zurich [34], is an efficient and accurate DRAM simulator designed to provide cycle-accurate simulations of DRAM systems while remaining easily extendable to support newer DRAM standards. Ramulator utilizes a generic template to simulate DRAM systems; this template can be specialized to capture the specific characteristics of different DRAM protocols, which enables Ramulator to support various standards such as DDR3/4, LPDDR3/4, WIO1/2, GDDR5, HBM, and even academic proposals like TL-DRAM, AL-DRAM, and RowClone.


Figure 4.1: Example of generic code used by Ramulator.

4.3 Accurate-cycle simulation using Verilator


Verilator is an open-source tool that converts Verilog or SystemVerilog into multi-threaded, high-performance C++ code. The code produced by Verilator is cycle-accurate and can simulate the behavior of the hardware at every clock edge. A fast and accurate C++ testbench can be implemented with Verilator, with the advantage of being easily integrated with other C++ simulation tools so that more test cases can be evaluated; for example, by generating C++ testbenches, the Ramulator library discussed in the previous section can easily be linked with the converted C++ code, enabling evaluation in a realistic environment that cannot be provided by other RTL simulators. Although Verilator is a cycle-based simulator that does not evaluate the time between clock edges, and therefore cannot observe intra-period glitches, it compensates for this limitation by being very strict when processing RTL code: it does not accept non-synthesizable constructs or warnings that other simulators tolerate, which forces the developer to write high-quality code free of timing glitches.
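
A minimal Verilator-style C++ testbench is sketched below. The top-level model class name Vcache_bank and the port names clk and reset are assumptions for illustration (Verilator derives the class name and ports from the actual top module); the clocking and eval() structure follows the standard Verilator pattern.

    #include <verilated.h>
    #include <verilated_vcd_c.h>
    #include "Vcache_bank.h"   // generated by Verilator from the (assumed) top module

    int main(int argc, char** argv) {
        Verilated::commandArgs(argc, argv);
        Verilated::traceEverOn(true);

        Vcache_bank* dut = new Vcache_bank;
        VerilatedVcdC* vcd = new VerilatedVcdC;
        dut->trace(vcd, 99);
        vcd->open("cache_bank.vcd");

        vluint64_t time = 0;
        dut->reset = 1;                       // assumed active-high reset port

        for (int cycle = 0; cycle < 1000; ++cycle) {
            if (cycle == 10) dut->reset = 0;  // release reset after a few cycles

            dut->clk = 0; dut->eval(); vcd->dump(time++);
            dut->clk = 1; dut->eval(); vcd->dump(time++);
            // Requests would be driven and responses checked here, e.g. against
            // a C++ golden model of the design under test.
        }

        vcd->close();
        delete vcd;
        delete dut;
        return 0;
    }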

4.4 OpenCL Platform


The main programming framework supported in this thesis to program the GPU is OpenCL (Open Computing Language). In recent years OpenCL has become the main framework for executing binaries that target heterogeneous architectures, especially systems that combine CPUs and GPUs. The OpenCL framework comprises an application programming interface (API), implemented through a runtime library, and a language known as OpenCL C; this language is portable and supports a parallel programming model. The typical OpenCL model consists mainly of two sides:

• Host processor: The main platform that executes the application binaries and the runtime library.

• Accelerators: Devices that execute code known as kernels, written in OpenCL C; examples of such devices are GPUs, FPGAs, and DSP chips.

4.5 Models of the OpenCL


To facilitate the process of writing efficient and scalable parallel code, OpenCL divides the system architecture into three hierarchies: the Platform, Execution, and Memory models. These three hierarchies interact with each other and define the operation of OpenCL.

4.5.1 Platform Model


The platform model describes the logical connections between the computing devices in a heterogeneous system. The system is divided mainly into a host processor and multiple accelerator devices; each accelerator is divided into multiple compute units, and each compute unit consists of multiple processing elements, as shown in Figure 4.2. This division is a logical abstraction, and how the processing elements are physically arranged in the computation core depends on the device type.

Figure 4.2: OpenCL Platform Model.

4.5.2 Execution Model


The execution model is responsible for logically dividing the execution of a program into two parts:

• Kernels: Parallel blocks of code that run on accelerator devices such as GPUs, CPUs, or FPGAs.

• OpenCL API: The OpenCL API and the remaining application code run on the host device (CPU).

The communication between the host device and the accelerators, including the memory transactions, is done through a special data structure known as the command queue. The command queue connects the CPU with each accelerator device, and through it the CPU sends the instructions to be executed to each device. The CPU invokes the different devices through kernels, which are arranged in an index space known as the NDRange, shown in Figure 4.3. Each node resembles a work item that is executed on a certain processing element, and these work items are grouped into a work group that is executed on a whole compute unit. In this project, the processing element is the SIMD core, while the compute unit is the top GPU module that contains a group of SIMD cores.

Figure 4.3: The arrangement of kernels NDRange space.

4.5.3 Memory Model


OpenCL divides the memory hierarchy into four parts, as follows:

• Host Memory: This memory is available only to the host device.

• Local Memory: This memory is shared by all the processing elements within a single compute unit.

• Global Memory: This memory is available to all the compute units of the device.

• Private Memory: This memory is available only to a single processing element.

Figure 4.4: The hierarchical structure of OpenCL memory.

4.6 The Flow of OpenCL Host Application


To execute an OpenCL program, several steps have to be performed to ensure its proper execution. These steps include defining the kernel tasks, setting up memory buffers, managing the data transfer between the host and the accelerator, and launching the kernels that will be executed in parallel on the accelerator device. Figure 4.5 shows the steps of launching and executing OpenCL kernels:

1. The first step is to identify and select the OpenCL platform that will execute the program kernels, as the OpenCL standard supports the existence of multiple platforms from different vendors in the same system. The host CPU uses an API call to identify the different platforms supported by the system.

2. After selecting the platform, the host sends a query to the chosen platform to identify the devices it supports; for example, the POCL platform supports different types of architectures such as x86, RISC-V, and Nvidia GPUs.

3. Prepare the environment for the selected device by allocating the data structures used for executing OpenCL programs, such as command queues, memory buffers, and kernels. These structures cannot be shared by more than one process.

4. Create a command queue to launch the kernels on the target device; the command queue is exclusive to each device.

5. Initialize one-dimensional data structures known as buffers, used to pass input arguments to the kernels and to fetch the output results.

6. Generate and build a program entity. An OpenCL program entity is a group of kernels, which are special functions written in the OpenCL C language and uniquely identified by the kernel attribute. There are two ways to create these kernels:

(a) Compile the OpenCL C code using the vendor's kernel compiler at runtime.
(b) Use a pre-compiled binary image generated before execution.

7. Generate a kernel object, which is a container that wraps a kernel function and provides a way to pass parameters to it.

8. Pass the memory buffers and data structures to the kernel function using the clSetKernelArg function.

9. Use dedicated commands to move the data from the host memory to the accelerator memory through the command queue.

10. Finally, execute the kernel using the clEnqueueNDRangeKernel function.

Figure 4.5: The program flow of OpenCL application.



4.7 POCL
To enable the execution of OpenCL kernels on the GPU, the OpenCL binaries must be compatible with the RISC-V architecture. The Portable Computing Language (POCL) is an open-source implementation of OpenCL that supports various architectures such as ARM, x86, and RISC-V, and it also provides support for users' custom-designed accelerators. The POCL compiler uses Clang as a front end to compile OpenCL code and the LLVM (Low Level Virtual Machine) compiler as a back end to transform the generated binaries so that they are compatible with the target device. The POCL runtime environment is built on three distinct layers:

• OpenCL Runtime API Calls: This layer is a middleware between the software applications and the actual OpenCL runtime environment; it offers the standard APIs that running applications can use to perform different operations.

• Device-Independent Common Interface: This layer provides an abstraction of the common functions that can be used by different devices, such as:

– Compiler driver
– File system abstraction
– Memory abstraction
– Threading and synchronization libraries

• Device-Specific Layer: This layer is responsible for implementing hardware-specific functions for particular target devices.

Figure 4.6: The process of compiling OpenCL applications to binaries that target RISC-V
system using POCL compiler.

4.8 GPU RTL Driver


The GPU RTL driver abstracts the underlying hardware of the GPU and provides the host processor with standard APIs to access the required resources. The RTL driver is a C++ library that encapsulates the hardware functionality by utilizing the C++ model of the Verilog RTL produced by the Verilator tool described in Section 4.3. By using the Verilator-converted code and abstracting it with the RTL driver, developers gain the ability to conduct accurate and flexible tests: this configuration allows writing C/C++ testbenches that emulate the hardware behavior and consequently enables testing across various applications without burdening the programmer with hardware details. Figure 4.7 shows the configuration of the RTL simulation. The host processor is the main CPU that executes the software applications and communicates with the GPU through the RTL driver. The OpenCL kernel that targets the GPU is compiled using the POCL compiler, and the produced binary image is dumped into the global memory, which is a DRAM memory accessible by both the host and the target device and is implemented using the Ramulator framework discussed in Section 4.2. Before execution begins, the data section in the global memory is initialized by the host CPU.

Figure 4.7: The overview of the configuration used to perform cycle-accurate simulation.

4.9 Testing Steps


This section describes the steps required for performing RTL testing. First, the kernel code that targets the GPU is compiled using POCL. Figure 4.8 shows an example of an OpenCL kernel; the function initialize_tasks is used at runtime to map the POCL kernel to the hardware resources. The function uses CSR instructions to obtain information about the device, such as:

• The number of GPU cores.

• The number of warps supported by the GPU.

• The number of threads within a warp.

After obtaining these parameters, the initialize_tasks function uses them to distribute the workload equally among the threads and warps within each processing core. For example, if a convolution operation is to be performed on an image, the image is divided into work groups equal in number to the GPU's cores, these work groups are divided further into work items equal in number to the warps supported by each core, and each work item contains parallel instructions corresponding to the number of threads within a warp. After compiling the OpenCL kernel and producing the binary image, the host CPU executes a sequence of code to communicate with the target device and initialize the global memory with the data it requires. Figure 4.9 shows the steps performed to launch the OpenCL application; the code follows the steps shown in Figure 4.5:

• Platform and Device Initialization: Using the clGetPlatformIDs and clGetDeviceIDs functions, the host CPU selects the available device to communicate with.

• Context Creation: Using the clCreateContext function, the host processor creates an OpenCL context for the selected device.

• Buffer Allocation: Using the clCreateBuffer function, the host processor creates three buffers to pass the input arguments and retrieve the output results of the kernel function.

• Program Initialization: Using the clCreateProgramWithBinary function, the host processor creates a program entity from the pre-compiled POCL kernel.

• Program Building: Using the clBuildProgram function, the host processor builds the program entity to be used by the target device.

• Kernel Creation: Using the clCreateKernel function, the host processor creates a kernel to be executed on the target device, identified by a unique name given by the parameter KERNEL_NAME.

• Setting Kernel Arguments: Using the clSetKernelArg function, the host processor sets the input and output buffers of the kernel.

• Command Queue Creation: Using the clCreateCommandQueue function, the host processor initializes a command queue through which it sends commands to the target device.

• Data Transfer: Using the clEnqueueWriteBuffer function, the host processor transfers the input data to the memory objects.

• Kernel Execution: Using the clEnqueueNDRangeKernel function, the host processor sends the compiled kernel to the target device for execution.

• Data Retrieval: Using the clEnqueueReadBuffer function, the host processor retrieves the output results computed by the target device and compares them with previously computed golden reference data to test the functionality of the target GPU.

Figure 4.8: OpenCL kernel.

Figure 4.9: Code for initializing the OpenCL applications.
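
For reference, a condensed C++ sketch of the host program outlined above (and shown in Figure 4.9) is given below. It follows the standard OpenCL C API; the kernel name "saxpy", the binary file name "kernel.pocl", and the buffer sizes are placeholders, and error checking is omitted for brevity.

    #include <CL/cl.h>
    #include <fstream>
    #include <iterator>
    #include <vector>
    #include <cstdio>

    // Load a pre-compiled POCL kernel binary from disk (file name is a placeholder).
    static std::vector<unsigned char> load_binary(const char* path) {
        std::ifstream f(path, std::ios::binary);
        return std::vector<unsigned char>(std::istreambuf_iterator<char>(f),
                                          std::istreambuf_iterator<char>());
    }

    int main() {
        const size_t N = 4096;
        std::vector<float> x(N, 1.0f), y(N, 2.0f);
        float a = 0.5f;
        cl_int err;

        // Platform and device initialization.
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

        // Context and command queue creation.
        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

        // Buffer allocation for kernel inputs and outputs.
        cl_mem x_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  N * sizeof(float), nullptr, &err);
        cl_mem y_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N * sizeof(float), nullptr, &err);

        // Program initialization and building from the pre-compiled binary.
        std::vector<unsigned char> bin = load_binary("kernel.pocl");
        const unsigned char* bin_ptr = bin.data();
        size_t bin_size = bin.size();
        cl_program program = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                                       &bin_ptr, nullptr, &err);
        clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);

        // Kernel creation and argument setup.
        cl_kernel kernel = clCreateKernel(program, "saxpy", &err);
        clSetKernelArg(kernel, 0, sizeof(float),  &a);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &x_buf);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &y_buf);

        // Data transfer, kernel execution, and data retrieval.
        clEnqueueWriteBuffer(queue, x_buf, CL_TRUE, 0, N * sizeof(float), x.data(), 0, nullptr, nullptr);
        clEnqueueWriteBuffer(queue, y_buf, CL_TRUE, 0, N * sizeof(float), y.data(), 0, nullptr, nullptr);
        size_t global_size = N;
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(queue, y_buf, CL_TRUE, 0, N * sizeof(float), y.data(), 0, nullptr, nullptr);
        clFinish(queue);

        std::printf("y[0] = %f\n", y[0]);  // checked against a golden reference in practice
        return 0;
    }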


Chapter 5

Results and Evaluation

5.1 RTL Design Results


In this project, the target FPGA for the design is the XC7Z100-2FFG900I. Table 5.1 shows the FPGA resources, and Table 5.2 shows the resources utilized by different numbers of cores. The design was implemented with an operating frequency of 142 MHz; Figure 5.1 shows that the design meets timing with a positive setup slack of 0.499 ns and a positive hold slack of 0.049 ns.

DSPs 2020
BlockRAMs 755
LUTs 277400
IO 900

Table 5.1: FPGA resources.

Num of Cores LUTs (%) DSP (%) FF (%) LUTRAMs (%) Fmax MHz
1 9.73 1.19 2.72 6.9 142
2 19.58 2.38 5.61 13.80 142
4 39.12 4.75 11.14 27.61 142
6 58.87 7.13 16.68 41.41 142
8 77.34 9.50 22.18 55.22 142
10 96.94 11.88 27.72 69.02 142

Table 5.2: Hardware synthesis results for different numbers of cores.

Figure 5.1: Timing Constraints.


Figure 5.2 shows a power analysis for one core based on Vivado's default power estimation; the total power consumed is 0.711 W, consisting of 69% dynamic power and 31% static power. Figure 5.3 shows the power consumption of the GPU with eight cores; the total power consumed is 5.82 W, consisting of 95% dynamic power and 5% static power.

Figure 5.2: Power consumption of one core.

Figure 5.3: Power consumption of eight cores.



Figure 5.4: Layout of GPGPU with 8 cores.

Figure 5.5: Hardware utilization of each module.



5.2 Benchmarking
5.2.1 Introduction
In this section the performance of the GPGPU implemented in this thesis is evaluated using different benchmarks, which are classified into two main categories:

• Compute-bound benchmarks such as sfilter, sgemm, and vecadd, which are subsets of the Rodinia benchmarks [35].

• Memory-bound benchmarks such as the gaussian, sort, and saxpy benchmarks.

The performance metric used to evaluate the performance and efficiency of the GPGPU is Instructions Per Cycle (IPC). IPC measures the average number of instructions executed in one clock cycle. It is used as an evaluation metric for the following reasons:

• Efficiency measurement: IPC reflects processor utilization and instruction efficiency. The higher the IPC value, the better the hardware resources are utilized, and the more instructions are executed per clock period, indicating better instruction execution efficiency.

• Performance analysis: IPC is a good tool for analyzing the performance of a program and identifying bottlenecks, as a lower IPC indicates a potential bottleneck in the hardware architecture.

• Comparative analysis: IPC can be used to compare different architectures running at different frequencies.
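
In terms of a formula, the metric is simply the ratio of executed instructions to elapsed clock cycles,

    \mathrm{IPC} = \frac{N_{\mathrm{instructions}}}{N_{\mathrm{cycles}}}

where N_instructions is the number of instructions committed by the cores during the benchmark and N_cycles is the number of clock cycles the benchmark takes to complete.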

5.2.2 Sort Benchmark


In this section the GPGPU is evaluated using different sorting algorithms, and the effects of the number of threads, warps, and cores, as well as the workload size, on the performance of the GPGPU are examined.

5.2.2.1 Merge Sort


The first sorting algorithm tested on the GPGPU is merge sort, one of the most popular sorting algorithms in computer science due to its stability and consistency. The chart in Figure 5.6 displays the Instructions Per Cycle (IPC) for different core counts as the number of threads and warps changes, while the chart in Figure 5.7 displays the IPC for different core counts as the number of elements to be sorted changes.

Figure 5.6: IPC results of merge sort algorithm for different GPU configurations with a constant
workload of 32 elements.

Figure 5.7: IPC results of merge sort algorithm for different workload sizes.

Figure 5.8 shows how the execution time of the program, in terms of the number of clock cycles, changes with varying workloads for different core counts. It can be concluded that with a higher core count the workload is distributed among more cores, so the overall execution time does not increase as rapidly as it does with a lower core count.

Figure 5.8: The count of clock cycles needed to sort various arrays with diverse workloads.

5.2.2.2 P-Sort

In this section another sorting algorithm, known as Parallel Counting Sort, is evaluated on the GPGPU; it is used to sort arrays of floating-point numbers of varying sizes.

Figure 5.9: The execution of psort algorithm on the GPGPU with 1 core and workload of 12
elements.

Figure 5.10: IPC results of p-sort algorithm for different GPU configurations with a constant
workload of 32 elements.

Figure 5.11: IPC results of p-sort algorithm for different workload sizes.

Figure 5.12: The count of clock cycles needed to sort various floating-point arrays with diverse
workloads.

5.2.3 Convolution Benchmark


This section examines the performance of the GPGPU when performing convolution
using various core configurations.

Figure 5.13: IPC results of Convolution algorithm for different GPU configurations with a
constant workload of 16x16 input Matrix and 3x3 filter.

Figure 5.14: Execution time in microseconds of Convolution algorithm for different GPU con-
figurations with a constant workload of 16x16 input Matrix and 3x3 filter.

Figure 5.15: IPC results of Convolution algorithm for different core counts configurations with
different workload sizes and constant 3x3 filter.

Figure 5.16: Execution time in milliseconds of Convolution algorithm for different core counts
configurations with different workload sizes and constant 3x3 filter.

5.2.4 Sgemm Benchmark


Sgemm, which stands for Single-precision General Matrix Multiply, is a widely used benchmark from the Rodinia suite, used to measure the performance of different computing platforms such as CPUs, GPUs, and TPUs in performing floating-point matrix multiplication operations.

Figure 5.17: Sgemm benchmark code.

Figure 5.18: IPC results of Sgemm benchmark for different GPU configurations with a constant
workload of floating-point 64x64 matrix .

Figure 5.19: Execution time in milliseconds of Sgemm benchmark for different GPU configu-
rations with a constant workload of floating-point 64x64 matrix .

Figure 5.20: IPC results of Sgemm benchmark for different core counts with different matrix
sizes.

Figure 5.21: Execution time in milliseconds of Sgemm benchmark for different core counts with
different matrix sizes.

5.2.5 Saxpy Benchmark


Saxpy, which stands for Single-Precision A·X Plus Y, is a well-known benchmark from the Rodinia suite used to measure GPU computational performance in scenarios that involve vector operations. Saxpy simply multiplies the vector X by a scalar value A and then adds the result to another vector Y. This operation evaluates how efficiently a GPU can handle parallel vector operations, especially when they are performed on floating-point numbers.
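
Figure 5.22 contains the actual benchmark code; as a minimal illustration of the operation itself, a saxpy kernel boils down to the following, shown here as an OpenCL C source string embedded in C++ (the kernel and argument names are chosen for illustration):

    // Minimal saxpy kernel: each work item computes one element of y = a*x + y.
    const char* saxpy_kernel_src = R"CLC(
    __kernel void saxpy(const float a,
                        __global const float* x,
                        __global float* y) {
        int i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }
    )CLC";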

Figure 5.22: Saxpy benchmark code.



Figure 5.23: IPC results of Saxpy benchmark for different GPU configurations with a constant
workload of floating-point array with 4096 elements.

Figure 5.24: IPC results of Saxpy benchmark for different core counts with different array sizes.

Figure 5.25: Execution time in milliseconds of Saxpy benchmark for different core counts with
different array sizes.

5.2.6 NearestNeighbor Benchmark


NearestNeighbor is one of Rodinia's benchmarks; it uses the k-nearest neighbors (K-NN) algorithm to find the K hurricanes nearest to a specific location. The latitude and longitude of the desired location are entered, the K-NN algorithm calculates the Euclidean distance between the target coordinates and the coordinates of the different hurricanes stored in the data records, and it then determines the K hurricanes with the minimum Euclidean distance from the target location.

Figure 5.26: The K-NN kernel code.

Figure 5.27: The execution of the K-NN algorithm to find the 4-nearest hurricanes to a location
of a longitude equal to 60 degrees and latitude of 20 degrees.

Figure 5.28: IPC results of executing K-NN algorithm to find the 5 nearest neighbors for
different cores counts.

Figure 5.29: Execution time in milliseconds of K-NN algorithm to find the 5 nearest neighbors
for different cores counts.

5.2.7 Guassian Benchmark


Gaussian elimination, also known as Gaussian reduction, is one of the most widely used techniques in linear algebra for solving systems of numerical equations by converting them into matrices. It is a very expensive algorithm in terms of computation, with a time complexity of O(n³). The chart in Figure 5.31 displays the IPC for different core counts as the number of threads and warps changes, at a constant workload of a 4x4 matrix.

Figure 5.30: The execution of Gaussian elimination kernel on the GPU .

Figure 5.31: IPC results of Gaussian elimination algorithm for different GPU configurations
with a constant workload of 4x4 matrix.

5.2.8 Comparison with other architectures


The execution time comparison of the K-NN algorithm to find the 5 nearest neighbors,
measured in seconds, is illustrated in Figure 5.32. The comparison is conducted between
the GPGPU with 8 cores, where each core is configured with 4 warps and 8 threads, and
the Intel Xeon E5-1640 CPU.

Figure 5.32: Comparison between the execution time of K-NN algorithm on GPGPU with 8
cores and Intel Xeon E5-1640.

The comparison of execution times for the convolution operation in milliseconds is illustrated in Figure 5.33. The comparison is conducted between the GPGPU with 8 cores and with 4 cores, where each core is configured with 4 warps and 8 threads, and the Intel Xeon E5-1640 CPU is also included in the comparison. The convolution operation uses a 3x3 Sobel filter with various matrix sizes, ranging up to 1024x1024.

Figure 5.33: Comparison between the execution time of convolution on GPGPU with 8 cores,
4 cores, and Intel Xeon E5-1640.

Figure 5.34: GPGPU speedup over Intel Xeon E5-1650.



Figure 5.35: Comparison between the execution time of convolution on GPGPU with 8 cores,
4 cores, and TPU v2.

Figure 5.36: GPGPU speedup over TPU v2.



The comparison of execution times for vector addition in milliseconds is illustrated in Figure 5.37. The comparison is conducted between the GPGPU with 8 cores, where each core is configured with 4 warps and 8 threads, and the Tesla-4 GPU. The Tesla-4 achieves higher performance than the GPGPU implemented in this thesis, but at a higher cost and with higher power consumption.

Figure 5.37: Comparison between the execution time of convolution on GPGPU with 8 cores,
and GPU Tesla-4.
Chapter 6

Conclusion and Future Work

6.1 Summary of the Thesis


Chapter 2 introduces the microarchitecture of the GPGPU implemented in this thesis. It begins by describing the hardware and software extensions needed to extend the RISC-V processor to support the Single Instruction Multiple Threads (SIMT) execution paradigm. The chapter then proceeds with the structure of the GPGPU cores; each core consists mainly of two important parts, the main core pipeline and the cache subsystem. The pipeline is similar to the ordinary 5-stage RISC-V pipeline with some extensions, while the second part of the chapter describes the cache system and how it is modified to handle the high bandwidth required to support multiple execution threads. The pipeline stages are as follows:
• Fetch Stage: This stage fetches instructions from the instruction cache using the current program counter (PC). It is provided with a wavefront scheduler that schedules the different warps in a round-robin fashion, and it is also equipped with an instruction buffer to keep track of the instructions fetched by the different warps.

• Decode Stage: This stage takes the 32-bit instructions fetched from the instruction cache and extracts their fields to determine the instruction type (I-type, R-type, or S-type) as well as the operands, such as source register 1, source register 2, and the destination register, and then determines the suitable processing element to execute the instruction.

• Issue Stage: This stage checks the operands extracted by the decode stage to verify whether all the source operands of the instruction are ready. If the source operands are ready and there are no data hazards, the operand values are provided by the banked register file; otherwise the issue stage chooses the next available instruction and stores the unready instruction in reservation stations until all its operands are available.

• Execute Stage: This stage contains different processing elements to execute the types of operations determined in the decode stage:

1. ALU Unit: This unit executes arithmetic and logical operations and also handles the different types of branch instructions.
2. FPU Unit: This unit executes operations on floating-point numbers, such as floating-point addition, multiplication, and square root.
3. LSU Unit: This unit handles the write and read requests to the external memory.
4. CSR Unit: This unit reads and writes special registers that contain information about the status and configuration of the system, such as the number of threads, warps, and cores.
5. Conv Unit: This unit accelerates the convolution operation.

• Commit Stage: This stage is responsible for updating the register file after the current instruction finishes.

Chapter 3 describes the structure of the AXI adapter that connects the processing cores with the external memory.

Chapter 4 describes the software stack of the GPGPU. The GPGPU implemented in this thesis supports OpenCL as the main programming language; OpenCL code is compiled using the POCL compiler to generate RISC-V compatible binaries, and the host processor then dumps these binaries to the GPGPU through a global memory visible to both architectures. The chapter also describes the GPGPU RTL driver that abstracts the microarchitecture of the GPGPU from the host processor, so that an average programmer can write code targeting the GPGPU without knowing much about its architecture.

Chapter 5 evaluates the area and the performance of the GPGPU. The GPGPU can support up to 8 cores, and each core can be configured with up to 4 warps and 8 threads, which means that the maximum number of supported threads is 256. The GPGPU also supports up to a 16k instruction cache and a 16k data cache with an optional level-2 cache. The rest of the chapter evaluates the performance of the GPGPU with different core configurations using the Rodinia benchmark suite, a well-known set of programs for evaluating the performance of heterogeneous systems [35].

Figure 6.1: The host flow of OpenCL application on the design environment.

6.2 Future Work


This section introduces ideas to be implemented in the future to increase the performance of the GPGPU and make it more suitable for real-world applications.

6.2.1 Cache Subsystem


The cache system is critical for the performance of the GPGPU, as it is a bottleneck for many applications. It is intended in the future to add a translation lookaside buffer (TLB) to the cache system; the TLB is a special cache that keeps track of recent virtual-to-physical address translations, and with its use the GPGPU can support virtual memory. Another improvement to the cache system is a hardware prefetcher, which can track a stream of data accesses to determine the next data elements that the program will need, thereby decreasing the memory overhead.

6.2.2 Graphics Pipeline


It is intended in the future to support a graphics pipeline for rendering different
applications the Pipeline will consist of:

• Geometry Stage: This stage receives the 3D object to be rendered in the form of
vertices and, using a vertex shader, transforms these vertices into a set of
triangles suitable for display on the screen.

• Rasterization Stage: This stage receives the triangles produced by the Geometry
stage and traverses their edges to map them to pixels on the screen (a small sketch
of this idea follows the list).

• Texture Stage: This stage is responsible for adding surface details to 3D objects.
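
As a concrete illustration of the rasterization step referenced above, the following
C sketch uses the classic edge-function test: a pixel is covered by a triangle when
it lies on the same side of all three directed edges. This is a textbook-style
example over a small integer grid, not a description of a planned hardware
implementation; the triangle coordinates and screen size are arbitrary.

    #include <stdio.h>

    /* Signed area test: positive when point (px,py) is to the left of
       the directed edge (x0,y0) -> (x1,y1).                             */
    static int edge(int x0, int y0, int x1, int y1, int px, int py)
    {
        return (x1 - x0) * (py - y0) - (y1 - y0) * (px - x0);
    }

    /* Rasterize one triangle over a small screen: a pixel is covered
       when all three edge functions are non-negative.                   */
    int main(void)
    {
        int ax = 1, ay = 1, bx = 12, by = 2, cx = 4, cy = 10;

        for (int y = 0; y < 12; y++) {
            for (int x = 0; x < 16; x++) {
                int w0 = edge(ax, ay, bx, by, x, y);
                int w1 = edge(bx, by, cx, cy, x, y);
                int w2 = edge(cx, cy, ax, ay, x, y);
                putchar((w0 >= 0 && w1 >= 0 && w2 >= 0) ? '#' : '.');
            }
            putchar('\n');
        }
        return 0;
    }

Running it prints a small ASCII image in which the covered pixels are marked with '#'.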
Appendix A

GPGPU Instruction Set Architecture (ISA)
Figure A.1: Instruction types supported by the GPGPU.


Figure A.2: Arithmetic Instructions supported by the GPGPU.

Figure A.3: Logical Instructions supported by the GPGPU.

Figure A.4: Shift Instructions supported by the GPGPU.

Figure A.5: Conditional Branch Instructions supported by the GPGPU.

Figure A.6: Unconditional Branch Instructions supported by the GPGPU.



Figure A.7: Data transfer Instructions supported by the GPGPU.

Figure A.8: Floating-point arithmetic Instructions supported by the GPGPU.

Figure A.9: Floating-point comparison Instructions supported by the GPGPU.

Figure A.10: Floating-point conversion Instructions supported by the GPGPU.

Figure A.11: Floating-point data transfer Instructions supported by the GPGPU.



Figure A.12: Control Status Register (CSR) Instructions supported by the GPGPU.

Figure A.13: Extended instructions to support multi-threading.


Bibliography

[1] Vijay Sharma, Manisha Pattanaik, and Balwinder Raj. ONOFIC approach: low
power high speed nanoscale VLSI circuits design. Mar. 2013. doi:
10.1080/00207217.2013.769186.
[2] John L. Hennessy and David A. Patterson. Computer Architecture, Sixth Edition: A
Quantitative Approach. 6th. San Francisco, CA, USA: Morgan Kaufmann Publishers
Inc., 2017. isbn: 0128119055.
[3] Fares Elsabbagh et al. “Vortex: OpenCL Compatible RISC-V GPGPU”. In: arXiv
preprint arXiv:2002.12151 (2020).
[4] url: https://circuitdigest.com/tutorial/what-is-fpga-introduction-and-programming-tools.
[5] Graham M. Holland. “Abstracting OpenCL for Multi-Application Workloads on
CPU-FPGA Clusters”. In: 2019. url: https://api.semanticscholar.org/CorpusID:215864989.
[6] Muhammed Al Kadi et al. “General-purpose computing with soft GPUs on FPGAs”.
In: ACM Transactions on Reconfigurable Technology and Systems 11 (Jan. 2018),
pp. 1–22. doi: 10.1145/3173548.
[7] Giovanni Todaro et al. “Enhanced Soft GPU Architecture for FPGAs”. In: 2023
18th Conference on Ph.D Research in Microelectronics and Electronics (PRIME).
2023, pp. 177–180. doi: 10.1109/PRIME58259.2023.10161749.
[8] Xu Liu et al. “A hybrid GPU-FPGA based design methodology for enhancing ma-
chine learning applications performance”. In: Journal of Ambient Intelligence and
Humanized Computing 11 (2020), pp. 2309–2323.
[9] Zeke Wang et al. “FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs”.
In: 2022 USENIX Annual Technical Conference (USENIX ATC 22). Carlsbad,
CA: USENIX Association, July 2022, pp. 967–986. isbn: 978-1-939133-29-25. url:
https://www.usenix.org/conference/atc22/presentation/wang-zeke.
[10] Jeff Bush et al. “Nyami: A synthesizable GPU architectural model for general-
purpose and graphics-specific workloads”. In: ISPASS 2015 - IEEE International
Symposium on Performance Analysis of Systems and Software (Apr. 2015), pp. 173–
182. doi: 10.1109/ISPASS.2015.7095803.
[11] Ruobing Han et al. CuPBoP: CUDA for Parallelized and Broad-range Processors.
June 2022. doi: 10.48550/arXiv.2206.07896.
[12] Ruobing Han et al. COX: CUDA on X86 by Exposing Warp-Level Functions to
CPUs. Dec. 2021.


[13] Blaise Tine et al. Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics.
Oct. 2021.
[14] Hyeonguk Jang et al. “Developing a Multicore Platform Utilizing Open RISC-V
Cores”. In: IEEE Access PP (Aug. 2021), pp. 1–1. doi: 10.1109/ACCESS.2021.
3108475.
[15] Matheus Cavalcante et al. “Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V
Vector Processor With Multiprecision Floating-Point Support in 22-nm FD-SOI”.
In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems PP (Dec.
2019), pp. 1–14. doi: 10.1109/TVLSI.2019.2950087.
[16] Graham M. Holland. “Abstracting OpenCL for Multi-Application Workloads on
CPU-FPGA Clusters”. In: 2019. url: https://api.semanticscholar.org/CorpusID:215864989.
[17] Topi Leppänen et al. “Efficient OpenCL system integration of non-blocking FPGA
accelerators”. In: Microprocessors and Microsystems 97 (Jan. 2023), p. 104772. doi:
10.1016/j.micpro.2023.104772.
[18] Pekka Jääskeläinen et al. “pocl: A Performance-Portable OpenCL Implementation”.
In: International Journal of Parallel Programming (Aug. 2014). doi: 10 . 1007 /
s10766-014-0320-y.
[19] Topi Leppänen et al. “Unified OpenCL Integration Methodology for FPGA De-
signs”. In: Oct. 2021. doi: 10.1109/NorCAS53631.2021.9599861.
[20] url: https://opae.github.io/0.13.0/docs/fpga_api/quick_start/readme.html.
[21] May 2019. url: https://educationecosystem.com/blog/enabling-fpgas-software-developers/.
[22] url: https://www.allaboutcircuits.com/technical-articles/multiplication-examples-using-the-fixed-point-representation/.
[23] Shadrokh Samavi. “Basic Multiplication Schemes”. In: Jan. 2014.
[24] Hassan Albalawi, Yuanning Li, and Xin Li. “Computer-Aided Design of Machine
Learning Algorithm: Training Fixed-Point Classifier for On-Chip Low-Power Imple-
mentation”. In: June 2014. doi: 10.1145/2593069.2593110.
[25] “IEEE Standard for Floating-Point Arithmetic”. In: IEEE Std 754-2019 (Revision
of IEEE 754-2008) (2019), pp. 1–84. doi: 10.1109/IEEESTD.2019.8766229.
[26] D.A. Patterson and J.L. Hennessy. Computer Organization and Design MIPS Edi-
tion: The Hardware/Software Interface. ISSN. Elsevier Science, 2013. isbn: 9780124078864.
url: https://books.google.com.eg/books?id=EVhgAAAAQBAJ.
[27] Alvaro Vazquez and Florent Dinechin. “Efficient implementation of parallel BCD
multiplication in LUT-6 FPGAs”. In: Dec. 2010, pp. 126–133. doi: 10.1109/FPT.
2010.5681767.
[28] url: https://www.telesens.co/2018/07/30/systolic-architectures/.
[29] url: https://semiengineering.com/dram-scaling-challenges-grow/.
[30] Saba Ahmadian et al. “ETICA: Efficient Two-Level I/O Caching Architecture for
Virtualized Platforms”. In: IEEE Transactions on Parallel and Distributed Systems
PP (Mar. 2021), pp. 1–1. doi: 10.1109/TPDS.2021.3066308.

[31] Onur Mutlu et al. “A modern primer on processing in memory”. In: Emerging
Computing: From Devices to Systems: Looking Beyond Moore and Von Neumann.
Springer, 2022, pp. 171–243.
[32] Mikhail Asiatici and Paolo Ienne. “Stop crying over your cache miss rate: Handling
efficiently thousands of outstanding misses in fpgas”. In: Proceedings of the 2019
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2019,
pp. 310–319.
[33] Tine Blaise et al. “Bringing OpenCL to Commodity RISC-V CPUs”. In: 2021 Workshop
on RISC-V for Computer Architecture Research (CARRV). 2021.
[34] Haocong Luo et al. Ramulator 2.0: A Modern, Modular, and Extensible DRAM
Simulator. Aug. 2023.
[35] Shuai Che et al. “Rodinia: A benchmark suite for heterogeneous computing”. In:
Oct. 2009, pp. 44–54. doi: 10.1109/IISWC.2009.5306797.
