Next Article in Journal
Robust Adaptive Control of a Coaxial-Ducted-Fan Aircraft with Uncertainty Model
Previous Article in Journal
Decentralized Control Framework for Optimal Platoon Spacing and Energy Efficiency
Previous Article in Special Issue
Field-Programmable Gate Array-Based True Random Number Generator Using Capacitive Oscillators
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA

Engineering Research Center of Network Management Technology for High Speed Railway of Ministry of Education, School of Computer Science and Technology, Beijing JiaoTong University, Beijing 100044, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(1), 168; https://doi.org/10.3390/electronics14010168
Submission received: 26 November 2024 / Revised: 27 December 2024 / Accepted: 28 December 2024 / Published: 3 January 2025
(This article belongs to the Special Issue FPGA-Based Reconfigurable Embedded Systems)

Abstract

:
FPGA-based convolutional accelerators have been widely used in image recognition scenarios. Many convolutional accelerators utilize the systolic array structure to enhance parallelism. Developing a method to efficiently estimate the utilized hardware resources of an FPGA for such a structure would be helpful in improving the speed of achieving an optimal systolic array structure with the best performance on a given FPGA device. Currently, most estimations of work have either focused on the evaluation of hardware resources for general structures or have not adequately assessed hardware resources specifically for systolic arrays. To reduce estimation latency, this paper proposes an Accurate and Fast Hardware Resources Estimation method (AFHRE) that addresses these shortcomings by analyzing the structure of systolic arrays and utilizing mathematical formulas to describe their characteristics. Experiments show that the DSP resource occupancy estimated by AFHRE is fully consistent with that by Vivado HLS. The error rates of other three types of hardware resources (BRAM, LUT, and FF) are within 11 % . In addition, the speed of resource estimation using this method is 40 X to 610 X faster than that of Vivado HLS. AFHRE can serve as a preprocessing step for Vivado HLS, achieving some optimal or sub-optimal solutions systolic array parameters much faster than original simulation manners of Vivado HLS.

1. Introduction

Convolutional Neural Networks (CNNs) have gained widespread application in many computer vision tasks, such as image classification, natural language processing, and face recognition [1,2,3,4]. As CNN algorithms have advanced, the recognition accuracy of CNN networks has improved; however, their scale has also expanded, leading to a significant increase in computational complexity. General-purpose processors are increasingly unable to meet their computational needs [5].
CNN inference tasks are one example of memory-intensive and computation-intensive tasks [6,7]. The memory and transmission bottleneck of CNN inference tasks primarily arises from the large amount of memory that CNN modules usually require to store parameters and intermediate activation values [8]. This holds especially true for large models and high-resolution input images, as they demand significant memory. During the inference process, a large amount of data needs to be transferred between the computing units and memory, which may cause memory bandwidth to become a bottleneck. Moreover, CNN inference tasks occupy substantial computational resources as CNNs, particularly Deep Convolutional Neural Networks (DCNNs), exhibit very high computational complexity [9]. The convolution operations at each layer involve a large number of matrix calculations, which require significant computational resources. To meet the high computational demands, high-performance computing hardware, such as GPUs, TPUs, and FPGAs, is often required [10]. Among these, FPGAs have gradually become a mainstream hardware platform for designing CNN hardware accelerators due to their programmability, low power consumption, and high parallelism, making them widely used for designing hardware accelerators for CNN inference tasks [11].
There are two general methods for implementing convolution modules, one of which is a structure made up of multipliers, addition trees, and accumulators, referred to as the multiply–accumulate tree structure in this paper. Many convolution operation IPs use multiply–accumulate trees to accelerate computation [12,13]. In software design terms, the multiply–accumulate tree involves increasing the number of elements participating in the computation in each loop iteration to fully utilize parallel hardware resources and optimize program performance. In FPGA hardware design terms, the optimization strategy involves placing multiple computation units to parallelize tasks within the loop body, thereby achieving acceleration. In grouped convolution, the input feature map is divided into multiple groups, and each group only convolves with a subset of convolutional kernels. Generally, the grouping operation involves dividing the input feature map, grouping the convolutional kernels, and then performing the convolution operation followed by concatenation. One challenge of employing a multiply–accumulate tree is that deploying hardware resources becomes exponentially more difficult, as it requires the replication and deployment of the hardware logic within the loop body. Excessive use of multiply–accumulate trees can also increase the complexity of hardware design, necessitating a careful balance between parallelism and resource occupancy.
The other approach to implementing convolution modules utilizes systolic arrays [14]. Systolic array architectures find applications in parallel processing, data storage and management, image processing, computer vision, neural networks, machine learning, signal processing, and more [15,16,17]. In recent years, systolic arrays have gained significant traction in the implementation of CNNs. A systolic array is essentially a simple pipeline where data pulses forward through the array driven by the clock. It is particularly suitable for handling a large number of multiply–accumulate (MAC) operations, which are common in convolutional neural networks. Convolution operations can be transformed into matrix multiplication, and systolic arrays are designed for matrix multiplication. Compared to loop unrolling technology, systolic arrays can significantly improve computational efficiency and throughput by decomposing computation tasks into multiple parallel processing units. FPGAs’ programmability permits the flexible configuration of systolic arrays’ functionality and structure to meet specific application requirements. Optimized systolic array designs for particular applications can achieve lower power consumption during computation-intensive CNN inference tasks.
Figure 1a shows an example where a 4 × 4 input feature map is convolved with four 3 × 3 weight matrices to produce four 2 × 2 output feature maps. Figure 1b,c illustrate the two matrices obtained by rearranging the data using the im2col function for the input feature map and for the weights, respectively. In this example, the systolic array consists of 4 × 4 processing elements (PEs), where each PE corresponds to a multiplier.
Figure 2 illustrates an example of a 4 × 4 systolic array computation process. The matrix multiplication shown in Figure 2 takes 12 clock cycles to perform. Below is the data transfer and computation process during specific clock cycles:
  • t0 clock: P E 00 = W 22 × Q 22 ; W 22 P E 01 ; Q 22 P E 10 .
  • t1 clock:
    { P E 00 = P E 00 + W 21 × Q 21 ; W 21 P E 01 ; Q 21 P E 10 ;}
    { P E 01 = W 22 × Q 23 ; W 22 P E 02 ; Q 23 P E 11 ;}
    { P E 10 = W 52 × Q 22 ; W 52 P E 11 ; Q 22 P E 20 .}
  • t2 clock:
    { P E 00 = P E 00 + W 20 × Q 20 ; W 20 P E 01 ; Q 20 P E 10 ;}
    { P E 01 = P E 01 + W 21 × Q 22 ; W 21 P E 02 ; Q 22 P E 11 ;}
    { P E 02 = W 22 × Q 32 ; W 22 P E 03 ; Q 32 P E 12 ;}
    { P E 10 = P E 10 + W 51 × Q 21 ; W 51 P E 11 ; Q 21 P E 20 ;}
    { P E 20 = W 82 × Q 22 ; W 82 P E 21 ; Q 22 P E 30 ;}
    { P E 11 = W 52 × Q 23 ; W 52 P E 12 ; Q 23 P E 21 ;}
  • ……
  • t11 clock: { P E 33 = P E 33 + W 90 × Q 11 ; W 90 D r a i n ; Q 20 D r a i n ;}
For the systolic arrays technique, most previous work only provides rough estimates of the theoretically optimal values under the constraints of computational resources and the performance requirements of convolution operation intellectual property (IP) during the FPGA implementation phase. In this context, this paper quantitatively analyzes the resource occupancy of CNN inference models implemented with systolic arrays on FPGAs. Therefore, the main contributions of this paper are as follows:
1.
The resource composition of systolic arrays is analyzed.
2.
A resource calculation method called AFHRE has been established to compute the FPGA resources occupied by any given set of systolic array parameters.
3.
Resource calculation methods are used to derive a series of parameters for systolic arrays, assisting design tools in finding suitable scales of CNN accelerators more quickly under hardware resource constraints.
The rest of the paper is organized as follows: Section 2 introduces related work. In Section 3, the patterns of how hardware resources for systolic arrays vary with scale are identified, and the relationship between the hardware resources of the systolic arrays and their scale is quantified. In Section 4, an algorithm for generating performance-optimized systolic arrays under given resource constraints and CNN model conditions is proposed using the roofline model. Section 5 presents the experiments and results, while Section 6 concludes the paper.

2. Related Work

In CNN models, convolutional layers account for over 90% of the computational workload [18,19,20]. Therefore, much of the current research on CNN accelerators focuses on accelerating the convolutional layers of CNNs. Among them, many systolic arrays are used to build CNN inference accelerators, such as Eyeriss [21], TPU [22], and FireFly [23].
Odyssey [24] was proposed to comprehensively construct and accurately model the performance of the systolic array design space. The dMazeRunner [25] framework was proposed, which automatically and efficiently optimizes the execution of different convolution and fully connected (FC) layers on dataflow accelerators. CoSA [26] exploits the inherent patterns in DNN operators and hardware to transform the DNN scheduling space into a mixed-integer programming (MIP) problem, incorporating both algorithmic and architectural constraints. This formulation allows for the automatic generation of a highly efficient schedule in a single step. However, these studies only provided optimization methods and did not include a method for evaluating hardware resources, or they used commercial software such as Xilinx (AMD) Vivado HLS or Altera (Intel) Quartus II for evaluation, which takes a considerable amount of time.
Timeloop [27] offers a versatile and comprehensive framework that effectively represents deep neural network (DNN) accelerators across a wide design space, automatically generating the complete mapping space for any given architecture. Additionally, no single architecture currently delivers optimal performance and energy efficiency across various workloads due to trade-offs between flexibility and efficiency. A framework called TENET [28] was proposed to model the data flow of tensor applications on spatial architectures, introducing a relation-centric notation that uniformly represents data flow, data assignment, and interconnections as relations. This notation enables a broader design space for data flow and accommodates a greater number of tensor benchmarks compared to previous notations. However, this type of research primarily provides theoretical methods and optimization frameworks without conducting experimental validation on hardware platforms.
An autonomous approach named Con-fuciuX [29] optimizes hardware resource assignments for specific models and data flow styles. Con-fuciuX utilizes a reinforcement learning technique known as REINFORCE to guide the search process, incorporating a comprehensive hardware performance cost model within the training loop to assess rewards. A domain-specific genetic algorithm-based method, GAMMA [30], was proposed, specifically designed for this hardware mapping problem. This study demonstrates that mapping is crucial for overall performance and efficiency, as it directly influences the extent to which the accelerator leverages reuse within the CNN. Furthermore, research reveals the benefits of optimizing mappings for each layer rather than using a fixed mapping for every DNN layer. However, determining the appropriate mapping for a specific accelerator and layer remains an open question. The vastness of the mapping space makes brute-force exhaustive search methods impractical.
As shown in Table 1, previous studies have either focused on the evaluation of hardware resources for general structures or have not adequately assessed hardware resources specifically for systolic arrays. In this paper, we propose an evaluation algorithm that addresses these shortcomings by analyzing the structure of systolic arrays and utilizing mathematical formulas to describe their characteristics.

3. Hardware Resource Usage Analysis

This section derives and analyzes the patterns of hardware resource changes for two classical systolic array structures as their scale varies. Section 3.1 focuses on the minimalist systolic array structure, while Section 3.2 utilizes AutoSA [31] as an example to analyze the systolic array structure generated by polyhedral compilers.

3.1. Minimalist Systolic Arrays

Figure 3 shows a systolic array composed of M × N PEs. The function of the Drain module is to destroy the data that have been used. Typically, an IP on an FPGA platform is composed of four types of hardware resources: BRAM, DSP, FF, and LUT. A systolic array consists of logic components such as Expressions, FIFOs, Instances, Multiplexers, and Registers, as shown in Table 2.
1.
Expression: An expression is a segment of code that computes a value in programming and hardware description languages (HDLs). It may consist of operators, variables, constants, and function calls.
2.
FIFO: FIFO (First In, First Out) is a data structure that allows data to be read in the order in which they were stored.
3.
Instance: An instance in hardware design refers to the specific implementation of a module or component.
4.
Multiplexer: A multiplexer is a selector that chooses one output from multiple inputs, forwarding the selected input signal to the output based on control signals.
5.
Register: A register is a storage unit capable of temporarily holding data in hardware. Registers are used to store data, states, or control information, and they read or write data under the control of a clock signal. Registers are often used in pipelined designs to maintain data values across different computation cycles.
Expressions and multiplexers are composed of LUTs. Instances are composed of DSPs, FFs, and LUTs. Registers are composed of LUTs. FIFOs are composed of FFs and LUTs. Additionally, the simple systolic array does not use BRAM. This article derives the hardware resource evaluation formula using the example of computing 16-bit fixed-point numbers.
When performing matrix multiplication between an M × K matrix and a K × N matrix, the number of DSPs, FFs, and LUTs occupied by the simple systolic array is given by D S P s u m ( M , N ) , F F s u m ( M , N , K ) , and L U T s u m ( M , N , K ) , respectively. According to Table 2, Equations (1)–(3) can be derived.
D S P s u m ( M , N ) = D S P s i n s ( M , N ) .
F F s u m ( M , N , K ) = F F s f i f o ( M , N ) + F F s i n s ( M , N , K ) + F F s r e g ( M , N ) .
L U T s u m ( M , N , K ) = L U T s e x p ( M , N ) + L U T s f i f o ( M , N ) + L U T s i n s ( M , N , K ) + L U T s m u l ( M , N ) .

3.1.1. Estimation of Expression

L U T s e x p is divided into three parts. As shown in Figure 4, the first part consists of PEs that are not connected to Drain modules. There are [ ( M 1 ) ( N 1 ) ] PEs in this part. The second part consists of PEs that are only connected to a single Drain module and the corresponding Drain modules. There are M + N 2 PEs and M + N 2 Drain modules in this part. The third part consists of PEs that are connected to two Drain modules and the corresponding Drain modules. There are one PE and two Drain modules in the third part. Thus, the sum of these three parts is given by Equation (4), where e p a , e p b and e p c are all constants.
L U T s e x p ( M , N ) = e p a ( M 1 ) ( N 1 ) + e p b ( M + N 2 ) + e p c .

3.1.2. Estimation of FIFO

F F s f i f o is divided into two parts. The first part is the number of FFs in the link queue between PEs. The second part is the number of FFs in the link queue between the PE and the Drains. Moreover, the number of FFs occupied by the link queue between PEs is exactly half of the number of FFs occupied by the link queue between the PEs and the Drains. Let the FFs occupied by the link queue between the PE and the Drain be denoted as 1 2 f i f o a . Then, the FFs occupied by the link queue between PEs is 1 4 f i f o a . According to Figure 5, when M = N = 1 , F F s f i f o = f i f o a ; when M = N = 2 , F F s f i f o = 3 f i f o a ; when M = N = 3 , F F s f i f o = 6 f i f o a ; when M = N = 4 , F F s f i f o = 10 f i f o a ; and so forth. When M = N = i , F F s f i f o = f i f o a i = 1 min ( M , N ) i . Thus, when M > 0 and N > 0 , Equation (5) holds. Similarly, L U T s f i f o also follows this pattern, which can be expressed as Equation (6), where f i f o b and f i f o c are constants.
F F s f i f o ( M , N ) = 1 4 f i f o a + 1 2 f i f o a m i n M , N M N + f i f o a m i n M , N 1 + m i n M , N 2 .
L U T s f i f o ( M , N ) = f i f o b + 1 2 f i f o c m i n M , N M N + f i f o c m i n M , N 1 + m i n M , N 2 .
Figure 5. As the scale of the systolic array increases, the hardware resources of DSPs of the FIFO and FFs/LUTs of the instance also grow systematically. ①, ②, ④, and ⑦ represent the communication resources between the PEs and Drain modules. ③, ⑤, ⑥, ⑧, ⑨, and ⑩ represent the communication resources between the PEs and PEs.
Figure 5. As the scale of the systolic array increases, the hardware resources of DSPs of the FIFO and FFs/LUTs of the instance also grow systematically. ①, ②, ④, and ⑦ represent the communication resources between the PEs and Drain modules. ③, ⑤, ⑥, ⑧, ⑨, and ⑩ represent the communication resources between the PEs and PEs.
Electronics 14 00168 g005

3.1.3. Estimation of Instance

Under the condition that each PE in the minimalist systolic array performs only one multiplication operation per clock cycle, as shown in Figure 5, each PE contains a DSP for data computation, while the Drain module is used to release data and does not require a DSP. Therefore, the number of DSPs is equal to the number of PEs, allowing us to derive Equation (7).
D S P s i n s ( M , N ) = M × N .
F F s i n s is divided into three parts. As shown in Figure 6, the first part consists of PEs that are only connected to other PEs. The second part consists of PEs that are connected to Drain modules. The third part consists of Drain modules. The first part has a total of ( M 1 ) × ( N 1 ) PEs. The number of FFs for these PEs is related to l o g 2 K and is denoted as i n s a + l o g 2 K , where i n s a is a constant. The remaining PEs belong to the second part, totaling M + N 1 PEs. Similarly, the number of FFs for these PEs is related to K and is denoted as i n s b + l o g 2 K , where i n s b is a constant. There are a total of M + N Drain modules, which are also related to K and are denoted as i n s c + 3 l o g 2 K , where i n s c is a constant. Finally, the sum of these three parts is given by Equation (8). Similarly, L U T s i n s also follows this pattern, which can be expressed as Equation (9), where i n s d , i n s e , and i n s f are constants.
F F s i n s ( M , N , K ) = i n s a + l o g 2 K M 1 ( N 1 ) + i n s b + l o g 2 K M + N 1 + i n s c + 3 l o g 2 K M + N .
L U T s i n s ( M , N , K ) = i n s d + l o g 2 K M 1 ( N 1 ) + i n s e + l o g 2 K M + N 1 + i n s f + 2 l o g 2 K M + N .
Figure 6. In order to facilitate the statistical analysis of the hardware resources for FFs and LUTs of the instance, the systolic array is divided into three parts for statistics. ①, ②, and ③ in the figure correspond to i n s a + l o g 2 K M 1 ( N 1 ) , i n s b + l o g 2 K M + N 1 , and i n s c + 3 l o g 2 K M + N in Equation (8), respectively. Similarly, ①, ②, and ③ in the figure correspond to i n s d + l o g 2 K M 1 ( N 1 ) , i n s e + l o g 2 K M + N 1 , and i n s f + 2 l o g 2 K M + N in Equation (11), respectively.
Figure 6. In order to facilitate the statistical analysis of the hardware resources for FFs and LUTs of the instance, the systolic array is divided into three parts for statistics. ①, ②, and ③ in the figure correspond to i n s a + l o g 2 K M 1 ( N 1 ) , i n s b + l o g 2 K M + N 1 , and i n s c + 3 l o g 2 K M + N in Equation (8), respectively. Similarly, ①, ②, and ③ in the figure correspond to i n s d + l o g 2 K M 1 ( N 1 ) , i n s e + l o g 2 K M + N 1 , and i n s f + 2 l o g 2 K M + N in Equation (11), respectively.
Electronics 14 00168 g006

3.1.4. Estimation of Multiplexer

L U T s m u l is linearly related to the total number of PEs and Drain modules, and the number of LUTs for each processing unit and Drain module is fixed at m u l a . In Figure 7, there are a total of [ ( M + 1 ) ( N + 1 ) 1 ] modules. Therefore, this relationship can be expressed by Equation (10).
L U T s m u l ( M , N ) = m u l a ( M + 1 ) ( N + 1 ) 1 ) .

3.1.5. Estimation of Register

F F s r e g is divided into four parts. As shown in Figure 8, the first part consists of the first PE, the last PE, and the two Drain modules connected to the last PE. The FFs in this part have a fixed value, denoted as r e g a . The second part is composed of PEs that are each connected to only three other PEs. There are ( M + N 4 ) PEs in this part (assuming M + N 4 ). The third part is composed of PEs that are connected to four other PEs, totaling ( M 2 ) ( N 2 ) PEs in this part (with M 2 and N 2 ). The fourth part consists of the PEs connected to Drain modules and the Drain modules themselves, with a total of M + N 2 groups (assuming M + N 2 ). Finally, the sum of these four parts is given by Equation (11), where r e g a , r e g b , r e g c , and r e g d are all constants.
F F s r e g ( M , N ) = r e g a + r e g b M + N 4 + r e g c M 2 ( N 2 ) + r e g d M + N 2 , M 2 , N 2 , r e g a + r e g b M + N 4 + r e g d M + N 2 , M 3 , N = 1 , r e g a + r e g b M + N 4 + r e g d M + N 2 , M = 1 , N 3 , r e g a + r e g d M + N 2 , 2 < M + N < 4 , r e g a , M = N = 1 .

3.2. Systolic Arrays Generated by Polyhedral Compilers

Figure 9 shows a schematic diagram of a systolic array module. This systolic array is generated by the polyhedral compiler AutoSA [31]. When performing matrix multiplication between an M × K matrix and a K × N matrix, the number of DSPs, FFs, and LUTs occupied by the systolic array are denoted as B m s u m ( M , N , K , S ) , D S P s u m ( M , N , K , S ) , F F s u m ( M , N , K , S ) , and L U T s u m ( M , N , K , S ) , respectively, where the parameter S indicates that one PE can simultaneously process S data items. This section is divided into four subsections to estimate the resources of BRAM, DSP, FF, and LUT.

3.2.1. Estimation of BRAM

The BRAM resources are divided into two parts: PEs and the AXI4 protocol transmission channels, represented as B m i n s t a n c e s and B m c h a n n e l s , respectively. Therefore, the total size of the used BRAM, denoted as B m k e r n e l , is the sum of B m i n s t a n c e s and B m c h a n n e l s , as shown in Equation (12).
B m s u m ( M , N , K , S ) = B m i n s t a n c e s ( M , N , K , S ) + B m c h a n n e l s ( M , N , K , S ) .
There are M PEs receiving the input data A transmitted by A _ 2 i n and A _ b o u n d . Similarly, there are N PEs receiving the input data B transmitted by B _ 2 i n and B _ b o u n d . In addition, all the PEs transfer the computed data to the buffer oC. Therefore, Equation (13) can be presented as follows.
B m i n s t a n c e s ( M , N , K , S ) = M × B m A 2 i n + N × B m B 2 i n + B m o C .
The systolic array IP communicates via the AXI4 protocol. There are three AXI4 protocol transmission channels that utilize BRAM resources: the AXIA channel for receiving data A, the AXIB channel for receiving data B, and the AXIC channel for sending data C. Therefore, Equation (14) can be presented as follows.
B m c h a n n e l s ( M , N , K , S ) = B m A X I A + B m A X I B + B m A X I C .

3.2.2. Estimation of DSP

In the systolic array IP, the computations for data A and data B are designed to occur only within the M × N PEs. Therefore, the DSP resources are utilized solely in these M × N PEs, leading to Equation (15).
D S P s u m ( M , N , S ) = D S P i n s t a n c e s ( M , N , S ) = M × N × S .

3.2.3. Estimation of FF

In the systolic array IP, the FF resources can similarly be divided into two parts: those allocated for internal modules and those used for the AXI4 interface protocol, as expressed in Equation (16).
F F s u m ( M , N , K , S ) = F F i n s t a n c e s ( M , N , K , S ) + F F c h a n n e l s ( M , N , K ) .
In the AXI4 interface section, compared to BRAM resources, FF resources are utilized not only in the data transmission channels A, B, and C but also in the control interface. Therefore, F F c h a n n e l s consists of four parts: F F A X I A , F F A X I B , F F A X I C , and F F A X I c t r l . Consequently, Equation (17) can be listed.
F F c h a n n e l s ( M , N , K ) = F F A X I A ( M , N , K ) + F F A X I B ( M , N , K ) + F F A X I C ( M , N , K ) + F F A X I c t r l ( M , N , K ) .
According to the internal module division of the systolic array IP, the FF resources are divided into three parts: those used in the data preprocessing part (represented by F F p r e ), those used in the computation part (represented by F F P E ), and those used in the data release part (represented by F F f r e e ). Equation (18) can be presented.
F F i n s t a n c e s ( M , N , K , S ) = F F p r e ( K , S ) + M × N × F F P E ( K , S ) + F F f r e e ( K , S ) .
As shown in Figure 9, the FF resources F F p r e in the data preprocessing part consist of eight modules: F F A _ s e , F F A _ 3 i n , F F A _ 2 i n , F F A _ b o u n d , F F B _ s e , F F B _ 3 i n , F F B _ 2 i n , and F F B _ b o u n d . Consequently, Equation (19) can be listed. Similarly, the FF resources F F f r e e in the data release part consists of eight modules: F F A _ D , F F B _ D , F F C _ D B 1 , F F C _ D 1 , F F C _ D B 2 , F F C _ D 2 , F F C _ D 3 and F F C _ D S . Consequently, Equation (20) can be listed.
F F p r e ( M , N , K , S ) = F F A _ s e ( M , K , S ) + F F A _ 3 i n ( M , K , S ) + ( M 1 ) F F A _ 2 i n ( K , S ) + F F A _ b o u n d ( M , K , S ) + F F B _ s e ( N , K , S ) + F F B _ 3 i n ( N , K , S ) + ( N 1 ) F F B _ 2 i n ( K , S ) + F F B _ b o u n d ( N , K , S ) .
F F f r e e ( M , N , K , S ) = M × F F A _ D ( K , S ) + N × F F B _ D ( K , S ) + N × F F C _ D B 1 ( K , S ) + ( M 1 ) N × F F C _ D 1 ( K , S ) + F F C _ D B 2 ( M , N , K , S ) + ( N 1 ) × F F C _ D 2 ( K , S ) + F F C _ D 3 ( M , N , K , S ) + F F C _ D S ( M , N , K , S ) .

3.2.4. Estimation of LUT

Unlike FF resources, LUT resources are used not only in instances and channels but also in expressions and multiplexers. Consequently, Equation (21) can be presented. According to the internal module division of the systolic array IP, the LUT resources are divided into three parts: those used in the data preprocessing part (LUT resources are represented by L U T p r e ), those used in the computation part (LUT resources are represented by L U T P E ), and those used in the data release part (LUT resources are represented by L U T f r e e ). Equation (22) can be presented. In the AXI4 interface part, LUT resources are also utilized in channels A, B, C, and the control interface. Consequently, Equation (23) can be presented.
L U T s u m ( M , N , K , S ) = L U T i n s t a n c e s ( M , N , K , S ) + L U T c h a n n e l s ( M , N , K ) + L U T e x p r e s s i o n s _ m u l t i p l e x e r s ( M , N , K , S ) .
L U T i n s t a n c e s ( M , N , K , S ) = L U T p r e ( K , S ) + M × N × L U T P E ( K , S ) + L U T f r e e ( K , S ) .
L U T c h a n n e l s ( M , N , K ) = L U T A X I A ( M , K ) + L U T A X I B ( N , K ) + L U T A X I C ( M , N ) + L U T A X I c t r l ( K ) .
Expressions and multiplexers are primarily composed of four types of combinational logic: addition, AND, OR, and multiplication. Consequently, Equation (24) can be presented.
L U T e x p r e s s i o n s _ m u l t i p l e x e r s ( M , N , K , S ) = L U T a d d ( M , N , K , S ) + L U T a n d ( M , N , K , S ) + L U T o r ( M , N , K , S ) + L U T m u l ( M , N , K , S ) .

4. Accurate and Fast Hardware Resources Estimation Method (AFHRE) of Hardware Resources for CNN Model

Given a CNN network model and a target FPGA device, our goal is to search for the design with the minimum latency while maximizing the utilization of on-chip resources. The optimization problem is summarized in Equation (25). In the constraints, M A X L U T , M A X B M , M A X F F , and M A X D S P represent the total number of L U T s , B R A M s , F F s , and D S P s in the target FPGA device, respectively. Furthermore, M A X M , M A X N , and M A X K denote the maximum values of M, N, and K specified by the user, respectively.
m i n { l a t e n c y ( M , N , K , S ) } s . t . L U T s u m M , N , K , S M A X L U T , B M s u m M , N , K , S M A X B M , D S P s u m M , N , S M A X D S P , F F s u m M , N , K , S M A X F F , 0 < N M A X N , 0 < M M A X M , 0 < K M A X K .
Since the data must ultimately be destroyed by the Drain module after entering the systolic array, the operation of the systolic array exhibits the characteristic of atomicity. Given the number of operations l a y e r o p i for the i t h layer convolution, the attainable performance of the systolic array is P a r r a y ( M , N , K , S ) FLOP/s, and the time spent to execute one atomic computation is s l t ( M , N , K , S ) seconds. Therefore, the following Equation (26) holds. Because the systolic array operates atomically, when computing the final data for each layer, if the remaining computation is less than what the systolic array can handle in one computation, the computation task will still be completed according to the full computation time of the systolic array. Therefore, we apply the ceiling function in Equation (26).
l a t e n c y ( M , N , K , S ) = s l t ( M , N , K , S ) × i = 0 C N N l y a e r N U M 1 l a y e r o p i P a r r a y ( M , N , K , S ) × s l t ( M , N , K , S ) .
The performance of the systolic array follows the roofline model [32]. As shown in Figure 10, when the computational intensity I ( M , N , K , S ) is in the memory-bound region, i.e., I ( M , N , K , S ) I m a x , the maximum performance of the systolic array is the product of the bandwidth β and the computational intensity; that is, P a r r a y ( M , N , K , S ) = β × I ( M , N , K , S ) . When the computational intensity I ( M , N , K , S ) is in the compute-bound region, i.e., I ( M , N , K , S ) > I m a x , the maximum performance of the systolic array is limited by the maximum performance of the FPGA platform, i.e., P a r r a y ( M , N , K , S ) = P m a x . In summary, the attainable performance P a r r a y ( M , N , K , S ) is expressed in Equation (27).
P a r r a y ( M , N , K , S ) = β × I ( M , N , K , S ) I ( M , N , K , S ) I m a x , P m a x I ( M , N , K , S ) > I m a x .
For a single matrix multiplication operation, each PE performs K × S multiply-accumulate operations. There is a total of M × N PEs. Therefore, the total number of operations for a single matrix multiplication is M × N × K × S . Meanwhile, data A, data B, and data C are being transferred. The computational intensity I ( M , N , K , S ) can be described by Equation (28).
I ( M , N , K , S ) = M × N × K × S B y t e i A + B y t e i B + B y t e o C .
Thus, by optimizing Equation (25), Algorithm 1, which we call AFHRE, can be used to find the globally optimal systolic array size for the CNN model C N N A under the given hardware resource conditions. The complexity of our method is O(n), and the evaluation time is linearly related to the number of rows and columns of the systolic array.
Algorithm 1: AFHRE
Input: M A X M , M A X N , M A X K , C N N i n d e x , l a y e r o p
Result: o p t i m a l M , o p t i m a l N , o p t i m a l K
Electronics 14 00168 i001

5. Evaluation

The FPGA platform model used in the experiment is the XILINX KCU 1500. The version of HLS is Vivado HLS 2019.1. Section 5.1 provides a resource estimation for the minimalist systolic array and compares the estimated hardware resource values from our proposed method with those estimated by Vivado HLS. Section 5.2 provides a resource estimation for the systolic array generated by polyhedral compilers and compares the estimated hardware resource values from our proposed method with those estimated by Vivado HLS. Section 5.3 compares the evaluation speed of our proposed method with that of Vivado HLS.

5.1. Theory and Experimental Analysis of Minimalist Systolic Arrays Hardware Resources

In Section 3.1, the analysis of the scale changes of the minimalist systolic arrays shows a clear regularity in the variation in the number of DSPs and FFs. The horizontal coordinates in Figure 11, Figure 12 and Figure 13 represent the parameters M, N, and K of the array. Specifically, ( 2 , 2 , 2 ) indicates the parameters ( M , N , K ) . As shown in Figure 11 and Figure 12, the resource estimations for DSPs and FFs are fully consistent with the resource estimates from Vivado HLS, demonstrating the accuracy of our method.
Our method has a certain degree of error when estimating the LUT resources for simple systolic arrays; however, the maximum error does not exceed 5 % as shown in Figure 13. Table 2 indicates that the LUT resources are comprised of four components. Through a more precise comparison with Vivado HLS, our method accurately calculates the values of L U T s e x p , L U T s f i f o , and L U T s m u l , while the result for L U T s i n s shows a negligible error.

5.2. Theory and Experimental Analysis of Systolic Arrays Generated by Polyhedral Compilers

The horizontal coordinates in Figure 14, Figure 15, Figure 16 and Figure 17 represent the parameters M, N, and K of the array. Specifically, ( 2 , 2 , 2 , 1 ) indicates the parameters ( M , N , K , S ) .
Each block of BRAM in the FPGA has a size of 18 Kb. BRAM resources are primarily used in the data port and data serialization sections of the AXI4 protocol. As shown in Figure 14, we can see that the estimated BRAM values are either equal to or differ by two from those of Vivado HLS. In Figure 14, the error values are generally within 5 % . In the case of parameters (3, 3, 3, 1), the relatively small number of BRAMs used results in a slightly larger error of 75% when comparing our estimation strategy to Vivado HLS. However, since the amount of BRAM resources used is small, this error has a negligible impact on the overall resource estimation.
The experiment verifies the resource estimation for 16-bit fixed-point matrix computations. At this point, the number of DSPs used by each PE is fixed. As shown in Figure 15, we can see that the estimated number of DSPs matches exactly with the number estimated by Vivado HLS.
Figure 16 and Figure 17 display the comparative estimates of FF and LUT resources, respectively. The usage of FF and LUT resources has reached 10 4 and 10 5 , respectively, indicating that constructing systolic arrays on an FPGA requires a substantial amount of these two types of resources. This helps our method reduce the error when estimating the final hardware resource values compared to the actual values. It can be observed that the maximum error in estimating FF resources in Figure 16 is approximately 11 % . In Figure 17, the maximum error in estimating LUT resources is about 5 % .

5.3. Estimation Time

Figure 18, along with Figure 19 and Figure 20, uses the convolutional models YOLOv2, AlexNet, and VGG-16 as examples to demonstrate the estimated time of the proposed method and the acceleration achieved compared to Vivado HLS calls. Our tool provides estimates with a substantial speed up, ranging from 40× to an impressive 610× faster than the commercial tool Vivado HLS.

6. Conclusions

Previous works have either focused on the evaluation of hardware resources for general structures or have not adequately assessed hardware resources specifically for systolic arrays. This paper proposes a method called AFHRE, which quantifies the resource occupancy of convolution modules implemented with systolic arrays on FPGAs and addresses these shortcomings. Additionally, it analyzes the resource composition of systolic arrays and establishes a resource calculation model to compute the FPGA resources occupied by arbitrary systolic array parameters. Experiments show that the DSP resource values estimated using this method are fully consistent with those estimated by Vivado HLS. The error rates for three other types of FPGA resources (BRAM, LUT, and FF) are within 11%. Furthermore, the speed of the above resource estimation is 40X to 610X faster than that of Vivado HLS. In summary, AFHRE can quickly determine the final occupied hardware resource values and whether the hardware resources of the selected FPGA are sufficient to achieve the minimum latency under the constraints of the hardware resource values of the convolutional accelerator. AFHRE can serve as a preprocessing step for Vivado HLS, achieving optimal or sub-optimal solutions for systolic array parameters more quickly than the original simulation methods of Vivado HLS. Thus, it can aid in designing convolutional modules with systolic array structures.

Author Contributions

Methodology, Y.W., H.Z. and J.Z.; Software, J.Z.; Formal analysis, Y.W.; Writing—original draft, Y.W.; Writing—review & editing, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program under Grant 2022YFB3303700 and the Key Laboratory of Artificial Intelligence, Ministry of Education, ShangHai JiaoTong University (Number of Open Project Program of Key Laboratory of Artificial Intelligence, Ministry of Education: AI202106).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3813–3824. [Google Scholar] [CrossRef]
  2. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar] [CrossRef]
  3. Hao, C.; Dotzel, J.; Xiong, J.; Benini, L.; Zhang, Z.; Chen, D. Enabling Design Methodologies and Future Trends for Edge AI: Specialization and Codesign. IEEE Des. Test 2021, 38, 7–26. [Google Scholar] [CrossRef]
  4. Xu, P.; Zhang, X.; Hao, C.; Zhao, Y.; Zhang, Y.; Wang, Y.; Li, C.; Guan, Z.; Chen, D.; Lin, Y. AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 23–25 February 2020; pp. 40–50. [Google Scholar] [CrossRef]
  5. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017, 105, 2295–2329. [Google Scholar] [CrossRef]
  6. Zhou, X.; Du, Z.; Zhang, S.; Zhang, L.; Lan, H.; Liu, S.; Li, L.; Guo, Q.; Chen, T.; Chen, Y. Addressing Sparsity in Deep Neural Networks. IEEE Trans.-Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 1858–1871. [Google Scholar] [CrossRef]
  7. Coates, A.; Huval, B.; Wang, T.; Wu, D.J.; Ng, A.Y.; Catanzaro, B. Deep Learning with COTS HPC Systems. In Proceedings of the 30th International Conference on International Conference on Machine Learning-Volume 28, Atlanta, GA, USA, 17–19 June 2013; pp. III-1337–III-1345. [Google Scholar]
  8. Jadhav, S.S.; Gloster, C.; Naher, J.; Doss, C.; Kim, Y. A Multi-Memory Field-Programmable Custom Computing Machine for Accelerating Compute-Intensive Applications. In Proceedings of the 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 1–4 December 2021; pp. 0619–0628. [Google Scholar] [CrossRef]
  9. Chen, Z.; Ma, Y.; Wang, Z. Hybrid Stochastic-Binary Computing for Low-Latency and High-Precision Inference of CNNs. IEEE Trans. Circuits Syst. Regul. Pap. 2022, 69, 2707–2720. [Google Scholar] [CrossRef]
  10. Nurvitadhi, E.; Sheffield, D.; Sim, J.; Mishra, A.; Venkatesh, G.; Marr, D. Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 77–84. [Google Scholar] [CrossRef]
  11. Nurvitadhi, E.; Venkatesh, G.; Sim, J.; Marr, D.; Huang, R.; Ong Gee Hock, J.; Liew, Y.T.; Srivatsan, K.; Moss, D.; Subhaschandra, S.; et al. Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 5–14. [Google Scholar] [CrossRef]
  12. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170. [Google Scholar] [CrossRef]
  13. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.S. Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 1354–1367. [Google Scholar] [CrossRef]
  14. Wei, X.; Yu, C.H.; Zhang, P.; Chen, Y.; Wang, Y.; Hu, H.; Liang, Y.; Cong, J. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 18–22 June 2017; pp. 1–6. [Google Scholar] [CrossRef]
  15. Basalama, S.; Sohrabizadeh, A.; Wang, J.; Guo, L.; Cong, J. FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA. ACM Trans. Reconfigurable Technol. Syst. 2023, 16, 3570928. [Google Scholar] [CrossRef]
  16. Samajdar, A.; Joseph, J.M.; Zhu, Y.; Whatmough, P.; Mattina, M.; Krishna, T. A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim. In Proceedings of the 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Virtual Meeting, 23–26 August 2020; pp. 58–68. [Google Scholar] [CrossRef]
  17. Nazemi, M.; Nazarian, S.; Pedram, M. High-performance FPGA implementation of equivariant adaptive separation via independence algorithm for Independent Component Analysis. In Proceedings of the 2017 IEEE 28th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Seattle, WA, USA, 10–12 July 2017; pp. 25–28. [Google Scholar] [CrossRef]
  18. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  19. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  20. Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2014, arXiv:1312.4400. [Google Scholar]
  21. Chen, Y.H.; Krishna, T.; Emer, J.; Sze, V. 14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In Proceedings of the 2016 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 31 January–4 February 2016; pp. 262–263. [Google Scholar] [CrossRef]
  22. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12. [Google Scholar] [CrossRef]
  23. Li, J.; Shen, G.; Zhao, D.; Zhang, Q.; Zeng, Y. FireFly: A High-Throughput Hardware Accelerator for Spiking Neural Networks with Efficient DSP and Memory Optimization. IEEE Trans. Very Large Scale Integr. (VLSI) Systems 2023, 31, 1178–1191. [Google Scholar] [CrossRef]
  24. Basalama, S.; Wang, J.; Cong, J. A Comprehensive Automated Exploration Framework for Systolic Array Designs. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; pp. 1–6. [Google Scholar] [CrossRef]
  25. Dave, S.; Kim, Y.; Avancha, S.; Lee, K.; Shrivastava, A. dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators. ACM Trans. Embed. Comput. Syst. 2019, 18, 1–27. [Google Scholar] [CrossRef]
  26. Huang, Q.; Kang, M.; Dinh, G.; Norell, T.; Kalaiah, A.; Demmel, J.; Wawrzynek, J.; Shao, Y.S. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Online, 14–19 June 2021; pp. 554–566. [Google Scholar] [CrossRef]
  27. Parashar, A.; Raina, P.; Shao, Y.S.; Chen, Y.H.; Ying, V.A.; Mukkara, A.; Venkatesan, R.; Khailany, B.; Keckler, S.W.; Emer, J. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In Proceedings of the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Madison, WI, USA, 24–26 March 2019; pp. 304–315. [Google Scholar] [CrossRef]
  28. Lu, L.; Guan, N.; Wang, Y.; Jia, L.; Luo, Z.; Yin, J.; Cong, J.; Liang, Y. TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Online, 14–19 June 2021; pp. 720–733. [Google Scholar] [CrossRef]
  29. Kao, S.C.; Jeong, G.; Krishna, T. ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020; pp. 622–636. [Google Scholar] [CrossRef]
  30. Kao, S.C.; Krishna, T. GAMMA: Automating the HW mapping of DNN models on accelerators via genetic algorithm. In Proceedings of the 39th International Conference on Computer-Aided Design, Virtual Conference, 2–5 November 2020. [Google Scholar] [CrossRef]
  31. Wang, J.; Guo, L.; Cong, J. AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 28 Februar–2 March 2021; pp. 93–104. [Google Scholar] [CrossRef]
  32. Williams, S.; Waterman, A.; Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 2009, 52, 65–76. [Google Scholar] [CrossRef]
Figure 1. By executing the im2col function, the convolution operation is transformed into a matrix multiplication operation. (a) One 4 × 4 input feature map is convolved with four 3 × 3 weight filters to obtain four 2 × 2 output feature maps; (b) Executing the im2col function on the input feature map; (c) Executing the im2col function on the weight; (d) After executing the im2col function, the input feature data is multiplied with the weight data to obtain the output feature map data.
Figure 1. By executing the im2col function, the convolution operation is transformed into a matrix multiplication operation. (a) One 4 × 4 input feature map is convolved with four 3 × 3 weight filters to obtain four 2 × 2 output feature maps; (b) Executing the im2col function on the input feature map; (c) Executing the im2col function on the weight; (d) After executing the im2col function, the input feature data is multiplied with the weight data to obtain the output feature map data.
Electronics 14 00168 g001
Figure 2. Performing matrix multiplication of a Q matrix of size 9 × 4 and a W matrix of size 4 × 9 on a systolic array of size 4 × 4 requires 12 clock cycles to complete the computation.
Figure 2. Performing matrix multiplication of a Q matrix of size 9 × 4 and a W matrix of size 4 × 9 on a systolic array of size 4 × 4 requires 12 clock cycles to complete the computation.
Electronics 14 00168 g002
Figure 3. A schematic diagram of an M × N systolic array.
Figure 3. A schematic diagram of an M × N systolic array.
Electronics 14 00168 g003
Figure 4. In order to facilitate the statistical analysis of the hardware resources for LUTs of the expression, the systolic array is divided into three parts for statistics. ①, ②, and ③ in the figure correspond to e p a ( M 1 ) ( N 1 ) , e p b ( M + N 2 ) , and e p c in Equation (4), respectively.
Figure 4. In order to facilitate the statistical analysis of the hardware resources for LUTs of the expression, the systolic array is divided into three parts for statistics. ①, ②, and ③ in the figure correspond to e p a ( M 1 ) ( N 1 ) , e p b ( M + N 2 ) , and e p c in Equation (4), respectively.
Electronics 14 00168 g004
Figure 7. In order to facilitate the statistical analysis of the hardware resources for LUTs of multiplexers, the systolic array is divided into two parts for statistics. ① and ② together comprise a total of [ ( M + 1 ) ( N + 1 ) 1 ] modules.
Figure 7. In order to facilitate the statistical analysis of the hardware resources for LUTs of multiplexers, the systolic array is divided into two parts for statistics. ① and ② together comprise a total of [ ( M + 1 ) ( N + 1 ) 1 ] modules.
Electronics 14 00168 g007
Figure 8. In order to facilitate the statistical analysis of the hardware resources for FFs of registers, the systolic array is divided into four parts for statistics. ①, ②, ③, and ④ in the figure correspond to r e g a , r e g b M + N 4 , r e g c M 2 ( N 2 ) , and r e g d M + N 2 in Equation (11), respectively.
Figure 8. In order to facilitate the statistical analysis of the hardware resources for FFs of registers, the systolic array is divided into four parts for statistics. ①, ②, ③, and ④ in the figure correspond to r e g a , r e g b M + N 4 , r e g c M 2 ( N 2 ) , and r e g d M + N 2 in Equation (11), respectively.
Electronics 14 00168 g008
Figure 9. Systolic array module diagram generated by AutoSA [31].
Figure 9. Systolic array module diagram generated by AutoSA [31].
Electronics 14 00168 g009
Figure 10. Basis of the roofline model [32].
Figure 10. Basis of the roofline model [32].
Electronics 14 00168 g010
Figure 11. Comparison of DSP resources estimated by Vivado HLS and those estimated by our method when estimating the minimalist systolic arrays hardware resources.
Figure 11. Comparison of DSP resources estimated by Vivado HLS and those estimated by our method when estimating the minimalist systolic arrays hardware resources.
Electronics 14 00168 g011
Figure 12. Comparison of FF resources estimated by Vivado HLS and those estimated by our method when estimating the minimalist systolic arrays hardware resources.
Figure 12. Comparison of FF resources estimated by Vivado HLS and those estimated by our method when estimating the minimalist systolic arrays hardware resources.
Electronics 14 00168 g012
Figure 13. Comparison of LUT resources estimated by Vivado HLS and those estimated by our method when estimating the minimalist systolic arrays hardware resources.
Figure 13. Comparison of LUT resources estimated by Vivado HLS and those estimated by our method when estimating the minimalist systolic arrays hardware resources.
Electronics 14 00168 g013
Figure 14. Comparison of BRAM resources estimated by Vivado HLS and those estimated by our method when estimating systolic arrays generated by polyhedral compilers.
Figure 14. Comparison of BRAM resources estimated by Vivado HLS and those estimated by our method when estimating systolic arrays generated by polyhedral compilers.
Electronics 14 00168 g014
Figure 15. Comparison of DSP resources estimated by Vivado HLS and those estimated by our method when estimating systolic arrays generated by polyhedral compilers.
Figure 15. Comparison of DSP resources estimated by Vivado HLS and those estimated by our method when estimating systolic arrays generated by polyhedral compilers.
Electronics 14 00168 g015
Figure 16. Comparison of FF resources estimated by Vivado HLS and those estimated by our method when estimating systolic arrays generated by polyhedral compilers.
Figure 16. Comparison of FF resources estimated by Vivado HLS and those estimated by our method when estimating systolic arrays generated by polyhedral compilers.
Electronics 14 00168 g016
Figure 17. Comparison of LUT resources estimated by Vivado HLS and those estimated by our method when estimating systolic arrays generated by polyhedral compilers.
Figure 17. Comparison of LUT resources estimated by Vivado HLS and those estimated by our method when estimating systolic arrays generated by polyhedral compilers.
Electronics 14 00168 g017
Figure 18. Speed and acceleration factor of hardware resource values estimated by our method when implementing YOLOv2 acceleration.
Figure 18. Speed and acceleration factor of hardware resource values estimated by our method when implementing YOLOv2 acceleration.
Electronics 14 00168 g018
Figure 19. Speed and acceleration factor of hardware resource values estimated by our method when implementing AlexNet acceleration.
Figure 19. Speed and acceleration factor of hardware resource values estimated by our method when implementing AlexNet acceleration.
Electronics 14 00168 g019
Figure 20. Speed and acceleration factor of hardware resource values estimated by our method when implementing VGG-16 acceleration.
Figure 20. Speed and acceleration factor of hardware resource values estimated by our method when implementing VGG-16 acceleration.
Electronics 14 00168 g020
Table 1. Comparison between different architecture performance-tuning methods.
Table 1. Comparison between different architecture performance-tuning methods.
Search
Algorithm
SpeedAccuracyOn-Board
Validation
Odyssey [24]Mathematical
Programming Search
Slowaccurate
Using
Vivado HLS
FPGA
dMazeRunner [25]Pruning
Random Search
Mediumnot givenNo (CPU)
CoSA [26]Mathematical
Programming
Fastnot givenGPU
Timeloop [27]Pruning
Random Search
Mediumnot givenDianNao
GPU
Eyeriss
TENET [28]Pruning
Random Search
Mediumestimation
accuracy
89.6%
No
Con-fuciuX [29]    RL Evolutionary
Search
Mediumnot givenNo
GAMMA [30]Genetic
Algorithm-Based  
Method
Mediumnot givenTPU
GPU
Eyeriss
Table 2. Composition of a minimalist systolic array resources.
Table 2. Composition of a minimalist systolic array resources.
BRAMDSPFFLUT
Expression--- L U T s e x p
FIFO-- F F s f i f o L U T s f i f o
Instance- D S P s i n s F F s i n s L U T s i n s
Multiplexer--- L U T s m u l
Register-- F F s r e g -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, Y.; Zhao, H.; Zhao, J. AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA. Electronics 2025, 14, 168. https://doi.org/10.3390/electronics14010168

AMA Style

Wang Y, Zhao H, Zhao J. AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA. Electronics. 2025; 14(1):168. https://doi.org/10.3390/electronics14010168

Chicago/Turabian Style

Wang, Yongchang, Hongzhi Zhao, and Jinyao Zhao. 2025. "AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA" Electronics 14, no. 1: 168. https://doi.org/10.3390/electronics14010168

APA Style

Wang, Y., Zhao, H., & Zhao, J. (2025). AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA. Electronics, 14(1), 168. https://doi.org/10.3390/electronics14010168

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop