A Throughput-Optimized Accelerator For Submanifold

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

A Throughput-Optimized Accelerator for Submanifold Sparse

Convolutional Networks
Shanq-Jang Ruan1 , Yu-Hsuan Lai1 , Ming Fang1 , Edwin Naroska1 , and Jeng-Lun Shieh1
1
Affiliation not available
Posted on 29 Jan 2024 — CC-BY-NC-SA 4 — https://doi.org/10.36227/techrxiv.170654629.99998482/v1 — e-Prints posted on TechRxiv are preliminary reports that are not peer reviewed. They should not b...

January 29, 2024

1
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, MANUSCRIPT 1

A Throughput-Optimized Accelerator for


Submanifold Sparse Convolutional Networks
Yu-Hsuan Lai, Shanq-Jang Ruan, Senior Member, IEEE, Ming Fang, Edwin Naroska, and Jeng-Lun Shieh

Abstract—The 3D point cloud plays a crucial role in deep To handle these challenges, several studies have investigated
learning-based vision tasks by providing precise spatial and the conversion of 3D point clouds to 2D images using
depth information, leading to its increasing importance in various projection-based methods, including bird’s-eye view (BEV)
applications. However, the sparse nature of 3D point clouds poses
computational challenges. Researchers have explored the Sub- projection [3], spherical projection [4], and multi-view pro-
manifold Sparse Convolutional Network (SSCN) for processing jection [5]. The projected 2D images can then train 2D
point cloud data while preserving sparsity. Nevertheless, existing convolutional neural networks (CNNs). Alternatively, there are
Convolutional Neural Network (CNN) accelerators encounter other studies that directly operate on the 3D data. PointNet [6]
difficulties in effectively handling SSCNs, prompting recent extracted the point features on unordered point clouds with the
studies to focus on developing dedicated accelerators for point
cloud networks to improve processing performance. This brief multi-layer perceptron (MLP) layers. VoxelNet [7] voxelized
presents a specialized hardware architecture designed for SSCNs point clouds into regular 3D voxel grids and applied CNNs to
to address the challenges of effectively processing sparse 3D point extract features.
clouds. The proposed accelerator achieves a significant 2.51× In [8], the authors emphasized the significance of the spar-
improvement in throughput density compared to previous works, sity property of point clouds, which integrates real-world phys-
highlighting its effectiveness in point cloud processing.
ical information. However, when applying traditional CNNs
Index Terms—Point clouds, Submanifold sparse convolutional to process point clouds, the inherent sparsity of point clouds
Networks (SSCNs), Hardware accelerators, Systolic array. would be lost. This is due to the fact that the features of
each layer expand outwards in the shape of the convolution
I. I NTRODUCTION kernel in traditional convolutional operations. To mitigate
the problem mentioned above, [9] proposed the Submanifold
T HE 3D point cloud is a set of data that contains 3D
coordinates. Each point in the point cloud represents a
position on the surface of an object. By connecting these
Sparse Convolutional Network (SSCN), which preserves the
sparsity of the raw data by ensuring that computations of the
points, it can represent the appearance of a 3D object or non-zero points do not dilate.
scene. Depending on the different sensors, each point may In the last decade, there has been a dramatic proliferation of
contain RGB or the intensity of the object’s reflecting surface. research concerned with specialized CNN accelerators, driven
Compared with 2D RGB images, 3D point clouds are better by the imperative need for real-time interactions in neural
suited for deep learning-based vision tasks due to several network applications. For example, Eyeriss [10] introduced a
advantages. They provide exact spatial and depth information power-efficient hardware accelerator for deep CNNs, inspiring
that is ideal for tasks requiring meticulous analysis and is the development of subsequently specialized hardware accel-
typically more resilient to changes in lighting conditions, en- erators. Another work [11] presented a sparse-wise dataflow
suring consistent and reliable performance [1]. Consequently, and an FPGA-based accelerator designed to efficiently process
3D point clouds are increasingly crucial in deep learning sparse CNNs. Additionally, a throughput-optimized channel-
applications like autonomous driving, augmented reality (AR), oriented processing element (PE) array for CNNs was pro-
and virtual reality (VR). posed in [12], demonstrating superior performance through
Despite these advantages, 3D point clouds present chal- parallelism and data reuse in convolutional layers. Neverthe-
lenges due to their properties of being unordered, sparse, less, the existing CNN accelerators encounter challenges in
and locality-sensitive [2]. Moreover, extensive data and high effectively handling SSCNs. The limitation arises from the fact
computational requirements impede point cloud processing. that not all input feature points in SSCNs require computation
with the convolution kernel. Consequently, mapping operations
Yu-Hsuan Lai, Shanq-Jang Ruan, Ming Fang, and Jeng-Lun Shieh are are necessary to determine which input feature points and their
with the Department of Electronic and Computer Engineering, National corresponding kernel weights require computation.
Taiwan University of Science and Technology, Taipei 10607, Taiwan (e-mail:
sjruan@mail.ntust.edu.tw). More recently, there has been a growing body of literature
Edwin Naroska is with the Faculty of Electrical Engineering and Comput- devoted to the development of accelerators designed explicitly
erScience, Hochschule Niederrhein University of Applied Science, Krefeld, for point cloud networks. PointAcc [13] is a pioneering accel-
Germany.
This research was funded by VIA Labs, Inc. (VLI) under the project ”Pre- erator for point cloud networks that introduced mapping units
development project of low power gesture recognition and tracking system” and memory management. Mesorasi [14] proposed delayed-
(Grant No. NTUST-VLI-NO. 09779). aggregation to enhance the efficiency of neighbor point search
We thank to National Center for High-performance Computing (NCHC) of
National Applied Research Laboratories (NARLabs) in Taiwan for providing in PointNet++. While delayed-aggregation can be utilized in
computational and storage resources. point cloud networks, its application in SSCNs is not feasible.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, MANUSCRIPT 2

computation causes the features to lose their original sparsity,


which is unsuitable for point cloud processing. In contrast,
submanifold sparse convolution maintains the same sparsity in
the output feature map as the input feature map. By preserving
the sparsity of the feature, submanifold sparse convolution
effectively retains the intrinsic characteristics of point clouds.
Consequently, it is particularly well-suited for point cloud
processing.
In submanifold sparse convolution, the output coordinates
correspond directly to the input coordinates when the stride
is one. By employing kernel offsets, the non-zero neighboring
features are identified for computation. Subsequently, the cor-
responding feature vectors are calculated for each neighboring
Fig. 1. Schematic diagram of two distinct convolution operations. feature. The submanifold sparse convolution can be formulated
as
XX
In contrast, [15] presented an FPGA-based accelerator for fjout = α(Wλ · fiin ) (1)
SSCNs that incorporates a tile-based zero removal strategy, λ∈β i
significantly enhancing GOPS performance. However, there is (
still potential for further improvement in enhancing throughput 1, if cin
i = scj
out

α= (2)
within this context. 0, otherwise.
In light of the challenges mentioned above, this brief Where s denotes the stride, Wλ represents the weight asso-
proposes an accelerator explicitly designed for SSCNs. This ciated with the convolution. The kernel offset is denoted as
work’s main contributions include: β = (a, b, c), where a, b, c ∈ Z are variables satisfying the
• We design a specialized hardware architecture that sup- constraint |a, b, c| ≤ k. The value k 3 corresponds to the size
ports the rule table generation and gather/scatter opera- of the kernel.
tion, facilitating the identification of corresponding input Fig. 2 depicts the computational workflow of submanifold
and output point clouds for each weight. Furthermore, sparse convolution [16]. The initial step is mapping opera-
our design ensures scalability to handle point clouds of tions to establish a rule table for input-output associations.
any scale. This rule table indicates the indices necessary for convolu-
• We develop a weight-stationary PE array architecture tion computation with each weight. Subsequently, the input
based on the systolic array concept. Our design effec- features corresponding to different weights are gathered for
tively harnesses weight reuse, improving throughput and computation. Finally, the partial sums are scattered to their
optimizing resource utilization. respective output points. In this study, we have developed
• Taking into account the above two points, we introduce a a dedicated hardware system that enables rule table genera-
dedicated accelerator design for SSCNs implemented on tion and facilitates the gather/scatter operation. Moreover, a
the Xilinx ZCU104 platform. specialized computational unit has been designed to handle
the computational requirements of the submanifold sparse
II. S UBMANIFOLD S PARSE C ONVOLUTIONAL N ETWORKS convolution effectively.
In the era of widespread LiDAR sensor availability, 3D
III. P ROPOSED A RCHITECTURE
point clouds have become an indispensable component in deep
learning applications. These point clouds can be represented A. Architectural Overview
as pi = (ci , fi ), where ci = (xi , yi , zi ) denotes the 3D coor- The proposed architecture, depicted in Fig. 3, consists of
dinates and fi represents the feature vector. However, in the four key units: rule table generation unit (RTGU), gather
context of point cloud processing, submanifold sparse convo- unit (GTU), scatter unit (STU), and computation unit (CTU).
lution introduces a distinct computational paradigm compared RTGU is responsible for transforming the input point cloud
to traditional convolution. Fig. 1 illustrates the convolution for coordinates into a rule table, which is then stored in the rule
point clouds in comparison to traditional convolution. buffer for subsequent operations. By leveraging the generated
In traditional convolution, the computation involves sliding rule table, GTU effectively controls the selection of features
a kernel over the input features, which leads to the expansion to be computed and their corresponding weights. The selected
of sparse data in the output feature map. For instance, consider features and weights are transmitted to CTU, where the
an image consisting of a single non-zero pixel and the rest matrix multiplication operation is executed. Subsequently, the
being zeros. Applying a 3 × 3 kernel convolution operation resulting partial sums obtained from the computations are
would yield a feature map containing nine locations with directed to STU for distribution and accumulation, leading to
non-zero values. Subsequently, repeating the convolution with the generation of computed features as the final output. The
the same kernel would result in an output feature map with subsequent sections will provide in-depth explanations of the
25 positions containing non-zero values, and so on. Such functionalities and operations of each unit.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, MANUSCRIPT 3

Fig. 2. Computational workflow of an SCNN layer.

relationship between the output and the weight in the cor-


responding direction for each point. Through the iterative
application of this method, the rule table required for the
computation units can be constructed. Fig. 4 illustrates an
example scenario where the convolution kernel traverses the
input features and reaches the position C3 . In this context, the
output value at C3 is determined by summing the products
obtained from multiplying C3 and its neighboring non-zero
points with their corresponding weights. By shifting the input
point cloud features in the negative x-axis direction, a note-
worthy observation is made: the coordinates of C3 prior to
the shift coincide with the coordinates of C4 after the shift.
This correspondence implies that when the weight in that
specific direction is multiplied by the feature at position C4 ,
it contributes to a partial sum of the output.

Fig. 3. Overall architecture of the proposed hardware. C. Gather Unit and Scatter Unit
Gather and Scatter Unit coordinates the collection and distri-
bution of data based on the rules provided by RTGU, enabling
effective computations within CTU. GTU sequentially loads
the input features and weights for computation using the
coordinate and weight index messages derived from the rule
table. These loaded data are then passed to CTU for efficient
matrix operations. Upon completion of highly parallel compu-
tations within CTU, all the computed outputs are transmitted
to STU. STU performs the element-wise summation of the
corresponding partial sums, yielding the final output of the
computation.

D. Computation Unit
This study presents a specialized PE array, as depicted in
Fig. 5, designed specifically to enhance the efficiency of matrix
Fig. 4. Rule table generation algorithm. operations for SSCNs. Leveraging the output of GTU, the
input feature values that require multiplication with weights of
identical values are intelligently grouped together, optimizing
B. Rule Table Generation Unit computational efficiency. By adopting a weight-stationary PE
In submanifold sparse convolution, accurately identifying array architecture inspired by systolic dataflow, the proposed
the target points for computation, their neighboring non-zero design maximizes data reuse and effectively reduces data
points, and corresponding weights plays a crucial role. To access frequency, resulting in enhanced computational perfor-
achieve this task, a shifting operation is applied to the input mance.
point cloud features. By comparing the coordinates of points Input features and weights are efficiently transmitted to the
before and after the shift, we establish the computational PE array through First-In-First-Out (FIFO) channels, ensuring
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, MANUSCRIPT 4

TABLE I
R ESOURCE U TILIZATION ON X ILINX ZCU104 E VALUATION B OARD

LUT LUTRAM FF DSP


Utilization 49942 320 42115 256

[8] model is used as a benchmark. The arithmetic precision in


the proposed hardware employed an 8-bit fixed point.

B. Benchmark Results
Fig. 5. Weight-stationary PE array.
Throughput is an important metric for hardware accelera-
tors. In addition to that, for the purpose of comparison, the
throughput of each PE should also be considered because
increasing the number of PEs can significantly improve the
overall throughput. The throughput of each PE is referred
to as throughput density, which is obtained by dividing the
total throughput by the number of PEs. Furthermore, when
designing for implementation on an FPGA, the number of
PEs in the throughput density definition is modified to the
Fig. 6. PE architecture.
number of digital signal processors (DSPs). The utilization of
hardware resources in the proposed design is summarized in
proper data flow. Prior to computation, the weights are initial- Table I, which includes the number of lookup tables (LUTs),
ized and stored within each PE, enabling efficient processing. LUTRAMs, flip-flops (FFs), and DSPs.
Each PE conducts multiplication and addition operations on Table II shows the PE utilization, throughput, and through-
the input feature, including the output of previous stage and the put density for each layer in the SS U-Net processed on
corresponding weight. The resulting output is then seamlessly the CTU. The first layer exhibits a PE utilization of 6.25%,
forwarded to the subsequent PE in the array, enabling a smooth which is significantly lower than the other layers due to
cascade of computations. Ultimately, the final layer of PEs its single input channel. As a result, the throughput of the
generates the result, which is subsequently transmitted to STU first layer is only 1.593 GOP/s. However, the subsequent
for the summation of outputs. Notably, our architecture enables layers achieve 100% PE utilization, leading to an overall
simultaneous and parallel entry of up to 16 data, enhancing PE utilization of 91.48% and significantly higher throughput
computational efficiency and throughput. compared to the first layer. Since most model layers have
The detailed structure of each PE is illustrated in Fig.6. multiple channels, the lower PE utilization of the first layer,
Upon triggering the clock signal, input features are multiplied with its single channel, has a minimal impact on the overall
by pre-stored weights, and the resulting product is added system performance. Consequently, the overall throughput
to the output of the next-level PE. Simultaneously, input remains relatively unaffected, and the average total throughput
features are conveyed to the succeeding PE within the same achieved is 44.55 GOP/s. Furthermore, the total throughput
clock cycle. This enables continuous operation of subsequent density reaches 0.174 GOP/s.
weights on the same input feature, maximizing data reuse. The
PE architecture utilizes a combination of multiplexers and D C. Comparison with Previous Works
flip-flops to accomplish a weight-stationary PE array, where
Table III presents a comprehensive comparison with pre-
weights can be firmly stored within the PEs. This configuration
vious works. The ESCA [15] proposed by Z. Wang et al.
allows for the preloading of weights and effectively latching
achieved a throughput of 17.73 GOP/s and a throughput
them. Consequently, during each clock cycle, the PE can
density of 0.069 GOP/s on the SS U-Net at a clock frequency
repeatedly utilize fixed weights to process different input
of 270 MHz. In this brief, we introduce an enhanced design for
features, significantly enhancing computational efficiency.
the SS U-Net with the same number of PEs but at a lower clock
frequency of 100 MHz. Our proposed design achieves a signif-
IV. E VALUATION
icantly higher throughput and throughput density, specifically
A. Experimental Setup 44.55 GOP/s and 0.174 GOP/s, respectively. Notably, our work
The proposed architecture is implemented and verified demonstrates a 2.51× increase in throughput density compared
on the Xilinx Zynq UltraScale+ MPSoC ZCU104 device, to ESCA [15]. Moreover, to comprehensively evaluate our
utilizing the Verilog programming language and the Xilinx work, we also conducted a comparative analysis with another
Vivado 2023.1 toolchain. The architecture operates at a clock FPGA-based point cloud neural network accelerator. Y. Yu
frequency of 100MHz. To evaluate the performance of the et al. [17] proposed a low-latency accelerator specifically
proposed design, the 3D submanifold sparse U-Net (SS U-Net) designed for Light PointNet (LPN). In comparison to their
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, MANUSCRIPT 5

TABLE II R EFERENCES
E XPERIMENTAL R ESULTS OF E ACH L AYER IN SS U-N ET
[1] H. Tang, Z. Liu, X. Li, Y. Lin, and S. Han, “Torchsparse: Efficient point
cloud inference engine,” Proceedings of Machine Learning and Systems,
vol. 4, pp. 302–315, 2022.
Convolutional PE Utilization Throughput Throughput [2] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense
Layer (%) (GOP/s) Density (GOP/s) 3d object detector for point cloud,” in Proceedings of the IEEE/CVF
international conference on computer vision, 2019, pp. 1951–1960.
1-1 6.25 1.593 0.006 [3] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection
1-2 100.0 49.38 0.193 from point clouds,” in Proceedings of the IEEE conference on Computer
2-1 100.0 49.38 0.193 Vision and Pattern Recognition, 2018, pp. 7652–7660.
[4] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and
2-2 100.0 49.38 0.193 M. Tomizuka, “Squeezesegv3: Spatially-adaptive convolution for effi-
2-3 100.0 49.38 0.193 cient point-cloud segmentation,” in Computer Vision–ECCV 2020: 16th
3-1 100.0 49.38 0.193 European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Part XXVIII 16. Springer, 2020, pp. 1–19.
3-2 100.0 49.38 0.193 [5] Q. Liu, H. Yuan, H. Su, H. Liu, Y. Wang, H. Yang, and J. Hou,
3-3 100.0 49.38 0.193 “Pqa-net: Deep no reference point cloud quality assessment via multi-
4-1 100.0 49.38 0.193 view projection,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 31, no. 12, pp. 4645–4660, 2021.
4-2 100.0 49.38 0.193 [6] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on
4-3 100.0 49.38 0.193 point sets for 3d classification and segmentation,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017, pp.
Total 91.48 44.55 0.174 652–660.
[7] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud
based 3d object detection,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2018, pp. 4490–4499.
TABLE III [8] B. Graham, M. Engelcke, and L. Van Der Maaten, “3d semantic segmen-
C OMPARISON WITH PREVIOUS WORKS ON FPGA tation with submanifold sparse convolutional networks,” in Proceedings
of the IEEE conference on computer vision and pattern recognition,
2018, pp. 9224–9232.
Y. Yu et al. [17] ESCA [15] This Work [9] B. Graham and L. Van der Maaten, “Submanifold sparse convolutional
networks,” arXiv preprint arXiv:1706.01307, 2017.
Device KCU105 ZCU102 ZCU104 [10] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
Frequency
100 270 100 works,” IEEE journal of solid-state circuits, vol. 52, no. 1, pp. 127–138,
(MHz) 2016.
Precision 8-bit fixed 8-bit fixed 8-bit fixed [11] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, and H. Shen, “An efficient
Number of PEs hardware accelerator for structured sparse convolutional neural networks
2400 256 256 on fpgas,” IEEE Transactions on Very Large Scale Integration (VLSI)
(DSPs)
Benchmark LPN SS U-Net SS U-Net Systems, vol. 28, no. 9, pp. 1953–1965, 2020.
[12] Y.-X. Chen and S.-J. Ruan, “A throughput-optimized channel-oriented
Throughput processing element array for convolutional neural networks,” IEEE
275.8 17.73 44.55
(GOP/s) Transactions on Circuits and Systems II: Express Briefs, vol. 68, no. 2,
Throughput pp. 752–756, 2020.
0.115 0.069 0.174
Density (GOP/s) [13] Y. Lin, Z. Zhang, H. Tang, H. Wang, and S. Han, “Pointacc: Efficient
point cloud accelerator,” in MICRO-54: 54th Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, 2021, pp. 449–461.
[14] Y. Feng, B. Tian, T. Xu, P. Whatmough, and Y. Zhu, “Mesorasi:
Architecture support for point cloud analytics via delayed-aggregation,”
work, our accelerator exhibits a notable 1.51× improvement in 2020 53rd Annual IEEE/ACM International Symposium on Microar-
in throughput density. These substantial enhancements can chitecture (MICRO). IEEE, 2020, pp. 1037–1050.
primarily be attributed to the utilization of the rule table [15] Z. Wang, W. Mao, P. Yang, Z. Wang, and J. Lin, “An efficient fpga
accelerator for point cloud,” in 2022 IEEE 35th International System-
generated by RTGU in conjunction with the weight-stationary on-Chip Conference (SOCC). IEEE, 2022, pp. 1–6.
data pattern in CTU. This combination effectively enables the [16] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional
efficient processing of point cloud data of varying sizes. detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[17] Y. Yu, W. Mao, J. Luo, and Z. Wang, “A low-latency framework
with algorithm-hardware co-optimization for 3d point cloud,” IEEE
V. C ONCLUSION Transactions on Circuits and Systems II: Express Briefs, 2023.
In this brief, we present a throughput-optimized accelerator
for SSCNs. Our approach introduces a specialized hardware
design to encode the rule table, enabling the transformation of
submanifold sparse convolution on point cloud features into
matrix multiplication. Additionally, we proposed a dedicated
computation unit design tailored explicitly for point cloud
features. The proposed design is implemented on the Xilinx
Zynq ZCU104 device. Our evaluation results demonstrate a
significant 2.51× improvement in throughput density compared
to ESCA [15], a recent approach for running the 3D SS U-
Net. Furthermore, compared to the other point cloud neural
network accelerator targeting LPN [17], our proposed acceler-
ator demonstrates a 1.51× improvement in throughput density.

You might also like