A Throughput-Optimized Accelerator For Submanifold Convolutional Networks
Shanq-Jang Ruan, Yu-Hsuan Lai, Ming Fang, Edwin Naroska, and Jeng-Lun Shieh
Posted on 29 Jan 2024 — CC-BY-NC-SA 4.0 — https://doi.org/10.36227/techrxiv.170654629.99998482/v1
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, MANUSCRIPT 1
Abstract—The 3D point cloud plays a crucial role in deep learning-based vision tasks by providing precise spatial and depth information, leading to its increasing importance in various applications. However, the sparse nature of 3D point clouds poses computational challenges. Researchers have explored the Submanifold Sparse Convolutional Network (SSCN) for processing point cloud data while preserving sparsity. Nevertheless, existing Convolutional Neural Network (CNN) accelerators encounter difficulties in handling SSCNs effectively, prompting recent studies to focus on developing dedicated accelerators for point cloud networks to improve processing performance. This brief presents a specialized hardware architecture designed for SSCNs to address the challenges of effectively processing sparse 3D point clouds. The proposed accelerator achieves a significant 2.51× improvement in throughput density compared to previous works, highlighting its effectiveness in point cloud processing.

Index Terms—Point clouds, submanifold sparse convolutional networks (SSCNs), hardware accelerators, systolic array.

Yu-Hsuan Lai, Shanq-Jang Ruan, Ming Fang, and Jeng-Lun Shieh are with the Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan (e-mail: sjruan@mail.ntust.edu.tw).
Edwin Naroska is with the Faculty of Electrical Engineering and Computer Science, Hochschule Niederrhein University of Applied Sciences, Krefeld, Germany.
This research was funded by VIA Labs, Inc. (VLI) under the project "Pre-development project of low power gesture recognition and tracking system" (Grant No. NTUST-VLI-NO. 09779).
We thank the National Center for High-performance Computing (NCHC) of National Applied Research Laboratories (NARLabs) in Taiwan for providing computational and storage resources.

I. INTRODUCTION

The 3D point cloud is a set of data points containing 3D coordinates. Each point represents a position on the surface of an object, and by connecting these points, a point cloud can represent the appearance of a 3D object or scene. Depending on the sensor, each point may also carry RGB values or the intensity of the object's reflecting surface. Compared with 2D RGB images, 3D point clouds are better suited for deep learning-based vision tasks due to several advantages: they provide exact spatial and depth information that is ideal for tasks requiring meticulous analysis, and they are typically more resilient to changes in lighting conditions, ensuring consistent and reliable performance [1]. Consequently, 3D point clouds are increasingly crucial in deep learning applications such as autonomous driving, augmented reality (AR), and virtual reality (VR).

Despite these advantages, 3D point clouds present challenges due to their properties of being unordered, sparse, and locality-sensitive [2]. Moreover, extensive data volumes and high computational requirements impede point cloud processing.

To handle these challenges, several studies have investigated the conversion of 3D point clouds to 2D images using projection-based methods, including bird's-eye view (BEV) projection [3], spherical projection [4], and multi-view projection [5]. The projected 2D images can then be used to train 2D convolutional neural networks (CNNs). Alternatively, other studies operate directly on the 3D data. PointNet [6] extracted point features from unordered point clouds with multi-layer perceptron (MLP) layers. VoxelNet [7] voxelized point clouds into regular 3D voxel grids and applied CNNs to extract features.

In [8], the authors emphasized the significance of the sparsity property of point clouds, which encodes real-world physical information. However, when traditional CNNs are applied to point clouds, this inherent sparsity is lost, because in traditional convolutional operations the features of each layer expand outwards in the shape of the convolution kernel. To mitigate this problem, [9] proposed the Submanifold Sparse Convolutional Network (SSCN), which preserves the sparsity of the raw data by ensuring that computations on the non-zero points do not dilate the active set.

In the last decade, there has been a dramatic proliferation of research on specialized CNN accelerators, driven by the need for real-time interaction in neural network applications. For example, Eyeriss [10] introduced a power-efficient hardware accelerator for deep CNNs, inspiring the development of subsequent specialized hardware accelerators. Another work [11] presented a sparse-wise dataflow and an FPGA-based accelerator designed to efficiently process sparse CNNs. Additionally, a throughput-optimized channel-oriented processing element (PE) array for CNNs was proposed in [12], demonstrating superior performance through parallelism and data reuse in convolutional layers. Nevertheless, existing CNN accelerators encounter challenges in handling SSCNs effectively. The limitation arises from the fact that not all input feature points in SSCNs require computation with the convolution kernel. Consequently, mapping operations are necessary to determine which input feature points and which of their corresponding kernel weights require computation.

More recently, a growing body of literature has been devoted to accelerators designed explicitly for point cloud networks. PointAcc [13] is a pioneering accelerator for point cloud networks that introduced mapping units and memory management. Mesorasi [14] proposed delayed-aggregation to enhance the efficiency of neighbor point search in PointNet++. While delayed-aggregation can be utilized in point cloud networks, its application to SSCNs is not feasible.
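To make the dilation issue concrete, the following sketch (hypothetical NumPy code of our own, not from the paper) counts the active sites produced by a regular 3×3 convolution, where any site whose neighborhood touches an active input becomes active, versus a submanifold convolution, which keeps the active set fixed:

```python
import numpy as np

def dilate_active(active, k=3):
    """Active output sites of a regular k x k convolution: every site
    whose k x k neighborhood contains an active input becomes active."""
    out = np.zeros_like(active)
    r = k // 2
    for y, x in zip(*np.nonzero(active)):
        # NumPy clips the upper slice bounds at the array edge for us.
        out[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1] = True
    return out

# A sparse input: a diagonal "curve" of 8 active points in a 16 x 16 grid.
active = np.zeros((16, 16), dtype=bool)
for i in range(8):
    active[i, i] = True

dense_out = dilate_active(active)  # regular conv dilates the active set
subm_out = active.copy()           # submanifold conv keeps it unchanged

print(active.sum(), dense_out.sum(), subm_out.sum())
```

After one regular convolution the 8 active sites grow to 39; stacking layers would smear the curve out further, whereas the submanifold rule leaves exactly the original 8 sites active at every depth.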
Fig. 3. Overall architecture of the proposed hardware.

Fig. 4. Rule table generation algorithm.

B. Rule Table Generation Unit

In submanifold sparse convolution, accurately identifying the target points for computation, their neighboring non-zero points, and the corresponding weights plays a crucial role. To achieve this, a shifting operation is applied to the input point cloud features. By comparing the coordinates of points before and after the shift, we establish the computational rules that are recorded in the rule table.

C. Gather Unit and Scatter Unit

The Gather and Scatter Units coordinate the collection and distribution of data based on the rules provided by the RTGU, enabling effective computation within the CTU. The GTU sequentially loads the input features and weights for computation using the coordinate and weight-index messages derived from the rule table. These loaded data are then passed to the CTU for efficient matrix operations. Upon completion of the highly parallel computations within the CTU, all computed outputs are transmitted to the STU, which performs the element-wise summation of the corresponding partial sums, yielding the final output of the computation.

D. Computation Unit

This study presents a specialized PE array, as depicted in Fig. 5, designed specifically to enhance the efficiency of matrix operations for SSCNs. Leveraging the output of the GTU, input feature values that require multiplication with identical weight values are grouped together, optimizing computational efficiency. By adopting a weight-stationary PE array architecture inspired by systolic dataflow, the proposed design maximizes data reuse and effectively reduces data access frequency, resulting in enhanced computational performance.

Input features and weights are efficiently transmitted to the PE array through First-In-First-Out (FIFO) channels, ensuring
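The rule-table-driven gather-compute-scatter flow can be modeled in software as follows. This is a simplified sketch under our own naming (Python, not the paper's RTL): the rule table is built by hashing the active input coordinates and probing them at each shifted kernel offset, and the convolution is then evaluated as per-offset gathered matrix products whose partial sums are scattered back:

```python
import numpy as np

def build_rule_table(coords):
    """Rule table for a 3x3 submanifold convolution.

    For every kernel offset, pair each active output site with the active
    input neighbor found at the shifted coordinate. Under the submanifold
    rule, the output sites are exactly the input sites.
    """
    index = {tuple(c): i for i, c in enumerate(coords)}  # coord -> row id
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    rules = {k: [] for k in range(len(offsets))}
    for out_i, (y, x) in enumerate(coords):
        for k, (dy, dx) in enumerate(offsets):
            in_i = index.get((y + dy, x + dx))
            if in_i is not None:                 # neighbor is active
                rules[k].append((in_i, out_i))   # (gather row, scatter row)
    return rules

def submanifold_conv(coords, feats, weights):
    """feats: (N, Cin); weights: (9, Cin, Cout). Gather-matmul-scatter."""
    N, Cout = feats.shape[0], weights.shape[2]
    out = np.zeros((N, Cout))
    for k, pairs in build_rule_table(coords).items():
        if not pairs:
            continue
        in_idx, out_idx = map(np.array, zip(*pairs))
        # Gather rows, multiply by this offset's weight, scatter partial sums.
        out[out_idx] += feats[in_idx] @ weights[k]
    return out

coords = [(0, 0), (1, 1), (2, 2)]   # a sparse diagonal of 3 active points
feats = np.ones((3, 2))             # Cin = 2
weights = np.ones((9, 2, 4))        # 3x3 kernel, Cout = 4
print(submanifold_conv(coords, feats, weights))
```

With all-ones inputs, the two endpoints accumulate two partial sums each and the middle point three, so the first output channel reads 4, 6, 4 for the three sites. In the hardware, the rules dictionary corresponds to the RTGU's rule table, the fancy-index gather to the GTU, the matrix product to the CTU's PE array, and the indexed accumulation to the STU.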
proper data flow. Prior to computation, the weights are initialized and stored within each PE, enabling efficient processing. Each PE conducts multiplication and addition operations on the input feature, the output of the previous stage, and the corresponding weight. The resulting output is then forwarded to the subsequent PE in the array, enabling a smooth cascade of computations. Ultimately, the final layer of PEs generates the result, which is transmitted to the STU for the summation of outputs. Notably, our architecture enables simultaneous, parallel entry of up to 16 data words, enhancing computational efficiency and throughput.

The detailed structure of each PE is illustrated in Fig. 6. Upon triggering the clock signal, input features are multiplied by pre-stored weights, and the resulting product is added to the output of the next-level PE. Simultaneously, input features are conveyed to the succeeding PE within the same clock cycle. This enables subsequent weights to operate continuously on the same input feature, maximizing data reuse. The PE architecture utilizes a combination of multiplexers and D flip-flops to realize a weight-stationary PE array, in which the weights are firmly stored within the PEs. This configuration allows the weights to be preloaded and effectively latched. Consequently, during each clock cycle, the PE can repeatedly apply fixed weights to different input features, significantly enhancing computational efficiency.

Fig. 5. Weight-stationary PE array.

Fig. 6. PE architecture.

IV. EVALUATION

A. Experimental Setup

The proposed architecture is implemented and verified on the Xilinx Zynq UltraScale+ MPSoC ZCU104 device, using the Verilog hardware description language and the Xilinx Vivado 2023.1 toolchain. The architecture operates at a clock frequency of 100 MHz. To evaluate the performance of the proposed design, the 3D submanifold sparse U-Net (SS U-Net) is adopted as the benchmark.

TABLE I
RESOURCE UTILIZATION ON XILINX ZCU104 EVALUATION BOARD

B. Benchmark Results

Throughput is an important metric for hardware accelerators. For comparison purposes, the throughput of each PE should also be considered, because increasing the number of PEs can significantly improve the overall throughput. The throughput per PE is referred to as the throughput density, obtained by dividing the total throughput by the number of PEs. Furthermore, when designing for an FPGA implementation, the number of PEs in the throughput-density definition is replaced by the number of digital signal processors (DSPs). The utilization of hardware resources in the proposed design is summarized in Table I, which includes the number of lookup tables (LUTs), LUTRAMs, flip-flops (FFs), and DSPs.

Table II shows the PE utilization, throughput, and throughput density for each layer of the SS U-Net processed on the CTU. The first layer exhibits a PE utilization of 6.25%, which is significantly lower than that of the other layers due to its single input channel. As a result, the throughput of the first layer is only 1.593 GOP/s. However, the subsequent layers achieve 100% PE utilization, leading to an overall PE utilization of 91.48% and significantly higher throughput than the first layer. Since most model layers have multiple channels, the lower PE utilization of the single-channel first layer has minimal impact on overall system performance. Consequently, the overall throughput remains largely unaffected, and the average total throughput achieved is 44.55 GOP/s. Furthermore, the total throughput density reaches 0.174 GOP/s.

C. Comparison with Previous Works

Table III presents a comprehensive comparison with previous works. ESCA [15], proposed by Z. Wang et al., achieved a throughput of 17.73 GOP/s and a throughput density of 0.069 GOP/s on the SS U-Net at a clock frequency of 270 MHz. In this brief, we introduce an enhanced design for the SS U-Net with the same number of PEs but at a lower clock frequency of 100 MHz. Our proposed design achieves significantly higher throughput and throughput density, specifically 44.55 GOP/s and 0.174 GOP/s, respectively. Notably, our work demonstrates a 2.51× increase in throughput density compared to ESCA [15]. Moreover, to evaluate our work comprehensively, we also conducted a comparative analysis with another FPGA-based point cloud neural network accelerator: Y. Yu et al. [17] proposed a low-latency accelerator specifically designed for Light PointNet (LPN). In comparison to their
work, our accelerator exhibits a notable 1.51× improvement in throughput density. These substantial enhancements can primarily be attributed to the utilization of the rule table generated by the RTGU in conjunction with the weight-stationary data pattern in the CTU. This combination effectively enables the efficient processing of point cloud data of varying sizes.

TABLE II
EXPERIMENTAL RESULTS OF EACH LAYER IN SS U-NET

Convolutional Layer   PE Utilization (%)   Throughput (GOP/s)   Throughput Density (GOP/s)
1-1                   6.25                 1.593                0.006
1-2                   100.0                49.38                0.193
2-1                   100.0                49.38                0.193
2-2                   100.0                49.38                0.193
2-3                   100.0                49.38                0.193
3-1                   100.0                49.38                0.193
3-2                   100.0                49.38                0.193
3-3                   100.0                49.38                0.193
4-1                   100.0                49.38                0.193
4-2                   100.0                49.38                0.193
4-3                   100.0                49.38                0.193
Total                 91.48                44.55                0.174

TABLE III
COMPARISON WITH PREVIOUS WORKS ON FPGA

                             Y. Yu et al. [17]   ESCA [15]     This Work
Device                       KCU105              ZCU102        ZCU104
Frequency (MHz)              100                 270           100
Precision                    8-bit fixed         8-bit fixed   8-bit fixed
Number of PEs (DSPs)         2400                256           256
Benchmark                    LPN                 SS U-Net      SS U-Net
Throughput (GOP/s)           275.8               17.73         44.55
Throughput Density (GOP/s)   0.115               0.069         0.174

V. CONCLUSION

In this brief, we present a throughput-optimized accelerator for SSCNs. Our approach introduces a specialized hardware design to encode the rule table, enabling the transformation of submanifold sparse convolution on point cloud features into matrix multiplication. Additionally, we propose a dedicated computation unit design tailored explicitly for point cloud features. The proposed design is implemented on the Xilinx Zynq ZCU104 device. Our evaluation results demonstrate a significant 2.51× improvement in throughput density compared to ESCA [15], a recent approach for running the 3D SS U-Net. Furthermore, compared to another point cloud neural network accelerator targeting LPN [17], our proposed accelerator demonstrates a 1.51× improvement in throughput density.

REFERENCES

[1] H. Tang, Z. Liu, X. Li, Y. Lin, and S. Han, "TorchSparse: Efficient point cloud inference engine," Proceedings of Machine Learning and Systems, vol. 4, pp. 302–315, 2022.
[2] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, "STD: Sparse-to-dense 3D object detector for point cloud," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1951–1960.
[3] B. Yang, W. Luo, and R. Urtasun, "PIXOR: Real-time 3D object detection from point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
[4] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka, "SqueezeSegV3: Spatially-adaptive convolution for efficient point-cloud segmentation," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII. Springer, 2020, pp. 1–19.
[5] Q. Liu, H. Yuan, H. Su, H. Liu, Y. Wang, H. Yang, and J. Hou, "PQA-Net: Deep no-reference point cloud quality assessment via multi-view projection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4645–4660, 2021.
[6] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
[7] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
[8] B. Graham, M. Engelcke, and L. Van Der Maaten, "3D semantic segmentation with submanifold sparse convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9224–9232.
[9] B. Graham and L. Van der Maaten, "Submanifold sparse convolutional networks," arXiv preprint arXiv:1706.01307, 2017.
[10] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
[11] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, and H. Shen, "An efficient hardware accelerator for structured sparse convolutional neural networks on FPGAs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 9, pp. 1953–1965, 2020.
[12] Y.-X. Chen and S.-J. Ruan, "A throughput-optimized channel-oriented processing element array for convolutional neural networks," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 68, no. 2, pp. 752–756, 2020.
[13] Y. Lin, Z. Zhang, H. Tang, H. Wang, and S. Han, "PointAcc: Efficient point cloud accelerator," in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 449–461.
[14] Y. Feng, B. Tian, T. Xu, P. Whatmough, and Y. Zhu, "Mesorasi: Architecture support for point cloud analytics via delayed-aggregation," in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 1037–1050.
[15] Z. Wang, W. Mao, P. Yang, Z. Wang, and J. Lin, "An efficient FPGA accelerator for point cloud," in 2022 IEEE 35th International System-on-Chip Conference (SOCC). IEEE, 2022, pp. 1–6.
[16] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, 2018.
[17] Y. Yu, W. Mao, J. Luo, and Z. Wang, "A low-latency framework with algorithm-hardware co-optimization for 3D point cloud," IEEE Transactions on Circuits and Systems II: Express Briefs, 2023.