Design of FPGA accelerator with high parallelism for convolution neural network

doi:10.11772/j.issn.1001-9081.2020060996

Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 812-819.DOI: 10.11772/j.issn.1001-9081.2020060996

Special Issue: Advanced Computing

WANG Xiaofeng^1,2, JIANG Penglong^1,2, ZHOU Hui^1,2, ZHAO Xiongbo^1,2

1. Beijing Aerospace Automatic Control Institute, Beijing 100854, China;
2. National Key Laboratory of Science and Technology on Aerospace Intelligence Control, Beijing 100854, China

Received: 2020-07-09 Revised: 2020-10-12 Online: 2020-12-17 Published: 2021-03-10
Supported by:
This work is partially supported by the Military Scientific Research Project and the Innovative Research and Development Project of the China Academy of Launch Vehicle Technology.

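Corresponding author: WANG Xiaofeng
About the authors: WANG Xiaofeng (1995-), male, born in Guyuan, Ningxia, M.S. candidate; main research interest: high-performance computing. JIANG Penglong (1978-), male, born in Fenghua, Zhejiang, research fellow, M.S.; main research interest: integrated design of flight vehicle systems. ZHOU Hui (1984-), male, born in Xianyang, Shaanxi, senior engineer, M.S.; main research interest: microsystem integration. ZHAO Xiongbo (1986-), male, born in Xiantao, Hubei, senior engineer, M.S.; main research interest: microsystem integration.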
Abstract

Most algorithms based on Convolutional Neural Networks (CNNs) are computation-intensive and memory-intensive, which makes them difficult to apply in embedded fields with low-power requirements, such as aerospace, mobile robotics, and smartphones. To solve this problem, a Field Programmable Gate Array (FPGA) accelerator with high parallelism for CNNs was proposed. Firstly, the four kinds of parallelism in CNN algorithms that can be exploited for FPGA acceleration were compared and studied. Then, a Multi-channel Convolutional Rotating-register Pipeline (MCRP) structure was proposed to exploit the convolution kernel (intra-kernel) parallelism of CNN algorithms concisely and effectively. Finally, using a strategy that combines input/output channel parallelism with convolution kernel parallelism, a highly parallel CNN accelerator architecture was proposed based on the MCRP structure. To verify the rationality of the design, the architecture was deployed on the Xilinx XCZU9EG chip. With the on-chip Digital Signal Processor (DSP) resources fully used, the peak computing capacity of the proposed CNN accelerator reached 2 304 GOPS (Giga Operations Per Second). With the SSD-300 algorithm as the test object, the accelerator achieved an actual computing capacity of 1 830.33 GOPS and a hardware utilization rate of 79.44%. Experimental results show that the MCRP structure can effectively improve the computing capacity of a CNN accelerator, and that a CNN accelerator based on the MCRP structure can meet the computing capacity requirements of most applications in embedded fields.
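
As a quick arithmetic check of the figures quoted above, the following Python sketch recomputes the hardware utilization rate. It assumes the rate is defined as the ratio of actual to peak computing capacity; the abstract does not state the definition, and the function and constant names are illustrative rather than taken from the paper.

```python
# Figures quoted in the abstract (GOPS = giga operations per second).
PEAK_GOPS = 2304.0             # peak computing capacity on the XCZU9EG
ACTUAL_GOPS_SSD300 = 1830.33   # measured computing capacity on SSD-300


def hardware_utilization(actual_gops: float, peak_gops: float) -> float:
    """Return actual/peak throughput as a percentage.

    Assumption: utilization = actual / peak. The abstract does not
    spell out the definition, so this is only a plausibility check.
    """
    if peak_gops <= 0:
        raise ValueError("peak_gops must be positive")
    return 100.0 * actual_gops / peak_gops


utilization = hardware_utilization(ACTUAL_GOPS_SSD300, PEAK_GOPS)
print(f"{utilization:.2f}%")  # prints 79.44%
```

Under that assumed definition, the two throughput figures reproduce the reported 79.44%.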

Key words: Convolutional Neural Network (CNN), high performance, hardware accelerator, parallelism, Field Programmable Gate Array (FPGA)

CLC Number: TP391

WANG Xiaofeng, JIANG Penglong, ZHOU Hui, ZHAO Xiongbo. Design of FPGA accelerator with high parallelism for convolution neural network[J]. Journal of Computer Applications, 2021, 41(3): 812-819.
