Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 812-819. DOI: 10.11772/j.issn.1001-9081.2020060996

Special Topic: Advanced Computing


Design of FPGA accelerator with high parallelism for convolutional neural network

WANG Xiaofeng1,2, JIANG Penglong1,2, ZHOU Hui1,2, ZHAO Xiongbo1,2   

  1. Beijing Aerospace Automatic Control Institute, Beijing 100854, China;
    2. National Key Laboratory of Science and Technology on Aerospace Intelligence Control, Beijing 100854, China
  • Received: 2020-07-09  Revised: 2020-10-12  Online: 2021-03-10  Published: 2020-12-17
  • Corresponding author: WANG Xiaofeng
  • About the authors: WANG Xiaofeng, born in 1995 in Guyuan, Ningxia, is an M.S. candidate whose main research interest is high-performance computing; JIANG Penglong, born in 1978 in Fenghua, Zhejiang, is a research fellow with an M.S. whose main research interest is integrated design of aircraft systems; ZHOU Hui, born in 1984 in Xianyang, Shaanxi, is a senior engineer with an M.S. whose main research interest is microsystem integration; ZHAO Xiongbo, born in 1986 in Xiantao, Hubei, is a senior engineer with an M.S. whose main research interest is microsystem integration.
  • Supported by:
    This work is partially supported by the Military Scientific Research Project and the Innovative Research and Development Project of China Academy of Launch Vehicle Technology.

Abstract: Most algorithms based on Convolutional Neural Network (CNN) are computation-intensive and memory-intensive, which makes them difficult to apply in embedded fields with low-power requirements such as aerospace, mobile robotics and smartphones. To solve this problem, a Field Programmable Gate Array (FPGA) accelerator with high parallelism for CNN was proposed. Firstly, four kinds of parallelism in the CNN algorithm that can be exploited for FPGA acceleration were compared and studied. Then, a Multi-channel Convolutional Rotating-register Pipeline (MCRP) structure was proposed to utilize the convolution kernel parallelism of the CNN algorithm concisely and effectively. Finally, by adopting the scheme of input/output channel parallelism plus convolution kernel parallelism, a CNN accelerator architecture with high parallelism was proposed based on the MCRP structure and deployed on the XILINX XCZU9EG chip to verify the rationality of the design. With the on-chip Digital Signal Processor (DSP) resources fully utilized, the peak computing capacity of the proposed CNN accelerator reached 2 304 GOPS (Giga Operations Per Second). Taking the SSD-300 algorithm as the test object, the CNN accelerator achieved an actual computing capacity of 1 830.33 GOPS with a hardware utilization rate of 79.44%. Experimental results show that the MCRP structure can effectively improve the computing capacity of a CNN accelerator, and that a CNN accelerator based on the MCRP structure can meet the computing capacity requirements of most applications in embedded fields.
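
As a concrete reading of the parallelization scheme described in the abstract, the loop nest below groups a standard convolution the way an accelerator with output-channel, input-channel and convolution-kernel parallelism would unroll it. This is a minimal illustrative sketch only: the unroll factors T_co and T_ci are placeholders chosen for the example, not values taken from the paper, and the sketch does not model the MCRP rotating-register pipeline itself.

    import numpy as np

    def conv_layer_unrolled(x, w, T_co=8, T_ci=8):
        # x: input feature map, shape (CI, H_in, W_in)
        # w: weights, shape (CO, CI, K, K); stride 1, no padding
        CO, CI, K, _ = w.shape
        H, W = x.shape[1] - K + 1, x.shape[2] - K + 1
        y = np.zeros((CO, H, W))
        for co0 in range(0, CO, T_co):        # output-channel tiles: one processing-element group per tile
            for ci0 in range(0, CI, T_ci):    # input-channel tiles: parallel multipliers within each group
                for oh in range(H):
                    for ow in range(W):
                        # the T_co * T_ci * K * K multiply-accumulates below are what the
                        # hardware would evaluate concurrently (channel + kernel parallelism)
                        for co in range(co0, min(co0 + T_co, CO)):
                            for ci in range(ci0, min(ci0 + T_ci, CI)):
                                for kh in range(K):
                                    for kw in range(K):
                                        y[co, oh, ow] += w[co, ci, kh, kw] * x[ci, oh + kh, ow + kw]
        return y

The hardware utilization figure quoted above follows directly from the two reported throughput numbers: 1 830.33 GOPS / 2 304 GOPS ≈ 79.44%.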

Key words: Convolutional Neural Network (CNN), high performance, hardware accelerator, parallelism, Field Programmable Gate Array (FPGA)
