Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 812-819. DOI: 10.11772/j.issn.1001-9081.2020060996

Special Topic: Advanced Computing


Design of FPGA accelerator with high parallelism for convolutional neural network

WANG Xiaofeng1,2, JIANG Penglong1,2, ZHOU Hui1,2, ZHAO Xiongbo1,2   

  1. Beijing Aerospace Automatic Control Institute, Beijing 100854, China;
    2. National Key Laboratory of Science and Technology on Aerospace Intelligence Control, Beijing 100854, China
  • Received: 2020-07-09  Revised: 2020-10-12  Online: 2021-03-10  Published: 2020-12-17
  • Corresponding author: WANG Xiaofeng
  • About the authors: WANG Xiaofeng, born in 1995 in Guyuan, Ningxia, is an M.S. candidate whose main research interest is high-performance computing; JIANG Penglong, born in 1978 in Fenghua, Zhejiang, is a research fellow with an M.S. whose main research interest is integrated design of aircraft systems; ZHOU Hui, born in 1984 in Xianyang, Shaanxi, is a senior engineer with an M.S. whose main research interest is microsystem integration; ZHAO Xiongbo, born in 1986 in Xiantao, Hubei, is a senior engineer with an M.S. whose main research interest is microsystem integration.
  • Supported by:
    This work is partially supported by the Military Scientific Research Project and the Innovative Research and Development Project of China Academy of Launch Vehicle Technology.

Abstract: Most algorithms based on Convolutional Neural Network (CNN) are computation-intensive and memory-intensive, which makes them difficult to apply in embedded fields with low-power requirements such as aerospace, mobile robotics and smartphones. To solve this problem, a Field Programmable Gate Array (FPGA) accelerator with high parallelism for CNN was proposed. Firstly, four kinds of parallelism in the CNN algorithm that can be exploited for FPGA acceleration were compared and studied. Then, a Multi-channel Convolutional Rotating-register Pipeline (MCRP) structure was proposed to utilize the convolution kernel parallelism of the CNN algorithm concisely and effectively. Finally, by adopting the scheme of input/output channel parallelism plus convolution kernel parallelism, a CNN accelerator architecture with high parallelism was proposed based on the MCRP structure and deployed on the XILINX XCZU9EG chip to verify the rationality of the design. With the on-chip Digital Signal Processor (DSP) resources fully utilized, the peak computing capacity of the proposed CNN accelerator reached 2 304 GOPS (Giga Operations Per Second). Taking the SSD-300 algorithm as the test object, the CNN accelerator achieved an actual computing capacity of 1 830.33 GOPS with a hardware utilization rate of 79.44%. Experimental results show that the MCRP structure can effectively improve the computing capacity of a CNN accelerator, and that a CNN accelerator based on the MCRP structure can meet the computing capacity requirements of most applications in embedded fields.
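
As a concrete reading of the parallelization scheme described in the abstract, the loop nest below groups a standard convolution the way an accelerator with output-channel, input-channel and convolution-kernel parallelism would unroll it. This is a minimal illustrative sketch only: the unroll factors T_co and T_ci are placeholders chosen for the example, not values taken from the paper, and the sketch does not model the MCRP rotating-register pipeline itself.

    import numpy as np

    def conv_layer_unrolled(x, w, T_co=8, T_ci=8):
        # x: input feature map, shape (CI, H_in, W_in)
        # w: weights, shape (CO, CI, K, K); stride 1, no padding
        CO, CI, K, _ = w.shape
        H, W = x.shape[1] - K + 1, x.shape[2] - K + 1
        y = np.zeros((CO, H, W))
        for co0 in range(0, CO, T_co):        # output-channel tiles: one processing-element group per tile
            for ci0 in range(0, CI, T_ci):    # input-channel tiles: parallel multipliers within each group
                for oh in range(H):
                    for ow in range(W):
                        # the T_co * T_ci * K * K multiply-accumulates below are what the
                        # hardware would evaluate concurrently (channel + kernel parallelism)
                        for co in range(co0, min(co0 + T_co, CO)):
                            for ci in range(ci0, min(ci0 + T_ci, CI)):
                                for kh in range(K):
                                    for kw in range(K):
                                        y[co, oh, ow] += w[co, ci, kh, kw] * x[ci, oh + kh, ow + kw]
        return y

The hardware utilization figure quoted above follows directly from the two reported throughput numbers: 1 830.33 GOPS / 2 304 GOPS ≈ 79.44%.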

Key words: Convolutional Neural Network (CNN), high performance, hardware accelerator, parallelism, Field Programmable Gate Array (FPGA)
