Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 669-676. DOI: 10.11772/j.issn.1001-9081.2020060994

Special Topic: Artificial Intelligence

• Artificial Intelligence •

YOLOv3 compression and acceleration based on ZYNQ platform

GUO Wenxu1, SU Yuanqi1, LIU Yuehu2

  1. Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an Shaanxi 710049, China;
    2. College of Artificial Intelligence, Xi'an Jiaotong University, Xi'an Shaanxi 710049, China
  • Received: 2020-07-09 Revised: 2020-11-12 Online: 2021-03-10 Published: 2020-12-08
  • Corresponding author: SU Yuanqi
  • About the authors: GUO Wenxu (1996-), male, born in Yangling, Shaanxi, M. S. candidate, his research interests include computer vision and deep learning; SU Yuanqi (1982-), male, born in Rugao, Jiangsu, Ph. D., lecturer, CCF member, his research interests include computer vision and pattern recognition; LIU Yuehu (1962-), male, born in Hequ, Shanxi, Ph. D., professor, his research interests include computer vision and pattern recognition.
  • Supported by:
    National Natural Science Foundation of China (61973245).

YOLOv3 compression and acceleration based on ZYNQ platform

GUO Wenxu1, SU Yuanqi1, LIU Yuehu2   

  1. Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an Shaanxi 710049, China;
    2. College of Artificial Intelligence, Xi'an Jiaotong University, Xi'an Shaanxi 710049, China
  • Received:2020-07-09 Revised:2020-11-12 Online:2021-03-10 Published:2020-12-08
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61973245).

Abstract: The rapidly growing parameter counts and computational costs of high-accuracy object detection networks make them difficult to deploy directly on end devices such as vehicles and drones. To address this problem, starting from both network compression and computation acceleration, a new compression scheme for residual networks was proposed to compress YOLOv3, and the compressed network was then accelerated on the ZYNQ platform. First, a network compression algorithm covering both network pruning and network quantization was proposed. For network pruning, a pruning strategy tailored to residual structures was presented, dividing pruning into two granularities: channel pruning and residual-connection pruning; this overcomes the limitation that channel pruning alone cannot handle residual connections and further reduces the number of model parameters. For network quantization, a relative-entropy-based simulated quantization method was implemented, which quantizes the parameters channel by channel and collects online statistics of the parameter distribution and of the information loss caused by quantization, thereby helping to select the optimal quantization strategy and reducing the accuracy loss during quantization. Then, an 8-bit convolution acceleration module was designed and refined on the ZYNQ platform: the on-chip cache structure was optimized and the Winograd algorithm was incorporated to accelerate the compressed YOLOv3. Experimental results show that, compared with YOLOv3 tiny, the proposed compression algorithm further reduces the model size while improving detection accuracy by 7 percentage points; meanwhile, the hardware acceleration method on the ZYNQ platform achieves a higher energy efficiency ratio than other platforms, facilitating the practical deployment of YOLOv3 and other residual networks on ZYNQ end devices.
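The abstract only outlines the two pruning granularities. As a purely illustrative sketch (not the authors' code), the snippet below shows one common way to reconcile channel pruning with residual connections: the layers whose outputs are summed by a shortcut are scored and pruned jointly, so that their kept channel indices stay aligned. The per-channel importance array, the keep ratio, and the function names are assumptions introduced here for illustration.

```python
import numpy as np

def select_channels(importance, keep_ratio):
    """Keep the indices of the most important channels of a single layer."""
    k = max(1, int(round(len(importance) * keep_ratio)))
    return np.sort(np.argsort(importance)[::-1][:k])

def select_channels_residual(importance_per_layer, keep_ratio):
    """Layers whose outputs are added by a shortcut must keep the same
    channel indices; aggregate their importance and select only once."""
    group_importance = np.sum(np.stack(importance_per_layer), axis=0)
    return select_channels(group_importance, keep_ratio)

# Toy usage: two convolution outputs feeding the same residual addition.
rng = np.random.default_rng(0)
imp_a, imp_b = rng.random(64), rng.random(64)
kept = select_channels_residual([imp_a, imp_b], keep_ratio=0.5)
print(len(kept), kept[:8])
```

In a real network, the shared index set would then be used to slice the convolution weights and batch-normalization parameters of every layer in the group, so the shortcut addition stays shape-consistent after pruning.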

Key words: object detection, neural network compression, computation acceleration, network pruning, network quantization, ZYNQ platform

Abstract: Object detection networks with high accuracy are hard to deploy directly on end devices such as vehicles and drones because of the sharp increase in their parameters and computational cost. To solve this problem, considering both network compression and computation acceleration, a new compression scheme for residual networks was proposed to compress YOLOv3 (You Only Look Once v3), and the compressed network was then accelerated on the ZYNQ platform. Firstly, a network compression algorithm containing both network pruning and network quantization was proposed. In the aspect of network pruning, a strategy for residual structures was introduced to divide network pruning into two granularities: channel pruning and residual connection pruning, which overcame the limitation of channel pruning on residual connections and further reduced the number of model parameters. In the aspect of network quantization, a relative entropy-based simulated quantization was utilized to quantize the parameters channel by channel and to collect online statistics of the parameter distribution and of the information loss caused by parameter quantization, so as to help choose the best quantization strategy and reduce the precision loss during quantization. Secondly, an 8-bit convolution acceleration module was designed and optimized on the ZYNQ platform, which improved the on-chip cache structure and, combined with the Winograd algorithm, accelerated the compressed YOLOv3. Experimental results show that the proposed solution achieves a smaller model size than YOLOv3 tiny while increasing detection accuracy by 7 percentage points. Meanwhile, the hardware acceleration method on the ZYNQ platform achieves a higher energy efficiency ratio than other platforms, thus facilitating the practical deployment of YOLOv3 and other residual networks on ZYNQ end devices.
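The relative-entropy-based simulated quantization is described only at a high level in the abstract. As a hedged illustration (not the paper's implementation), the sketch below follows the widely used KL-divergence calibration idea: histogram the values, search for the clipping threshold whose clipped-and-requantized distribution stays closest, in relative entropy, to the original one, and derive one symmetric 8-bit scale per channel. The bin count, search range, per-output-channel layout, and all function names are assumptions made here for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy between two (unnormalized) histograms."""
    p = p / max(p.sum(), eps)
    q = q / max(q.sum(), eps)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def best_threshold_kl(values, num_bins=2048, num_levels=128):
    """Search the clipping threshold whose 8-bit approximation minimizes
    the relative entropy with respect to the original distribution."""
    hist, edges = np.histogram(np.abs(values), bins=num_bins)
    best_t, best_kl = edges[-1], float("inf")
    for i in range(num_levels, num_bins + 1):
        ref = hist[:i].astype(np.float64).copy()
        ref[i - 1] += hist[i:].sum()          # fold clipped outliers into the last bin
        # Requantize the first i bins down to num_levels bins, then expand back.
        chunks = np.array_split(hist[:i].astype(np.float64), num_levels)
        cand, start = np.zeros(i), 0
        for c in chunks:
            nz = (c > 0).sum()
            if nz > 0:
                cand[start:start + len(c)][c > 0] = c.sum() / nz
            start += len(c)
        kl = kl_divergence(ref, cand)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

def quantize_per_channel(weights, axis=0):
    """Symmetric 8-bit simulated quantization with one scale per output channel."""
    w = np.moveaxis(weights, axis, 0).reshape(weights.shape[axis], -1)
    # Guard against degenerate (all-zero) channels when deriving the scale.
    scales = np.array([max(best_threshold_kl(ch), 1e-8) for ch in w]) / 127.0
    q = np.clip(np.round(w / scales[:, None]), -128, 127)
    return q.astype(np.int8), scales
```

Keeping one scale per channel, as the abstract describes, lets channels with very different dynamic ranges be clipped independently; the online statistics of the parameter distribution and of the quantization-induced information loss are what drive this threshold choice.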

Key words: object detection, neural network compression, computation acceleration, network pruning, network quantization, ZYNQ platform

CLC Number: