Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (6): 1557-1562. DOI: 10.11772/j.issn.1001-9081.2018122608

• 2018 National Annual Conference on High Performance Computing (HPC China 2018) •

Single-precision floating-point general matrix multiply optimization for machine translation based on ARMv8 architecture

GONG Mingqing1,2,3, YE Huang1, ZHANG Jian1, LU Xingjing3, CHEN Wei3   

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China;
    2. University of Chinese Academy of Sciences, Beijing 100049, China;
    3. Beijing Sogou Technology Development Company Limited, Beijing 100084, China
  • Received: 2018-12-12  Revised: 2019-02-28  Online: 2019-06-17  Published: 2019-06-10
  • Supported by:
    This work is partially supported by the National Key R&D Program of China (2016YFB0201100, 2017YFB0202803), the National Natural Science Foundation of China (11871454, 91630204, 61531166003), the Strategic Priority Research Program of the Chinese Academy of Sciences (Category B) (XDB22020102), and the e-Science Foundation of the Chinese Academy of Sciences (XXH13506-204).

  • Corresponding author: ZHANG Jian
  • About the authors: GONG Mingqing (龚鸣清), born in 1994 in Huanggang, Hubei, is an M.S. candidate whose main research interests include high performance computing and machine learning. YE Huang (叶煌), born in 1979 in Tonggu, Jiangxi, Ph.D., is an associate research fellow whose main research interest is high performance computing. ZHANG Jian (张鉴), born in 1972 in Beijing, Ph.D., is a research fellow, doctoral supervisor, and CCF member whose main research interests include high performance computing, scientific computing, and scientific visualization. LU Xingjing (卢兴敬), born in 1983 in Linyi, Shandong, Ph.D., is a CCF member whose main research interests include high performance computing, deep learning, parallel programming, and compiler techniques. CHEN Wei (陈伟), born in 1984 in Hohhot, Inner Mongolia, Ph.D., is a CCF member whose main research interests include human-computer interaction, machine translation, and deep learning.

Abstract: To address the low efficiency of neural network inference on mobile intelligent devices that use ARM processors, an optimization scheme for the Single-precision floating-point GEneral Matrix Multiply (SGEMM) algorithm on the ARMv8 architecture was proposed. Firstly, it was determined that the computational efficiency of an ARMv8-based processor executing the SGEMM algorithm is limited by how the vectorized computation units are used, by the instruction pipeline arrangement, and by the probability of cache misses. Secondly, three optimization techniques, namely vector-instruction inline assembly, data rearrangement, and data prefetching, were implemented to address these three limiting factors. Finally, test experiments were designed around three matrix patterns commonly found in speech-oriented neural networks, and the programs were run on the RK3399 hardware platform. The experimental results show that the single-core computing speed reaches 10.23 GFLOPS in the square matrix mode, 78.2% of the measured floating-point peak; 6.35 GFLOPS in the slender (tall-and-thin) matrix mode, 48.1% of the measured peak; and 2.53 GFLOPS in the continuous small matrix mode, 19.2% of the measured peak. After the optimized SGEMM algorithm was deployed in a speech recognition neural network program, the actual speech recognition speed of the program improved significantly.
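
To make the three optimization techniques in the abstract concrete, the following is a minimal C sketch for AArch64, not the paper's actual code: it uses NEON intrinsics as a stand-in for the vector-instruction inline assembly described above, packs a K x 4 panel of B contiguously to illustrate data rearrangement, and issues software prefetches. The function names pack_b_panel and kernel_4x4, the 4x4 blocking, and the prefetch distance are illustrative assumptions rather than details reported in the paper.

/*
 * Sketch of a 4x4 SGEMM micro-kernel (C += A * B) for ARMv8 (AArch64).
 * Assumptions (not from the paper): row-major C with leading dimension ldc,
 * A panel packed column by column (4 floats per column), B panel packed
 * row by row (4 floats per row).
 */
#include <arm_neon.h>
#include <stddef.h>

/* Data rearrangement: copy a K x 4 panel of row-major B (leading dimension
 * ldb) into contiguous storage so the kernel streams through it linearly. */
void pack_b_panel(size_t K, const float *B, size_t ldb, float *Bp)
{
    for (size_t k = 0; k < K; ++k) {
        Bp[4 * k + 0] = B[k * ldb + 0];
        Bp[4 * k + 1] = B[k * ldb + 1];
        Bp[4 * k + 2] = B[k * ldb + 2];
        Bp[4 * k + 3] = B[k * ldb + 3];
    }
}

/* Vectorized micro-kernel: C[0..3][0..3] += Ap(4 x K) * Bp(K x 4). */
void kernel_4x4(size_t K, const float *Ap, const float *Bp,
                float *C, size_t ldc)
{
    float32x4_t c0 = vld1q_f32(C + 0 * ldc);
    float32x4_t c1 = vld1q_f32(C + 1 * ldc);
    float32x4_t c2 = vld1q_f32(C + 2 * ldc);
    float32x4_t c3 = vld1q_f32(C + 3 * ldc);

    for (size_t k = 0; k < K; ++k) {
        /* Data prefetching: request panel data a few iterations ahead to
         * reduce cache misses (a prefetch is only a hint and does not fault). */
        __builtin_prefetch(Ap + 4 * (k + 8));
        __builtin_prefetch(Bp + 4 * (k + 8));

        float32x4_t a = vld1q_f32(Ap + 4 * k);  /* column k of the A panel */
        float32x4_t b = vld1q_f32(Bp + 4 * k);  /* row k of the B panel    */

        /* One fused multiply-add per output row, broadcasting one lane of a. */
        c0 = vfmaq_laneq_f32(c0, b, a, 0);
        c1 = vfmaq_laneq_f32(c1, b, a, 1);
        c2 = vfmaq_laneq_f32(c2, b, a, 2);
        c3 = vfmaq_laneq_f32(c3, b, a, 3);
    }

    vst1q_f32(C + 0 * ldc, c0);
    vst1q_f32(C + 1 * ldc, c1);
    vst1q_f32(C + 2 * ldc, c2);
    vst1q_f32(C + 3 * ldc, c3);
}

A full SGEMM would tile M, N, and K around such a micro-kernel and pack panels of A in the same way; the GFLOPS figures in the abstract come from the authors' optimized implementation, not from this sketch. As a consistency check, the reported speeds and percentages imply a measured single-core floating-point peak of roughly 13.1 GFLOPS (for example, 10.23 / 0.782 ≈ 13.1).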

Key words: ARMv8, Single Instruction Multiple Data (SIMD), Basic Linear Algebra Subprograms (BLAS), high performance computing
