Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (6): 1557-1562. DOI: 10.11772/j.issn.1001-9081.2018122608

• 2018 National Annual Conference on High Performance Computing (HPC China 2018) •

Single-precision floating-point general matrix multiply optimization for machine translation based on ARMv8 architecture

GONG Mingqing1,2,3, YE Huang1, ZHANG Jian1, LU Xingjing3, CHEN Wei3   

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China;
    2. University of Chinese Academy of Sciences, Beijing 100049, China;
    3. Beijing Sogou Technology Development Company Limited, Beijing 100084, China
  • Received: 2018-12-12  Revised: 2019-02-28  Online: 2019-06-17  Published: 2019-06-10
  • Supported by:
    This work is partially supported by the National Key R&D Program of China (2016YFB0201100, 2017YFB0202803), the National Natural Science Foundation of China (11871454, 91630204, 61531166003), the Strategic Priority Research Program of the Chinese Academy of Sciences (Category B) (XDB22020102), and the e-Science Foundation of the Chinese Academy of Sciences (XXH13506-204).

  • Corresponding author: ZHANG Jian
  • About the authors: GONG Mingqing (龚鸣清), born in 1994 in Huanggang, Hubei, is an M.S. candidate whose main research interests include high performance computing and machine learning. YE Huang (叶煌), born in 1979 in Tonggu, Jiangxi, Ph.D., is an associate research fellow whose main research interest is high performance computing. ZHANG Jian (张鉴), born in 1972 in Beijing, Ph.D., is a research fellow, doctoral supervisor, and CCF member whose main research interests include high performance computing, scientific computing, and scientific visualization. LU Xingjing (卢兴敬), born in 1983 in Linyi, Shandong, Ph.D., is a CCF member whose main research interests include high performance computing, deep learning, parallel programming, and compiler techniques. CHEN Wei (陈伟), born in 1984 in Hohhot, Inner Mongolia, Ph.D., is a CCF member whose main research interests include human-computer interaction, machine translation, and deep learning.

Abstract: To address the low efficiency of neural network inference on mobile intelligent devices that use ARM processors, an optimization scheme for the Single-precision floating-point GEneral Matrix Multiply (SGEMM) algorithm on the ARMv8 architecture was proposed. Firstly, it was determined that the computational efficiency of an ARMv8-based processor executing the SGEMM algorithm is limited by how the vectorized computation units are used, by the instruction pipeline arrangement, and by the probability of cache misses. Secondly, three optimization techniques, namely vector-instruction inline assembly, data rearrangement, and data prefetching, were implemented to address these three limiting factors. Finally, test experiments were designed around three matrix patterns commonly found in speech-oriented neural networks, and the programs were run on the RK3399 hardware platform. The experimental results show that the single-core computing speed reaches 10.23 GFLOPS in the square matrix mode, 78.2% of the measured floating-point peak; 6.35 GFLOPS in the slender (tall-and-thin) matrix mode, 48.1% of the measured peak; and 2.53 GFLOPS in the continuous small matrix mode, 19.2% of the measured peak. After the optimized SGEMM algorithm was deployed in a speech recognition neural network program, the actual speech recognition speed of the program improved significantly.
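
To make the three optimization techniques in the abstract concrete, the following is a minimal C sketch for AArch64, not the paper's actual code: it uses NEON intrinsics as a stand-in for the vector-instruction inline assembly described above, packs a K x 4 panel of B contiguously to illustrate data rearrangement, and issues software prefetches. The function names pack_b_panel and kernel_4x4, the 4x4 blocking, and the prefetch distance are illustrative assumptions rather than details reported in the paper.

/*
 * Sketch of a 4x4 SGEMM micro-kernel (C += A * B) for ARMv8 (AArch64).
 * Assumptions (not from the paper): row-major C with leading dimension ldc,
 * A panel packed column by column (4 floats per column), B panel packed
 * row by row (4 floats per row).
 */
#include <arm_neon.h>
#include <stddef.h>

/* Data rearrangement: copy a K x 4 panel of row-major B (leading dimension
 * ldb) into contiguous storage so the kernel streams through it linearly. */
void pack_b_panel(size_t K, const float *B, size_t ldb, float *Bp)
{
    for (size_t k = 0; k < K; ++k) {
        Bp[4 * k + 0] = B[k * ldb + 0];
        Bp[4 * k + 1] = B[k * ldb + 1];
        Bp[4 * k + 2] = B[k * ldb + 2];
        Bp[4 * k + 3] = B[k * ldb + 3];
    }
}

/* Vectorized micro-kernel: C[0..3][0..3] += Ap(4 x K) * Bp(K x 4). */
void kernel_4x4(size_t K, const float *Ap, const float *Bp,
                float *C, size_t ldc)
{
    float32x4_t c0 = vld1q_f32(C + 0 * ldc);
    float32x4_t c1 = vld1q_f32(C + 1 * ldc);
    float32x4_t c2 = vld1q_f32(C + 2 * ldc);
    float32x4_t c3 = vld1q_f32(C + 3 * ldc);

    for (size_t k = 0; k < K; ++k) {
        /* Data prefetching: request panel data a few iterations ahead to
         * reduce cache misses (a prefetch is only a hint and does not fault). */
        __builtin_prefetch(Ap + 4 * (k + 8));
        __builtin_prefetch(Bp + 4 * (k + 8));

        float32x4_t a = vld1q_f32(Ap + 4 * k);  /* column k of the A panel */
        float32x4_t b = vld1q_f32(Bp + 4 * k);  /* row k of the B panel    */

        /* One fused multiply-add per output row, broadcasting one lane of a. */
        c0 = vfmaq_laneq_f32(c0, b, a, 0);
        c1 = vfmaq_laneq_f32(c1, b, a, 1);
        c2 = vfmaq_laneq_f32(c2, b, a, 2);
        c3 = vfmaq_laneq_f32(c3, b, a, 3);
    }

    vst1q_f32(C + 0 * ldc, c0);
    vst1q_f32(C + 1 * ldc, c1);
    vst1q_f32(C + 2 * ldc, c2);
    vst1q_f32(C + 3 * ldc, c3);
}

A full SGEMM would tile M, N, and K around such a micro-kernel and pack panels of A in the same way; the GFLOPS figures in the abstract come from the authors' optimized implementation, not from this sketch. As a consistency check, the reported speeds and percentages imply a measured single-core floating-point peak of roughly 13.1 GFLOPS (for example, 10.23 / 0.782 ≈ 13.1).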

Key words: ARMv8, Single Instruction Multiple Data (SIMD), Basic Linear Algebra Subprograms (BLAS), high performance computing
