Single-precision floating-point general matrix multiply optimization for machine translation based on ARMv8 architecture
GONG Mingqing, YE Huang, ZHANG Jian, LU Xingjing, CHEN Wei
Journal of Computer Applications    2019, 39 (6): 1557-1562.   DOI: 10.11772/j.issn.1001-9081.2018122608
Aiming at the inefficiency of neural network inference performed on mobile intelligent devices with ARM processors, an optimization scheme for the Single-precision floating-point GEneral Matrix Multiply (SGEMM) algorithm based on the ARMv8 architecture was proposed. Firstly, it was determined that the computational efficiency of an ARMv8-based processor executing the SGEMM algorithm is limited by the usage scheme of the vectorized computation unit, the instruction pipeline, and the probability of cache misses. Secondly, three optimization techniques, vector-instruction inline assembly, data rearrangement, and data prefetching, were implemented to address these three limiting factors; a sketch of the corresponding kernel structure is given below. Finally, test experiments were designed based on three matrix patterns commonly used in speech-oriented neural networks, and the programs were run on the RK3399 hardware platform. The experimental results show that the single-core computing speed is 10.23 GFLOPS in square matrix mode, reaching 78.2% of the measured floating-point peak; 6.35 GFLOPS in slender matrix mode, reaching 48.1% of the measured peak; and 2.53 GFLOPS in continuous small matrix mode, reaching 19.2% of the measured peak. With the optimized SGEMM algorithm deployed into a speech recognition neural network program, the actual speech recognition speed of the program is significantly improved.
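A minimal sketch of the kind of kernel the abstract describes, not the authors' code: a 4x4 SGEMM micro-kernel for ARMv8 written with NEON intrinsics (rather than the inline assembly the paper uses), showing the three techniques together -- vectorized fused multiply-accumulate, pre-packed (rearranged) A and B panels read contiguously, and software prefetching. The 4x4 blocking, packing layout and prefetch distance are illustrative assumptions.

    #include <arm_neon.h>

    /* C (4x4 block, row-major, leading dimension ldc) += A_packed * B_packed.
     * A_packed holds one 4-element column of A per k step; B_packed holds
     * one 4-element row of B per k step, so both are read contiguously. */
    static void sgemm_kernel_4x4(int K, const float *A_packed,
                                 const float *B_packed, float *C, int ldc)
    {
        float32x4_t c0 = vld1q_f32(C + 0 * ldc);
        float32x4_t c1 = vld1q_f32(C + 1 * ldc);
        float32x4_t c2 = vld1q_f32(C + 2 * ldc);
        float32x4_t c3 = vld1q_f32(C + 3 * ldc);

        for (int k = 0; k < K; ++k) {
            /* Prefetch packed data a few iterations ahead to hide cache-miss latency. */
            __builtin_prefetch(A_packed + 64, 0, 0);
            __builtin_prefetch(B_packed + 64, 0, 0);

            float32x4_t a = vld1q_f32(A_packed);   /* column k of the A panel */
            float32x4_t b = vld1q_f32(B_packed);   /* row k of the B panel    */

            /* Fused multiply-accumulate by broadcast lane: c_i += a[i] * b. */
            c0 = vfmaq_laneq_f32(c0, b, a, 0);
            c1 = vfmaq_laneq_f32(c1, b, a, 1);
            c2 = vfmaq_laneq_f32(c2, b, a, 2);
            c3 = vfmaq_laneq_f32(c3, b, a, 3);

            A_packed += 4;
            B_packed += 4;
        }

        vst1q_f32(C + 0 * ldc, c0);
        vst1q_f32(C + 1 * ldc, c1);
        vst1q_f32(C + 2 * ldc, c2);
        vst1q_f32(C + 3 * ldc, c3);
    }

In a full SGEMM, an outer loop would pack tiles of A and B into this layout and call the micro-kernel over the matrix, which is where the slender and continuous small-matrix cases differ from the square case.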
First-principle nonlocal projector potential calculation on GPU cluster
FU Jiyun, JIA Weile, CAO Zongyan, WANG Long, YE Huang, CHI Xuebin
Journal of Computer Applications    2013, 33 (06): 1540-1552.   DOI: 10.3724/SP.J.1087.2013.01540
Plane Wave Pseudopotential (PWP) Density Functional Theory (DFT) calculation is the most widely used method for material simulation. The projector calculation plays an important part in the self-consistent iterative solution of PWP-DFT, and it often becomes a hindrance to the speed-up of the software. Therefore, according to the features of the Graphics Processing Unit (GPU), a speed-up algorithm was proposed: 1) using a new parallel mechanism to solve the potential energy of the nonlocal projectors; 2) redesigning the data distribution structure; 3) reducing the use of memory; 4) proposing a solution to the related data problems of the algorithm. An 18-57 times acceleration was eventually obtained, reaching 12 seconds per step in molecular dynamics simulation. In this paper, the testing time of running this model on the GPU platform was analysed in detail, and the calculation bottleneck of deploying this method onto GPU clusters was discussed.
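For orientation only, a plain-C reference sketch (not the authors' GPU code) of the core nonlocal-projector energy evaluation: the overlaps between projector coefficient vectors and wavefunction coefficients are dense inner products, and storing all projectors contiguously so these overlaps become a small matrix product is what makes the work map well onto GPU matrix units. The array layout, names, and the real-valued, diagonal-coefficient simplification are illustrative assumptions, not details from the paper.

    #include <stddef.h>

    /* overlap[p][n] = sum_g proj[p][g] * psi[n][g]
     * proj: n_proj x n_pw projector coefficients, packed contiguously
     * psi : n_band x n_pw wavefunction coefficients
     * E_nl = sum_n sum_p d[p] * overlap[p][n]^2  (diagonal coefficients d) */
    double nonlocal_energy(size_t n_proj, size_t n_band, size_t n_pw,
                           const double *proj, const double *psi,
                           const double *d)
    {
        double energy = 0.0;
        for (size_t p = 0; p < n_proj; ++p) {
            for (size_t n = 0; n < n_band; ++n) {
                double overlap = 0.0;
                for (size_t g = 0; g < n_pw; ++g)
                    overlap += proj[p * n_pw + g] * psi[n * n_pw + g];
                energy += d[p] * overlap * overlap;
            }
        }
        return energy;
    }

On a GPU the two outer loops collapse into a single dense matrix-matrix product over the packed projector block, which is the reformulation that gives the kind of acceleration the abstract reports.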