Single-precision floating-point general matrix multiply optimization for machine translation based on ARMv8 architecture
GONG Mingqing, YE Huang, ZHANG Jian, LU Xingjing, CHEN Wei
Journal of Computer Applications    2019, 39 (6): 1557-1562.   DOI: 10.11772/j.issn.1001-9081.2018122608
Aiming at the inefficiency of neural network inference performed on mobile intelligent devices with ARM processors, an optimization scheme for the Single-precision floating-point GEneral Matrix Multiply (SGEMM) algorithm based on the ARMv8 architecture was proposed. Firstly, it was determined that the computational efficiency of an ARMv8-based processor executing the SGEMM algorithm is limited by the usage scheme of the vectorized computation unit, the instruction pipeline, and the probability of cache misses. Secondly, three optimization techniques, vector-instruction inline assembly, data rearrangement, and data prefetching, were implemented to address these three limiting factors; a sketch of the corresponding kernel structure is given below. Finally, test experiments were designed based on three matrix patterns commonly used in speech-oriented neural networks, and the programs were run on the RK3399 hardware platform. The experimental results show that the single-core computing speed is 10.23 GFLOPS in square matrix mode, reaching 78.2% of the measured floating-point peak; 6.35 GFLOPS in slender matrix mode, reaching 48.1% of the measured peak; and 2.53 GFLOPS in continuous small matrix mode, reaching 19.2% of the measured peak. With the optimized SGEMM algorithm deployed into a speech recognition neural network program, the actual speech recognition speed of the program is significantly improved.
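A minimal sketch of the kind of kernel the abstract describes, not the authors' code: a 4x4 SGEMM micro-kernel for ARMv8 written with NEON intrinsics (rather than the inline assembly the paper uses), showing the three techniques together -- vectorized fused multiply-accumulate, pre-packed (rearranged) A and B panels read contiguously, and software prefetching. The 4x4 blocking, packing layout and prefetch distance are illustrative assumptions.

    #include <arm_neon.h>

    /* C (4x4 block, row-major, leading dimension ldc) += A_packed * B_packed.
     * A_packed holds one 4-element column of A per k step; B_packed holds
     * one 4-element row of B per k step, so both are read contiguously. */
    static void sgemm_kernel_4x4(int K, const float *A_packed,
                                 const float *B_packed, float *C, int ldc)
    {
        float32x4_t c0 = vld1q_f32(C + 0 * ldc);
        float32x4_t c1 = vld1q_f32(C + 1 * ldc);
        float32x4_t c2 = vld1q_f32(C + 2 * ldc);
        float32x4_t c3 = vld1q_f32(C + 3 * ldc);

        for (int k = 0; k < K; ++k) {
            /* Prefetch packed data a few iterations ahead to hide cache-miss latency. */
            __builtin_prefetch(A_packed + 64, 0, 0);
            __builtin_prefetch(B_packed + 64, 0, 0);

            float32x4_t a = vld1q_f32(A_packed);   /* column k of the A panel */
            float32x4_t b = vld1q_f32(B_packed);   /* row k of the B panel    */

            /* Fused multiply-accumulate by broadcast lane: c_i += a[i] * b. */
            c0 = vfmaq_laneq_f32(c0, b, a, 0);
            c1 = vfmaq_laneq_f32(c1, b, a, 1);
            c2 = vfmaq_laneq_f32(c2, b, a, 2);
            c3 = vfmaq_laneq_f32(c3, b, a, 3);

            A_packed += 4;
            B_packed += 4;
        }

        vst1q_f32(C + 0 * ldc, c0);
        vst1q_f32(C + 1 * ldc, c1);
        vst1q_f32(C + 2 * ldc, c2);
        vst1q_f32(C + 3 * ldc, c3);
    }

In a full SGEMM, an outer loop would pack tiles of A and B into this layout and call the micro-kernel over the matrix, which is where the slender and continuous small-matrix cases differ from the square case.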
First-principle nonlocal projector potential calculation on GPU cluster
FU Jiyun, JIA Weile, CAO Zongyan, WANG Long, YE Huang, CHI Xuebin
Journal of Computer Applications    2013, 33 (06): 1540-1552.   DOI: 10.3724/SP.J.1087.2013.01540
Plane Wave Pseudopotential (PWP) Density Functional Theory (DFT) calculation is the most widely used method for material simulation. The projector calculation plays an important part in the self-consistent iterative solution of PWP-DFT, and it often becomes a hindrance to the speed-up of the software. Therefore, according to the features of the Graphics Processing Unit (GPU), a speed-up algorithm was proposed: 1) using a new parallel mechanism to solve the potential energy of the nonlocal projectors; 2) redesigning the data distribution structure; 3) reducing the use of memory; 4) proposing a solution to the related data problems of the algorithm. An 18-57 times acceleration was eventually obtained, reaching 12 seconds per step in molecular dynamics simulation. In this paper, the testing time of running this model on the GPU platform was analysed in detail, and the calculation bottleneck of deploying this method onto GPU clusters was discussed.
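For orientation only, a plain-C reference sketch (not the authors' GPU code) of the core nonlocal-projector energy evaluation: the overlaps between projector coefficient vectors and wavefunction coefficients are dense inner products, and storing all projectors contiguously so these overlaps become a small matrix product is what makes the work map well onto GPU matrix units. The array layout, names, and the real-valued, diagonal-coefficient simplification are illustrative assumptions, not details from the paper.

    #include <stddef.h>

    /* overlap[p][n] = sum_g proj[p][g] * psi[n][g]
     * proj: n_proj x n_pw projector coefficients, packed contiguously
     * psi : n_band x n_pw wavefunction coefficients
     * E_nl = sum_n sum_p d[p] * overlap[p][n]^2  (diagonal coefficients d) */
    double nonlocal_energy(size_t n_proj, size_t n_band, size_t n_pw,
                           const double *proj, const double *psi,
                           const double *d)
    {
        double energy = 0.0;
        for (size_t p = 0; p < n_proj; ++p) {
            for (size_t n = 0; n < n_band; ++n) {
                double overlap = 0.0;
                for (size_t g = 0; g < n_pw; ++g)
                    overlap += proj[p * n_pw + g] * psi[n * n_pw + g];
                energy += d[p] * overlap * overlap;
            }
        }
        return energy;
    }

On a GPU the two outer loops collapse into a single dense matrix-matrix product over the packed projector block, which is the reformulation that gives the kind of acceleration the abstract reports.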