[1] AMD. AMD Core Math Library (ACML)[EB/OL].[2018-09-12]. http://developer.amd.com/acml.jsp. [2] FILIPPONE S. The IBM parallel engineering and scientific subroutine library[C]//Proceedings of the 1995 International Workshop on Applied Parallel Computing, LNCS 1041. Berlin:Springer, 1995:199-206. [3] QUINTANA-ORTI E S, IGUAL F D, CASTILLO M, et al. Evaluation and tuning of the level 3 CUBLAS for graphics processors[C]//Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing Symposium. Piscataway, NJ:IEEE, 2008:1-8. [4] GOTO K, van der GEIJN R A. Anatomy of high-performance matrix multiplication[J]. ACM Transactions on Mathematical Software, 2008, 34(3):Article No. 12. [5] 蒋孟奇,张云泉,宋刚,等.GOTOBLAS一般矩阵乘法高效实现机制的研究[J].计算机工程,2008,34(7):84-86,103.(JIANG M Q, ZHANG Y Q, SONG G, et al. Research on high performance implementation mechanism of GOTOBLAS general matrix-matrix multiplication[J]. Computer Engineering, 2008, 34(7):84-86, 103.) [6] 张先轶,王茜,张云泉.OpenBLAS:龙芯3A CPU的高性能BLAS库[J].软件学报,2011,22(增刊2):208-216.(ZHANG X Y, WANG Q, ZHANG Y Q. OpenBLAS:a high performance BLAS library on Loongson 3A CPU[J]. Journal of Software, 2011, 22(Suppl. 2):208-216.) [7] CHEN B, WANG L, WU Q, et al. Cross hardware-software boundary exploration for scalable and optimized deep learning platform design[J]. IEEE Embedded Systems Letters, 2018, 10(4):107-110. [8] LIN I, JEFF B, RICKARD I. ARM platform for performance and power efficiency - hardware and software perspectives[C]//Proceedings of the 2016 International Symposium on VLSI Design, Automation and Test. Piscataway, NJ:ⅡEEE, 2016:1-5. [9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st Conference on Neural Information Processing Systems. North Miami Beach, FL:Curran Associates Inc., 2017:5998-6008. [10] WANG F, JIANG H, ZUO K, et al. Design and implementation of a highly efficient DGEMM for 64-bit ARMv8 multi-core processors[C]//Proceedings of the 44th International Conference on Parallel Processing. Piscataway, NJ:IEEE, 2015:200-209. [11] RUSITORU R. ARMv8 micro-architectural design space exploration for high performance computing using fractional factorial[C]//Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems. New York:ACM, 2015:Article No. 8. [12] FLUR S, GRAY K E, PULTE C, et al. Modelling the ARMv8 architecture, operationally:concurrency and ISA[C]//Proceedings of the 43rd Annual ACM SIGPLAN Symposium on Principles of Programming Languages. New York:ACM, 2016:608-621. [13] LIU Z, JARVINEN K, LIU W, et al. Multiprecision multiplication on ARMv8[C]//Proceedings of the IEEE 24th Symposium on Computer Arithmetic. Piscataway, NJ:IEEE, 2017:10-17. [14] XU X, CLARKE C T, JONES S R. High performance code compression architecture for the embedded ARM/ThUMB processor[C]//Proceedings of the 1st Conference on Computing Frontiers. New York:ACM, 2004:451-456. [15] 姜浩,杜琦,郭敏,等.面向ARMv864位多核处理器的QGEMM设计与实现[J].计算机学报,2017,40(9):2018-2029.(JIANG H, DU Q, GUO M, et al. Design and implementation of QGEMM on ARMv864-bit multi-core processor[J]. Chinese Journal of Computers, 2017, 40(9):2018-2029.) |