Method for exploiting function level vectorization on simple instruction multiple data extensions

doi:10.11772/j.issn.1001-9081.2017.08.2200

Journal of Computer Applications, 2017, Vol. 37, Issue (8): 2200-2208. DOI: 10.11772/j.issn.1001-9081.2017.08.2200

LI Yingying^1,2, GAO Wei^1,2, GAO Yuchen^1,2, ZHAI Shengwei^3, LI Pengyuan^4

1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou Henan 450002, China;
2. PLA Information Engineering University, Zhengzhou Henan 450002, China;
3. The 27th Research Institute, China Electronics Technology Group Corporation, Zhengzhou Henan 450047, China;
4. Beijing Institute of Tracking and Telecommunications Technology, Beijing 100094, China

Received: 2016-12-29 Revised: 2017-03-21 Online: 2017-08-12 Published: 2017-08-10

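Corresponding author: GAO Yuchen
About the authors: LI Yingying (1984-), female, born in Zhengzhou, Henan, lecturer, M.S.; main research interest: advanced compilation techniques. GAO Wei (1988-), male, born in Qiqihar, Heilongjiang, Ph.D. candidate; main research interests: high-performance computing and advanced compilation techniques. GAO Yuchen (1988-), male, born in Zhengzhou, Henan, M.S. candidate; main research interest: advanced compilation techniques. ZHAI Shengwei (1982-), male, born in Zhengzhou, Henan, engineer, M.S.; main research interest: advanced computing. LI Pengyuan (1989-), male, born in Jiaozuo, Henan, researcher, M.S.; main research interest: high-performance computing.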

Abstract

Abstract: Two classes of vectorization methods currently target Single Instruction Multiple Data (SIMD) extensions: the loop-based method and the Superword Level Parallelism (SLP) method. To address the problem that current compilers cannot perform function level vectorization, a function level vectorization method based on static single assignment was proposed. First, the variable properties of the program were analyzed. Then a set of compiler directives, including SIMD function annotations, uniform clauses, and linear clauses, was used to guide the compiler in performing function level vectorization. Finally, the vectorized code was optimized using the results of the variable property analysis. Test cases from the fields of multimedia and image processing were selected to evaluate the functionality and performance of the proposed method on the Sunway platform. Compared with scalar execution, the programs ran more efficiently after function level vectorization. The experimental results show that function level vectorization can achieve a speedup similar to that of task level parallelism, and that the method can guide the implementation of automatic function level vectorization.

Key words: Single Instruction Multiple Data (SIMD) extension, parallelism, function level vectorization, compiler directive, static single assignment
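
The directive-based approach described in the abstract can be illustrated with a minimal sketch. The paper's own directive syntax for the Sunway compiler is not shown on this page, so the example below assumes the closest standard analogue, OpenMP's `declare simd` with `uniform` and `linear` clauses. The function names `blend_pixel` and `blend_image` are hypothetical and chosen only to resemble an image processing kernel.

```c
#include <stddef.h>

/* Hypothetical per-pixel kernel (not from the paper).
 * "declare simd" asks the compiler to also generate a vector variant
 * of this function that processes several values of i per call.
 *   uniform(src, dst, alpha): these arguments hold the same value in
 *                             every SIMD lane.
 *   linear(i:1):              i grows by 1 from one lane to the next,
 *                             so src[i] and dst[i] are contiguous. */
#pragma omp declare simd uniform(src, dst, alpha) linear(i:1)
float blend_pixel(const float *src, const float *dst, float alpha, int i)
{
    return alpha * src[i] + (1.0f - alpha) * dst[i];
}

/* Caller: the simd loop lets the compiler replace the scalar call
 * with the vector variant of blend_pixel. */
void blend_image(const float *src, float *dst, float alpha, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        dst[i] = blend_pixel(src, dst, alpha, i);
}
```

With GCC or Clang, the pragmas take effect when compiling with `-fopenmp-simd`; without that flag they are ignored and the code runs as ordinary scalar C, which makes it easy to compare the two versions. Marking the pointers and the weight as uniform and the index as linear is what allows unit-stride vector loads instead of per-lane gathers.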

LI Yingying, GAO Wei, GAO Yuchen, ZHAI Shengwei, LI Pengyuan. Method for exploiting function level vectorization on simple instruction multiple data extensions[J]. Journal of Computer Applications, 2017, 37(8): 2200-2208.

李颖颖, 高伟, 高雨辰, 翟胜伟, 李朋远. 发掘函数级单指令多数据向量化的方法[J]. 计算机应用, 2017, 37(8): 2200-2208.
