Compilation optimizations for inconsistent control flow on deep computer unit

Journal of Computer Applications, 2023, Vol. 43, Issue (10): 3170-3177. DOI: 10.11772/j.issn.1001-9081.2022091338

Special topic: Advanced computing

Xiaoyi YANG (杨小艺)¹,², Rongcai ZHAO (赵荣彩)¹,², Hongsheng WANG (王洪生)², Lin HAN (韩林)¹,², Kunkun XU (徐坤坤)¹,²

1. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, Henan 450001, China
2. National Supercomputing Center in Zhengzhou, Zhengzhou, Henan 450001, China

Received: 2022-09-12 Revised: 2022-12-21 Accepted: 2022-12-25 Online: 2023-02-28 Published: 2023-10-10
Corresponding author: Hongsheng WANG (whs1814@foxmail.com)
About the authors:
YANG Xiaoyi, born in 1997, female, from Nanyang, Henan, M. S. candidate. Her research interests include advanced compilation technology.
ZHAO Rongcai, born in 1957, male, from Luoyang, Henan, Ph. D., professor, CCF member. His research interests include advanced compilation technology and high-performance computing.
WANG Hongsheng, born in 1994, from Binzhou, Shandong, M. S. Research interests include advanced compilation technology and high-performance computing.
HAN Lin, born in 1978, male, from Linyi, Shandong, Ph. D., associate professor, CCF member. His research interests include advanced compilation technology and high-performance computing.
XU Kunkun, born in 1991, male, from Xinyang, Henan, M. S. candidate. His research interests include advanced compilation technology.
Supported by:
Major Science and Technology Special Project of Henan Province (221100210600)

Abstract:

The domestic DCU (Deep Computer Unit) adopts the Single Instruction Multiple Thread (SIMT) parallel execution model. During program execution, inconsistent control flow arises inside kernel functions, so that some of the threads in a warp can only be executed serially; this is warp divergence. To address the problem that kernel performance is severely restricted by warp divergence, a compilation optimization method that reduces warp divergence time, Partial-Control-Flow-Merging (PCFM), was proposed. Firstly, divergence analysis was performed to find fusible divergent regions that are isomorphic and contain a large number of identical and similar instructions. Then, the fusion profit of each fusible divergent region was evaluated by counting the percentage of instruction cycles saved after merging. Finally, an alignment sequence was searched for, and the profitable fusible divergent regions were merged. Test cases selected from the Graphics Processing Unit (GPU) benchmark suite Rodinia and from classic sorting algorithms were used to test PCFM on DCU. Experimental results show that PCFM achieves an average speedup ratio of 1.146 on the test cases, and that its speedup is on average 5.72% higher than that of the branch fusion + tail merging method. The proposed method is therefore more effective at reducing warp divergence.

Key words: Deep Computer Unit (DCU), Single Instruction Multiple Thread (SIMT), warp divergence, complex control flow, compilation optimization
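
The profit evaluation and alignment steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the opcode names, the per-opcode latencies in LATENCY, the fusion threshold of 0.2, and the use of a longest-common-subsequence alignment (standing in for whichever alignment algorithm PCFM actually uses) are all assumptions made for illustration. The sketch assumes that a diverged warp executes both branch paths one after the other, so every aligned instruction pair that can be merged saves one instruction's worth of cycles.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-opcode latencies in cycles (real DCU latencies are not given here).
LATENCY: Dict[str, int] = {"load": 4, "store": 4, "add": 1, "mul": 2, "cmp": 1}

# Hypothetical fusion threshold: merge only if at least 20% of the cycles are saved.
THRESHOLD = 0.2

Pair = Tuple[Optional[str], Optional[str]]


def align(a: List[str], b: List[str]) -> List[Pair]:
    """Align two opcode sequences using a longest-common-subsequence DP table."""
    n, m = len(a), len(b)
    # lcs[i][j] is the LCS length of the suffixes a[i:] and b[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the front to recover aligned and unaligned instructions.
    pairs: List[Pair] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((a[i], b[j]))  # aligned: can be merged into one instruction
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            pairs.append((a[i], None))  # only on the first path
            i += 1
        else:
            pairs.append((None, b[j]))  # only on the second path
            j += 1
    pairs.extend((op, None) for op in a[i:])
    pairs.extend((None, op) for op in b[j:])
    return pairs


def fusion_profit(a: List[str], b: List[str]) -> float:
    """Fraction of serialized instruction cycles saved by merging aligned pairs."""
    before = sum(LATENCY[op] for op in a) + sum(LATENCY[op] for op in b)
    saved = sum(LATENCY[x] for x, y in align(a, b) if x is not None and y is not None)
    return saved / before if before else 0.0


# Two made-up divergent paths of an if/else region.
then_path = ["load", "mul", "add", "store"]
else_path = ["load", "add", "add", "store"]

alignment = align(then_path, else_path)
profit = fusion_profit(then_path, else_path)
should_merge = profit >= THRESHOLD
print(alignment)
print(f"profit = {profit:.3f}, merge = {should_merge}")
```

In this toy case the two paths share load, add and store, so 9 of the 21 serialized cycles are saved and the region would be merged under the assumed threshold. A real cost model would also have to account for the select or branch instructions that merging introduces for the unaligned instructions.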

CLC number: TP314

Cite this article:

Xiaoyi YANG, Rongcai ZHAO, Hongsheng WANG, Lin HAN, Kunkun XU. Compilation optimizations for inconsistent control flow on deep computer unit[J]. Journal of Computer Applications, 2023, 43(10): 3170-3177.

Figures and tables (16)

Fig. 1 Execution example of inconsistent control flow

Fig. 2 Example of branch fusion

Fig. 3 Example of tail merging

Fig. 4 Thread hierarchy of DCU

Fig. 5 Traditional compiler architecture and LLVM architecture

Fig. 6 Complex control flow examples

Fig. 7 Merging examples

Fig. 8 Flow of divergence analysis

Fig. 9 Instruction alignment algorithm

Fig. 10 Merging aligned instructions of complex control flow example 1

Fig. 11 Merging unaligned instructions of complex control flow example 1

Fig. 12 Exit block merging

Fig. 13 Application of the core algorithm of PCFM

Tab. 1 Speedup ratios of different algorithms

Test case	Block size	Speedup ratio (branch fusion + tail merging)	Speedup ratio (PCFM)
BIT	32	1.005	1.100
BIT	64	0.990	1.360
BIT	128	1.010	1.270
BIT	256	1.003	1.210
SRAD	16	1.050	1.170
SRAD	32	1.040	1.090
MS	32	1.250	1.260
MS	64	1.250	1.260
MS	128	1.250	1.260
MS	256	1.240	1.250
NQU	64	0.980	1.010
NQU	128	1.020	1.070
NQU	256	1.000	1.020
PCM	32	1.290	1.290
PCM	64	1.210	1.210
PCM	128	1.220	1.220
PCM	256	1.008	1.008
LUD	16	0.980	1.100
LUD	32	1.010	1.080
LUD	64	1.110	1.130
DCT	4	1.000	1.000
DCT	8	1.000	1.000
DCT	16	1.010	1.000
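
As a quick consistency check, the average speedup quoted in the abstract can be recomputed from Tab. 1. The sketch below assumes the average is a plain arithmetic mean over all 23 test case and block size combinations; the page does not say how the averages were aggregated, so this is an assumption, and the 5.72% figure is not reproduced here for the same reason. The variable names are illustrative.

```python
from statistics import mean

# (test case, block size, branch fusion + tail merging, PCFM), copied from Tab. 1.
ROWS = [
    ("BIT", 32, 1.005, 1.100), ("BIT", 64, 0.990, 1.360),
    ("BIT", 128, 1.010, 1.270), ("BIT", 256, 1.003, 1.210),
    ("SRAD", 16, 1.050, 1.170), ("SRAD", 32, 1.040, 1.090),
    ("MS", 32, 1.250, 1.260), ("MS", 64, 1.250, 1.260),
    ("MS", 128, 1.250, 1.260), ("MS", 256, 1.240, 1.250),
    ("NQU", 64, 0.980, 1.010), ("NQU", 128, 1.020, 1.070),
    ("NQU", 256, 1.000, 1.020),
    ("PCM", 32, 1.290, 1.290), ("PCM", 64, 1.210, 1.210),
    ("PCM", 128, 1.220, 1.220), ("PCM", 256, 1.008, 1.008),
    ("LUD", 16, 0.980, 1.100), ("LUD", 32, 1.010, 1.080),
    ("LUD", 64, 1.110, 1.130),
    ("DCT", 4, 1.000, 1.000), ("DCT", 8, 1.000, 1.000),
    ("DCT", 16, 1.010, 1.000),
]

# Arithmetic mean of each speedup column over all configurations.
baseline_mean = mean(row[2] for row in ROWS)
pcfm_mean = mean(row[3] for row in ROWS)

print(f"PCFM mean speedup: {pcfm_mean:.3f}")  # prints "PCFM mean speedup: 1.146"
print(f"Baseline mean speedup: {baseline_mean:.3f}")
```

Under this assumption the PCFM column averages to 1.146, which matches the figure stated in the abstract.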

Tab. 2 Speedup ratios under different fusion thresholds

Test case	Threshold 0.1	Threshold 0.2	Threshold 0.3	Threshold 0.4	Threshold 0.5
BIT	1.37	1.36	1.36	1.34	1.34
SRAD	1.17	1.14	1.10	1.14	1.05
MS	1.27	1.25	1.25	1.26	1.26
NQU	1.07	1.07	1.07	1.04	1.02
PCM	1.33	1.33	1.33	1.33	1.27
LUD	1.10	1.10	1.15	1.10	1.05
DCT	1.00	1.01	1.00	1.00	1.00
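
One way to read Tab. 2 is to average the speedup over the seven test cases for each fusion threshold. The sketch below does this with an unweighted mean, which is an assumption (the page does not say how, or whether, the authors aggregated these values), and the names are illustrative.

```python
from statistics import mean

THRESHOLDS = [0.1, 0.2, 0.3, 0.4, 0.5]

# Speedup ratios per test case, in the column order of Tab. 2.
SPEEDUPS = {
    "BIT": [1.37, 1.36, 1.36, 1.34, 1.34],
    "SRAD": [1.17, 1.14, 1.10, 1.14, 1.05],
    "MS": [1.27, 1.25, 1.25, 1.26, 1.26],
    "NQU": [1.07, 1.07, 1.07, 1.04, 1.02],
    "PCM": [1.33, 1.33, 1.33, 1.33, 1.27],
    "LUD": [1.10, 1.10, 1.15, 1.10, 1.05],
    "DCT": [1.00, 1.01, 1.00, 1.00, 1.00],
}

# Unweighted mean speedup over all test cases for each threshold.
mean_by_threshold = {
    t: mean(values[k] for values in SPEEDUPS.values())
    for k, t in enumerate(THRESHOLDS)
}
best_threshold = max(mean_by_threshold, key=mean_by_threshold.get)

for t in THRESHOLDS:
    print(f"threshold {t}: mean speedup {mean_by_threshold[t]:.3f}")
print(f"best threshold on average: {best_threshold}")
```

On these numbers the smallest threshold, 0.1, gives the highest mean speedup and 0.5 the lowest, although individual test cases such as LUD peak at a different threshold.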

Tab. 3 Normalized number of shared memory instructions for different algorithms

Test case	PCFM	Branch fusion + tail merging
BIT	0.74	1.00
SRAD	1.00	1.00
MS	1.00	1.00
NQU	1.00	1.00
PCM	0.89	1.00
LUD	1.00	1.00
DCT	0.00	0.00

