Journal of Computer Applications, 2023, Vol. 43, Issue (10): 3170-3177. DOI: 10.11772/j.issn.1001-9081.2022091338

Special topic: Advanced Computing


Compilation optimizations for inconsistent control flow on deep computer unit

Xiaoyi YANG1,2, Rongcai ZHAO1,2, Hongsheng WANG2, Lin HAN1,2, Kunkun XU1,2

  1. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou Henan 450001, China
  2. National Supercomputing Center in Zhengzhou, Zhengzhou Henan 450001, China
  • Received: 2022-09-12 Revised: 2022-12-21 Accepted: 2022-12-25 Online: 2023-02-28 Published: 2023-10-10
  • Contact: Hongsheng WANG (whs1814@foxmail.com)
  • About author: YANG Xiaoyi, born in 1997 in Nanyang, Henan, M. S. candidate. Her research interests include advanced compilation technology.
    ZHAO Rongcai, born in 1957 in Luoyang, Henan, Ph. D., professor, CCF member. His research interests include advanced compilation technology and high-performance computing.
    WANG Hongsheng, born in 1994 in Binzhou, Shandong, M. S. Research interests include advanced compilation technology and high-performance computing.
    HAN Lin, born in 1978 in Linyi, Shandong, Ph. D., associate professor, CCF member. His research interests include advanced compilation technology and high-performance computing.
    XU Kunkun, born in 1991 in Xinyang, Henan, M. S. candidate. His research interests include advanced compilation technology.
  • Supported by:
    Major Science and Technology Special Project of Henan Province (221100210600)

Abstract:

The domestic DCU (Deep Computer Unit) adopts the Single Instruction Multiple Thread (SIMT) parallel execution model. During program execution, inconsistent control flow arises inside kernel functions, so that some of the threads in a warp can only be executed serially; this is warp divergence. To address the problem that kernel performance is severely restricted by warp divergence, a compilation optimization method that reduces the time spent in warp divergence, Partial-Control-Flow-Merging (PCFM), was proposed. Firstly, divergence analysis was performed to find fusible divergent regions that are isomorphic and contain a large number of identical and similar instructions. Then, the fusion profit of each fusible divergent region was evaluated by calculating the percentage of instruction cycles saved after merging. Finally, alignment sequences were searched for, and the profitable fusible divergent regions were merged. Test cases selected from the Graphics Processing Unit (GPU) benchmark suite Rodinia and from classic sorting algorithms were used to test PCFM on DCU. Experimental results show that PCFM achieves an average speedup ratio of 1.146 on the test cases, and that its speedup is 5.72% higher on average than that of the branch fusion + tail merging method. The proposed method is therefore more effective at reducing warp divergence.
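
The profitability step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the opcode names, the `LATENCY` table, and the helper names `matched_opcodes` and `saved_cycle_percentage` are all hypothetical, and the alignment is a plain longest-common-subsequence match standing in for whatever alignment the paper actually uses. The sketch assumes that a divergent warp executes both paths one after the other, and that each aligned pair of identical instructions is issued only once after merging.

```python
from typing import Dict, List

# Hypothetical per-opcode latencies in cycles (illustrative values, not DCU data).
LATENCY: Dict[str, int] = {"load": 16, "store": 16, "add": 4, "sub": 4, "mul": 4}


def matched_opcodes(a: List[str], b: List[str]) -> List[str]:
    """Opcodes shared by both paths, in order (longest common subsequence)."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one optimal alignment.
    matched: List[str] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            matched.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def saved_cycle_percentage(then_path: List[str], else_path: List[str]) -> float:
    """Percentage of cycles saved if aligned identical instructions are merged."""
    # Assumption: under divergence the warp runs both paths serially.
    before = sum(LATENCY[op] for op in then_path + else_path)
    # Assumption: each aligned pair costs one instruction instead of two.
    saved = sum(LATENCY[op] for op in matched_opcodes(then_path, else_path))
    return 100.0 * saved / before if before else 0.0


then_path = ["load", "mul", "add", "store"]
else_path = ["load", "mul", "sub", "store"]
print(matched_opcodes(then_path, else_path))         # ['load', 'mul', 'store']
print(saved_cycle_percentage(then_path, else_path))  # 45.0
```

In this toy case the two paths differ only in `add` versus `sub`, so 36 of the 80 serial cycles are shared and the estimated saving is 45%. A compiler pass would compare such an estimate against a threshold before merging; the abstract does not state that threshold, so none is assumed here.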

Key words: Deep Computer Unit (DCU), Single Instruction Multiple Thread (SIMT), warp divergence, complex control flow, compilation optimization

CLC number: