Journal of Computer Applications (《计算机应用》), 2025, Vol. 45, Issue (8): 2409-2420. DOI: 10.11772/j.issn.1001-9081.2024081140

• 2024 National Annual Conference on Open Distributed and Parallel Computing (DPCS 2024) •

Evaluation of training efficiency and training performance of graph neural network models in distributed environments

Yinchuan TU, Yong GUO, Heng MAO, Yi REN, Jianfeng ZHANG, Bao LI

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2024-08-14 Revised: 2024-09-14 Accepted: 2024-09-23 Online: 2024-09-25 Published: 2025-08-10
  • Contact: Yi REN
  • About author: TU Yinchuan, born in 1996, M.S. candidate. His research interests include graph neural networks, cloud computing, and distributed machine learning.
    GUO Yong, born in 1988, Ph.D., assistant research fellow. His research interests include distributed computing and graph computing.
    MAO Heng, born in 1996, M.S. candidate. His research interests include graph neural networks, distributed machine learning, and edge intelligence.
    ZHANG Jianfeng, born in 1984, Ph.D., associate research fellow, CCF member. His research interests include operating systems, cloud computing, network security, and intelligent processing.
    LI Bao, born in 1982, Ph.D., associate research fellow, CCF member. His research interests include cloud-edge collaborative computing, operating systems, and intelligent processing.
  • Supported by:
    Youth Program of Natural Science Foundation of Hunan(2021JJ40678)

Abstract:

With the rapid growth of graph data scale, Graph Neural Networks (GNNs) face computational and storage challenges when processing large-scale graph-structured data. Traditional single-machine training methods are no longer sufficient for increasingly large datasets and complex GNN models, and distributed training, with its parallel computing power and scalability, has become an effective way to address these problems. However, on the one hand, existing evaluations of distributed GNN training focus mainly on performance metrics represented by model accuracy and efficiency metrics represented by training time, while paying less attention to metrics of data processing efficiency and computational resource utilization; on the other hand, algorithm efficiency has mainly been evaluated on a single machine with one or multiple GPUs, and the existing evaluation methods carry over to distributed environments in a relatively simplistic way. To address these shortcomings, an evaluation method for model training in distributed scenarios was proposed, covering three aspects: evaluation metrics, datasets, and models. Following this method, three representative GNN models were selected, and distributed training experiments were conducted on four large public graph datasets with different data characteristics; the resulting evaluation metrics were collected and analyzed. Experimental results show that model architecture and data structure characteristics in distributed training influence model complexity, training time, computing node throughput, and the computing Node Average Throughput Ratio (NATR); that sample processing and data copying account for a large share of training time, while the time computing nodes spend waiting for one another is also non-negligible; and that, compared with single-machine training, distributed training significantly reduces computing node throughput, so resource utilization in distributed systems needs further optimization. The proposed evaluation method thus provides a reference for optimizing the training performance of GNN models in distributed environments and lays an experimental foundation for further model optimization and algorithm improvement.
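To make these metrics concrete, the following minimal Python sketch (ours, not taken from the paper) shows how per-node throughput, NATR, and per-phase time shares might be computed from training logs. It assumes NATR is defined as a node's throughput divided by the mean throughput across all computing nodes, as the abstract's wording suggests, and all node counts, timings, and phase names below are hypothetical.

# Minimal sketch of the throughput-oriented metrics discussed above.
# Assumption (not from the paper): NATR(i) = throughput(i) / mean
# throughput over all computing nodes; logs and phase names are hypothetical.
from statistics import mean

def throughput(samples: int, elapsed_s: float) -> float:
    """Samples processed per second on one computing node."""
    return samples / elapsed_s

def natr(throughputs: list[float]) -> list[float]:
    """Each node's throughput relative to the average across all nodes."""
    avg = mean(throughputs)
    return [t / avg for t in throughputs]

def phase_shares(phase_times: dict[str, float]) -> dict[str, float]:
    """Fraction of an epoch spent in each phase (sampling, copy, compute, wait)."""
    total = sum(phase_times.values())
    return {phase: t / total for phase, t in phase_times.items()}

# Hypothetical per-node logs from a 4-node run: (samples, seconds).
logs = [(120_000, 60.0), (118_000, 60.0), (90_000, 60.0), (121_000, 60.0)]
tps = [throughput(s, t) for s, t in logs]
print([round(r, 2) for r in natr(tps)])  # ratios < 1.0 flag stragglers

# Hypothetical phase breakdown of one node's epoch (seconds).
print(phase_shares({"sampling": 21.0, "copy": 14.0, "compute": 17.0, "wait": 8.0}))

In this toy run the third node's NATR is about 0.80, the kind of straggler that makes the inter-node waiting time noted in the abstract non-negligible.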

Key words: model evaluation, Graph Neural Network (GNN), distributed training, training efficiency, training performance
