Evaluation of training efficiency and training performance of graph neural network models based on distributed environment
Yinchuan TU, Yong GUO, Heng MAO, Yi REN, Jianfeng ZHANG, Bao LI
Journal of Computer Applications, 2025, 45(8): 2409-2420. DOI: 10.11772/j.issn.1001-9081.2024081140

With the rapid growth of graph data, Graph Neural Networks (GNNs) face computational and storage challenges when processing large-scale graph-structured data, and traditional stand-alone training can no longer keep up with ever-larger datasets and increasingly complex GNN models. Distributed training, with its parallel computing power and scalability, is an effective way to address these problems. However, existing evaluations of distributed GNN training focus mainly on performance metrics such as model accuracy and efficiency metrics such as training time, while paying little attention to data-processing efficiency and computational resource utilization; moreover, algorithm efficiency is usually evaluated in single-machine, single-GPU or single-machine, multi-GPU settings, and the existing evaluation methods for distributed environments remain relatively simple. To address these shortcomings, an evaluation method for model training in distributed scenarios was proposed, covering three aspects: evaluation metrics, datasets, and models. Following this method, three representative GNN models were selected, and distributed training experiments were conducted on four large open graph datasets with different data characteristics; the resulting evaluation metrics were collected and analyzed. Experimental results show that model complexity, training time, computing-node throughput, and the computing Node Average Throughput Ratio (NATR) are all influenced by model architecture and data structure characteristics in distributed training; that sampling and data copying account for a large share of training time, and the time a computing node spends waiting for its peers is also non-negligible; and that, compared with stand-alone training, distributed training reduces computing-node throughput significantly, so resource utilization in distributed systems needs further optimization. The proposed evaluation method thus provides a reference for optimizing GNN training performance in a distributed environment and lays an experimental foundation for further model optimization and algorithm improvement.
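The abstract refers to per-phase timing (sampling, data copying, compute, inter-node waiting), computing-node throughput, and NATR, but gives no formulas. Below is a minimal Python sketch of how such measurements could be instrumented; the phase boundaries, the stand-in workload, and the NATR definition used here (a node's distributed throughput relative to its stand-alone throughput) are assumptions for illustration, not the paper's implementation.

```python
import time

def train_epoch(batches, sample_fn, copy_fn, compute_fn, sync_fn):
    """Time one training epoch, split into the phases the abstract
    highlights: sampling, data copying, compute, and waiting on peers."""
    phase_time = {"sample": 0.0, "copy": 0.0, "compute": 0.0, "wait": 0.0}
    n_examples = 0
    for batch in batches:
        t0 = time.perf_counter()
        mini = sample_fn(batch)          # e.g. neighbor sampling
        t1 = time.perf_counter()
        dev = copy_fn(mini)              # e.g. host-to-device copy
        t2 = time.perf_counter()
        compute_fn(dev)                  # forward/backward/update
        t3 = time.perf_counter()
        sync_fn()                        # barrier: wait for other nodes
        t4 = time.perf_counter()
        phase_time["sample"] += t1 - t0
        phase_time["copy"] += t2 - t1
        phase_time["compute"] += t3 - t2
        phase_time["wait"] += t4 - t3
        n_examples += len(mini)
    total = sum(phase_time.values())
    return phase_time, n_examples / total  # per-node throughput (samples/s)

def natr(distributed_throughput, standalone_throughput):
    # Assumed reading of NATR: a node's throughput under distributed
    # training relative to the same model's stand-alone throughput.
    return distributed_throughput / standalone_throughput

# Toy usage with stand-in phase functions (no real GNN or cluster needed).
batches = [list(range(64)) for _ in range(10)]
times, tput = train_epoch(
    batches,
    sample_fn=lambda b: b,
    copy_fn=lambda m: m,
    compute_fn=lambda d: sum(d),
    sync_fn=lambda: None,
)
print(times, f"{tput:.0f} samples/s")
```

Instrumenting each phase separately is what lets the time spent on sampling, data copying, and waiting on peer nodes be reported alongside throughput, as in the abstract's findings.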
