Journal of Computer Applications, 2025, Vol. 45, Issue (9): 2913-2918. DOI: 10.11772/j.issn.1001-9081.2024091299

• Advanced computing

PCIe bus transmission bandwidth optimization in embedded heterogeneous intelligent computing system

Xubang YU1, Jiwen WU2, Hong XIA1, Hao MO1, Erhu ZHAO2

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
  2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2024-09-12 Revised: 2024-12-23 Accepted: 2025-01-07 Online: 2025-03-19 Published: 2025-09-10
  • Contact: Jiwen WU
  • About the authors: YU Xubang, born in 1998, M. S. His research interests include computer system architecture and embedded systems.
    XIA Hong, born in 1965, Ph. D., associate professor. His research interests include computer system architecture and embedded systems.
    MO Hao, born in 2000, M. S. candidate. His research interests include embedded systems and deep learning.
    ZHAO Erhu, born in 1985, Ph. D., senior engineer. His research interests include embedded intelligent computing systems.
  • Supported by:
    Fundamental Research Funds for the Central Universities (2023JC007); Open Fund of Hangzhou Innovation Institute of Beihang University Qianjiang Laboratory (2020-Y8-A-023); Technical Support Talent Program of Chinese Academy of Sciences

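Chinese title (translated): PCIe bus transmission bandwidth optimization for embedded heterogeneous intelligent computing systems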

喻绪邦1, 吴济文2, 夏宏1, 莫昊1, 赵二虎2

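  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206
  2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
  • Corresponding author: 吴济文 (Jiwen WU)
  • About the authors: 喻绪邦 (YU Xubang), born in 1998, male, from Nanchang, Jiangxi, M. S. Main research interests: computer system architecture, embedded systems.
    夏宏 (XIA Hong), born in 1965, male, from Beijing, associate professor, Ph. D., CCF member. Main research interests: computer system architecture, embedded systems.
    莫昊 (MO Hao), born in 2000, male, from Changsha, Hunan, M. S. candidate. Main research interests: embedded systems, deep learning.
    赵二虎 (ZHAO Erhu), born in 1985, male, from Xingtai, Hebei, senior engineer, Ph. D., CCF member. Main research interests: embedded intelligent computing systems.
  • Supported by:
    Fundamental Research Funds for the Central Universities (2023JC007); Technical Support Talent Program of Chinese Academy of Sciences; Open Fund of Hangzhou Innovation Institute of Beihang University Qianjiang Laboratory (2020-Y8-A-023)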

Abstract:

In recent years, with the development of Artificial Intelligence (AI) technology, deep learning algorithms and specialized AI processor chips have been increasingly applied to edge-side and device-side data and signal processing systems. A key technical challenge is achieving high-bandwidth, low-latency data transmission between heterogeneous processors while giving the system strong intelligent computing capability. To address this, an embedded heterogeneous intelligent computing system was designed that integrates a Cambricon MLU220 chip, a domestic Feiteng FT2000/4 CPU, and a Xilinx XC7K325T Field Programmable Gate Array (FPGA). The heterogeneous processors are interconnected and exchange data at high speed over the PCIe (Peripheral Component Interconnect express) bus. In addition, a Scatter-Gather DMA (Direct Memory Access) transmission optimization technique for the PCIe bus under Linux was proposed; it improves the PCIe data transmission bandwidth between the CPU and the FPGA through double-buffering-based prefetching and work-queue-based interrupt handling. Image transmission tests show that, when 10 grayscale images of 2 048×1 024 pixels are transferred between the CPU and the FPGA over a PCIe 2.0 x4 bus, the proposed system achieves read and write speeds of 1 610 MB/s and 1 655 MB/s, respectively, in dual-channel DMA mode, which correspond to 81% and 83% of the theoretical PCIe 2.0 x4 bus bandwidth. These results verify the practicality and advanced nature of the designed system.
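
The reported percentages can be checked with a short calculation. The sketch below assumes that the theoretical PCIe 2.0 x4 bandwidth is taken as 2 000 MB/s per direction (5 GT/s per lane, 8b/10b line coding, four lanes, packet overhead ignored) and that the grayscale images use 8 bits per pixel; the abstract states neither assumption explicitly, and the function names are illustrative only.

```python
# Sanity check of the bandwidth figures quoted in the abstract.
# Assumptions (not stated in the abstract): the theoretical value ignores
# TLP/DLLP packet overhead, and each grayscale pixel occupies one byte.

TRANSFERS_PER_S_PER_LANE = 5.0e9   # PCIe 2.0 signalling rate: 5 GT/s per lane
ENCODING_EFFICIENCY = 8 / 10       # 8b/10b line coding used by PCIe 2.0
LANES = 4                          # x4 link


def theoretical_bandwidth_mb_s(lanes: int = LANES) -> float:
    """Per-direction link bandwidth in MB/s (1 MB = 10**6 bytes)."""
    bits_per_s = TRANSFERS_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes
    return bits_per_s / 8 / 1e6


def efficiency(measured_mb_s: float, lanes: int = LANES) -> float:
    """Measured throughput as a fraction of the theoretical bandwidth."""
    return measured_mb_s / theoretical_bandwidth_mb_s(lanes)


def payload_bytes(images: int = 10, width: int = 2048, height: int = 1024,
                  bytes_per_pixel: int = 1) -> int:
    """Size of the test payload; bytes_per_pixel=1 is an assumption."""
    return images * width * height * bytes_per_pixel


if __name__ == "__main__":
    print(f"theoretical: {theoretical_bandwidth_mb_s():.0f} MB/s")
    print(f"read:  {efficiency(1610):.2%}")
    print(f"write: {efficiency(1655):.2%}")
    print(f"payload: {payload_bytes()} bytes")
```

Under these assumptions the read and write speeds come to about 80.5% and 82.75% of the theoretical value, in line with the rounded 81% and 83% in the abstract.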

Key words: Peripheral Component Interconnect express (PCIe) bus, heterogeneous computing system, Scatter-Gather DMA (Direct Memory Access), multi-channel DMA, image transmission

Abstract (translated from the Chinese version):

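In recent years, with the development of artificial intelligence (AI) technology, deep learning algorithms and dedicated AI processor chips have been applied ever more widely in edge-side and device-side data and signal processing systems. Achieving high-bandwidth, low-latency data transmission between heterogeneous processors while equipping the system with strong intelligent computing capability has become one of the core technical problems in urgent need of a solution. Therefore, an embedded heterogeneous intelligent computing system is designed that integrates a Cambricon MLU220 chip, a domestic Feiteng FT2000/4 CPU, and a Xilinx XC7K325T Field Programmable Gate Array (FPGA); the heterogeneous processors of the system use the PCIe (Peripheral Component Interconnect express) bus for high-speed interconnection and data transmission. In addition, a PCIe bus Scatter-Gather DMA (Direct Memory Access) transmission optimization technique under Linux is proposed, which effectively improves the PCIe bus data transmission bandwidth between the CPU and FPGA heterogeneous processors through a prefetch technique based on double buffering and interrupt handling based on work queues. System image transmission test results show that, when 10 grayscale images of 2 048×1 024 are transferred over the PCIe2.0 X4 bus between the CPU and FPGA heterogeneous processors, the read and write rates of the proposed system on dual DMA channels reach 1 610 MB/s and 1 655 MB/s, respectively, which are 81% and 83% of the theoretical PCIe2.0 X4 bus bandwidth, verifying the practicality and advanced nature of the designed system.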
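
The abstract names double-buffered prefetching but gives no implementation details. The following is a user-space analogy only, not the authors' Linux driver: a producer thread stands in for the DMA engine filling one buffer while the consumer processes the other. The function `double_buffered_transfer` and its parameters are hypothetical names introduced for illustration.

```python
import queue
import threading


def double_buffered_transfer(chunks, process, buffer_count=2):
    """Overlap 'fetching' the next chunk with 'processing' the current one.

    A fixed pool of buffers (two by default) circulates between a producer,
    which stands in for a DMA engine, and the caller, which consumes data.
    This is an illustrative analogy, not kernel driver code.
    """
    free = queue.Queue()     # buffers ready to be filled
    filled = queue.Queue()   # buffers holding data, in arrival order
    for _ in range(buffer_count):
        free.put(bytearray())

    def producer():
        for chunk in chunks:
            buf = free.get()      # block until a buffer has been released
            buf[:] = chunk        # stands in for a DMA transfer into the buffer
            filled.put(buf)
        filled.put(None)          # sentinel: no more data

    # daemon=True so a failing process() cannot leave the program hanging
    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        buf = filled.get()
        if buf is None:
            break
        results.append(process(bytes(buf)))
        free.put(buf)             # hand the buffer back for the next prefetch
    worker.join()
    return results


if __name__ == "__main__":
    sizes = double_buffered_transfer([b"a", b"bc", b"def"], len)
    print(sizes)
```

With two buffers the producer can stay at most one chunk ahead of the consumer, which is the overlap that double buffering is meant to provide; a larger `buffer_count` would deepen the prefetch at the cost of more memory.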

Key words: PCIe bus, heterogeneous computing system, Scatter-Gather DMA, multi-channel DMA, image transmission

CLC Number: