Design and implementation of component-based development framework for deep learning applications

doi:10.11772/j.issn.1001-9081.2023020213

Journal of Computer Applications, 2024, Vol. 44, Issue 2: 526-535.

Computer software technology


Xiang LIU, Bei HUA, Fei LIN, Hongyuan WEI

College of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China

Received: 2023-03-03; Revised: 2023-04-19; Accepted: 2023-05-05; Online: 2023-08-14; Published: 2024-02-10
Contact: Bei HUA
About the authors:
LIU Xiang, born in 1999, native of Suzhou, Anhui, M.S. candidate. His research interests include high-performance computing systems.
LIN Fei, born in 1994, native of Hefei, Anhui, M.S. candidate. His research interests include high-performance computing systems.
WEI Hongyuan, born in 1996, native of Xinxiang, Henan, M.S. candidate. Her research interests include high-performance computing systems.
Supported by: National Key Research & Development Program of China (2018AAA0101204)


Abstract


To address the lack of effective development and deployment tools for deep learning applications, a component-based development framework for deep learning applications was proposed. The framework splits application functions according to the type of resource they consume, uses an evaluation-guided resource allocation scheme to eliminate bottlenecks, and uses a stepwise bin-packing scheme for function placement that balances high CPU utilization against low GPU memory overhead. A real-time license plate number detection application developed with this framework achieved 82% GPU utilization in throughput-first mode, an average application latency of 0.73 s in latency-first mode, and an average CPU utilization of 68.8% across the three modes (throughput-first, latency-first, and balanced throughput/latency). These results show that the framework supports configurations that balance hardware throughput against application latency: it uses the platform's computing resources efficiently in throughput-first mode and meets the application's low-latency requirements in latency-first mode. Compared with MediaPipe, the framework enabled the development of a faster-than-real-time multi-person pose estimation application, with the detection frame rate improved by up to 1077%. Overall, the experimental results show that the framework is an effective solution for developing and deploying deep learning applications on CPU-GPU heterogeneous servers.
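The abstract names the bottleneck-elimination step without describing it. As a rough illustration of the general idea only, and not the authors' scheme, the sketch below assumes that each pipeline stage has a measured per-item service time and that adding a worker to a stage scales its throughput linearly. The names `Stage`, `eliminate_bottlenecks`, and `pipeline_throughput`, and the stage timings in the usage example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline component with a measured per-item service time (hypothetical model)."""
    name: str
    service_time: float  # seconds per item for a single worker, taken from a measurement run
    workers: int = 1

    @property
    def throughput(self) -> float:
        # Assumption: throughput grows linearly with the number of workers.
        return self.workers / self.service_time


def pipeline_throughput(stages: list[Stage]) -> float:
    """A pipeline runs only as fast as its slowest stage."""
    return min(s.throughput for s in stages)


def eliminate_bottlenecks(stages: list[Stage], budget: int) -> list[Stage]:
    """Greedily hand each spare worker to whichever stage is currently slowest."""
    for _ in range(budget):
        bottleneck = min(stages, key=lambda s: s.throughput)
        bottleneck.workers += 1
    return stages


if __name__ == "__main__":
    # Made-up timings for a three-stage video pipeline.
    stages = [Stage("decode", 0.02), Stage("preprocess", 0.01), Stage("infer", 0.005)]
    eliminate_bottlenecks(stages, budget=3)
    print({s.name: s.workers for s in stages})
    print(pipeline_throughput(stages))
```

A real deployment would re-measure after each allocation rather than trust the linear model, since contention for CPU cores or GPU memory can make an extra worker less effective than the model predicts.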

Key words: deep learning application, development framework, Component-Based Development (CBD), pipeline deployment, CPU-GPU heterogeneity


CLC Number: TP311.5

Xiang LIU, Bei HUA, Fei LIN, Hongyuan WEI. Design and implementation of component-based development framework for deep learning applications[J]. Journal of Computer Applications, 2024, 44(2): 526-535.


Figures/Tables 18

Fig.1 Hierarchical organization of component library

Fig.2 Component splitting of YOLO model inference process

Fig.3 Schematic diagram of pipeline bottleneck

Fig.4 Comparison of inference speedup ratios under different batch_sizes

Fig.5 Three-layer architecture of proposed framework

Fig.6 Schematic diagram of application development API

Fig.7 Rendering of license plate number recognition

Fig.8 DAG of license plate number recognition application

Fig.9 Key code for inference version on CPU

Fig.10 Key code for inference version on GPU

Fig.11 Application development using Markdown

Fig.12 Key code for vehicle counting application

Fig.13 Key code for pure manual deployment application

Fig.14 Time consumption statistics of each stage

Fig.15 DAG of license plate number recognition application with four-way video input

Tab.1 Throughput and latency under different scales

Scale (input streams)	CPU cores	GPUs	CPU utilization/%	GPU utilization/%	Average latency/s	Latency standard deviation/s

Fig.16 Effect of multi-person pose estimation

Tab.2 Comparison with MediaPipe on development cost and operating efficiency

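Application	Framework	Mode	Lines of code	Frame rate/fps
Single-person pose estimation	MediaPipe	—	62	30.50
	Proposed framework	Latency-first	76	50.00
	Proposed framework	Throughput-first	76	400.00
Multi-person pose estimation	MediaPipe	—	67	0.92
	Proposed framework	Latency-first	86	30.00
	Proposed framework	Throughput-first	86	100.00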

