Lightweight human pose estimation based on merge state space model

doi:10.11772/j.issn.1001-9081.2024091351

Journal of Computer Applications, 2025, Vol. 45, Issue (10): 3179-3186. DOI: 10.11772/j.issn.1001-9081.2024091351

• Artificial intelligence •

Zhuoran LI¹, Hua LI¹, Tong WANG², Chaozhe JIANG²

¹ School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China
² School of Transportation and Logistics, Southwest Jiaotong University, Chengdu Sichuan 611756, China

Received: 2024-09-23 Revised: 2024-11-27 Accepted: 2024-12-02 Online: 2024-12-20 Published: 2025-10-10
Contact: Hua LI
About author: LI Zhuoran, born in 2000, M.S. candidate. His research interests include computer vision and human pose estimation.
LI Hua, born in 1979, Ph.D., lecturer. His research interests include computer vision and deep learning.
WANG Tong, born in 1995, M.S. candidate. His research interests include machine learning and image processing.
JIANG Chaozhe, born in 1968, Ph.D., professor. His research interests include big data, artificial intelligence, and advanced manufacturing.
Supported by:
Jade Kirin Science and Innovation Fund (2019H010362)

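Chinese title: 基于融合特征状态空间模型的轻量化人体姿态估计

Chinese author names: 李卓然¹, 李华¹, 王桐², 蒋朝哲²

Corresponding author: LI Hua (李华), Email: hli8@swjtu.edu.cn
Author details from the Chinese record: LI Zhuoran (born 2000), male, from Dazhou, Sichuan, M.S. candidate, CCF member. Research interests: computer vision, human pose estimation.
LI Hua (born 1979), male, from Chengdu, Sichuan, lecturer, Ph.D. Research interests: computer vision, deep learning.
WANG Tong (born 1995), male, from Shijiazhuang, Hebei, M.S. candidate. Research interests: machine learning, image processing.
JIANG Chaozhe (born 1968), male, from Dazhou, Sichuan, professor, Ph.D. Research interests: big data, artificial intelligence, advanced manufacturing.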

Abstract

In the field of Human Pose Estimation (HPE), heatmap-based methods suffer from large quantization errors, high computational complexity, and the need for heatmap post-processing. To address these issues, a lightweight HPE model based on a Merge State Space Model (MSSM), named Lite-SimCC, was proposed, with the coordinate regression method SimCC as the baseline. Firstly, ShuffleNet V2 was adopted as the backbone network in place of the original High-Resolution Net (HRNet), which simplified the network to a single-branch structure and made the model lightweight. Secondly, to reduce the loss of precision, a large kernel convolution was introduced to extract global feature information. Thirdly, an MSSM was designed to process both local and global long sequence features, so as to enhance the representational ability of the key points. Finally, a soft-label based loss function was proposed to replace the traditional one-hot loss calculation. Experimental results show that, compared with the baseline SimCC, Lite-SimCC has 87.1% fewer parameters and improves the Average Precision (AP) by 1.4% on the COCO2017 test set. Results on the MPII dataset further confirm that Lite-SimCC effectively reduces the number of model parameters while maintaining detection precision.
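
As an illustration of why a SimCC-style baseline needs no heatmap post-processing, the following is a minimal sketch of coordinate-classification decoding: each keypoint gets one score vector over horizontal bins and one over vertical bins, and the coordinate is read off with an argmax. The function name `decode_simcc`, the splitting factor `k = 2`, and the toy shapes are assumptions made for this example; they are not taken from this page and do not describe the Lite-SimCC implementation.

```python
import numpy as np

def decode_simcc(logits_x, logits_y, k=2.0):
    """Decode 1D classification scores into (x, y) pixel coordinates.

    logits_x: shape (num_keypoints, W * k), scores over horizontal bins
    logits_y: shape (num_keypoints, H * k), scores over vertical bins
    k: assumed splitting factor (bins per pixel), hypothetical value
    """
    x_bins = np.argmax(logits_x, axis=1)  # best horizontal bin per keypoint
    y_bins = np.argmax(logits_y, axis=1)  # best vertical bin per keypoint
    # Dividing by k maps bin indices back to (sub-pixel) image coordinates.
    return np.stack([x_bins / k, y_bins / k], axis=1)

# Toy input: 2 keypoints on a 4 x 3 (W x H) image with k = 2.
logits_x = np.zeros((2, 8))
logits_y = np.zeros((2, 6))
logits_x[0, 5] = 1.0
logits_y[0, 3] = 1.0
logits_x[1, 2] = 1.0
logits_y[1, 4] = 1.0

coords = decode_simcc(logits_x, logits_y, k=2.0)
print(coords)
```

With more than one bin per pixel, the decoded coordinate can fall between pixel centers, which is how this kind of representation reduces quantization error relative to a coarse heatmap.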

Key words: Human Pose Estimation (HPE), coordinate regression, State Space Model (SSM), lightweight, soft-label

CLC Number:

TP389.1

Zhuoran LI, Hua LI, Tong WANG, Chaozhe JIANG. Lightweight human pose estimation based on merge state space model[J]. Journal of Computer Applications, 2025, 45(10): 3179-3186.

Figures/Tables 11

Fig. 1 Structure of Lite-SimCC

Fig. 2 Structures of four kinds of branched HRNet

Fig. 3 Precision of four HRNet structures

Fig. 4 Components of ShuffleNet and ShuffleNet V2

Fig. 5 Precisions of four large kernel sizes

Fig. 6 Structure of MergeMamba module

Tab. 1 Experimental results of different methods on COCO2017 dataset

Method	Params/10⁶	AP/%	AP^50/%	AP^75/%	AP^M/%	AP^L/%	AR/%
YOLO-Pose	15.1	63.8	87.6	69.6	—	73.1	70.4
KAPAO	12.6	64.4	—	—	—	—	71.5
Lite Pose	2.7	56.8	—	—	—	—	—
LiteDEKR	5.7	70.1	87.9	75.8	71.0	—	—
Lite-HRNet	1.8	67.2	88.0	75.0	64.3	73.1	73.3
HF-HRNet	7.4	70.8	88.9	78.0	67.6	77.3	76.5
EANet	1.9	68.8	88.3	76.9	65.9	74.8	74.8
Light-HRNet	1.8	67.0	70.0	74.6	—	74.4	73.0
SimCC	25.7	70.8	86.4	77.5	66.5	75.5	75.1
HigherHRNet	63.8	70.5	89.3	77.2	66.6	75.8	74.9
DGLNet	1.8	68.4	89.7	76.1	65.9	74.2	73.8
IDPNet	4.2	72.6	91.6	80.4	69.8	76.9	75.4
Lite-SimCC	3.3	71.8	91.7	79.5	69.7	76.0	74.8
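
The headline comparison with SimCC can be checked against the SimCC and Lite-SimCC rows of Tab. 1. The short calculation below is only a sanity check on the rounded table values; the variable names are chosen for this example.

```python
# Values copied from the SimCC and Lite-SimCC rows of Tab. 1.
simcc_params, lite_params = 25.7, 3.3  # parameters, in millions
simcc_ap, lite_ap = 70.8, 71.8         # AP, in percent

# Relative reduction in parameter count.
param_reduction = (simcc_params - lite_params) / simcc_params * 100
# AP gain in absolute percentage points and relative to the baseline.
ap_gain_points = lite_ap - simcc_ap
ap_gain_relative = ap_gain_points / simcc_ap * 100

print(f"parameter reduction: {param_reduction:.1f}%")
print(f"AP gain: {ap_gain_points:.1f} points, {ap_gain_relative:.1f}% relative")
```

From the rounded table entries, the parameter reduction comes out at about 87.2%, close to the 87.1% stated in the abstract (the small gap is presumably due to rounding of the table values). The AP gain is 1.0 percentage point, which matches the reported 1.4% if that figure is read as a relative improvement over the baseline.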

Tab. 2 Experimental results of different methods on MPII dataset

Method	Params/10⁶	Precision/%
		Head	Shoulder	Elbow	Wrist	Hip	Knee	Ankle	Mean
Lite-HRNet	1.8	95.2	93.5	84.7	78.1	86.2	78.9	73.9	85.1
Dite-HRNet	1.8	—	—	—	—	—	—	—	87.6
HF-HRNet	7.4	—	—	—	—	—	—	—	88.5
Light-HRNet	1.8	93.0	—	—	86.4	88.5	84.1	88.4	87.9
SimCC	25.7	96.8	95.9	90.0	85.0	89.1	85.4	81.3	89.6
DGLNet	1.8	—	—	—	—	—	—	—	87.7
Lite-NIRNet	7.7	96.9	90.4	95.8	85.1	89.0	85.7	81.3	89.7
IDPNet	4.2	96.8	95.2	88.7	84.0	88.1	84.1	79.1	88.6
TokenPose-L	23.5	97.2	95.8	90.7	85.9	89.2	86.2	82.3	90.1
Lite-SimCC	3.3	95.5	94.8	90.1	86.6	89.4	85.4	81.2	89.2

Tab. 3 Ablation experimental results of long sequence processing module

Long sequence processing module			Params/10⁶	AP/%	AR/%
Transformer	Mamba	MergeMamba
			1.9	68.5	70.4
√			3.7	69.8	72.5
	√		3.2	70.9	76.1
		√	3.3	71.8	74.8

Tab. 4 Ablation experimental results of loss function

Loss function	$b$	AP/%	AR/%
One-hot based	—	70.6	72.5
Soft-label based	1	70.9	72.8
	2	71.8	74.8
	3	71.6	74.5
	4	71.5	74.2
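
To make the contrast between one-hot and soft-label targets concrete, here is a minimal sketch for a single 1D coordinate classification. This page does not give the paper's soft-label formula or define $b$, so the Laplace-like decay and the scale parameter `b` used below are assumptions for illustration only, as are all function names.

```python
import numpy as np

def one_hot_label(num_bins, target_bin):
    """All probability mass on the ground-truth bin."""
    label = np.zeros(num_bins)
    label[target_bin] = 1.0
    return label

def soft_label(num_bins, target_bin, b=2.0):
    """Assumed soft label: mass decays with distance from the target bin.

    b is a hypothetical scale; larger b spreads mass over more bins.
    """
    dist = np.abs(np.arange(num_bins) - target_bin)
    weights = np.exp(-dist / b)
    return weights / weights.sum()  # normalize to a distribution

def cross_entropy(logits, label):
    """Cross-entropy between a target distribution and softmax(logits)."""
    shifted = logits - logits.max()  # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(label * log_probs).sum())

# Prediction peaks at bin 3, one bin away from the ground truth at bin 2.
logits = np.array([0.1, 0.5, 2.0, 3.0, 0.2, 0.0])
hard = one_hot_label(6, target_bin=2)
soft = soft_label(6, target_bin=2, b=2.0)

loss_hard = cross_entropy(logits, hard)
loss_soft = cross_entropy(logits, soft)
print(loss_hard, loss_soft)
```

Under a one-hot target, every bin other than the ground truth is treated as equally wrong. A soft target gives neighboring bins partial credit, so a prediction that is off by one bin is penalized differently from one that is far away.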

Fig. 7 Visualization results of Lite-SimCC on COCO2017 dataset

