Sign language generation model based on Kolmogorov-Arnold network and diffusion Transformer

Journal of Computer Applications, 2026, Vol. 46, Issue (6): 1801-1810. DOI: 10.11772/j.issn.1001-9081.2025060730

• Artificial intelligence •

Lili HE¹,²,³, Meng CAO¹,²,³, Lei ZHANG¹,²,³, Hongjun PAN³,⁴, Yi LIU¹,²,³, Chengxin SUN⁵

1. College of Information and Electronic Technology, Jiamusi University, Jiamusi Heilongjiang 154007, China
2. Heilongjiang Provincial Key Laboratory of Autonomous Intelligence and Information Processing (Jiamusi University), Jiamusi Heilongjiang 154007, China
3. Jiamusi Key Laboratory of Satellite Navigation Technology and Equipment Engineering Technology, Jiamusi University, Jiamusi Heilongjiang 154007, China
4. Handan Vocational College of Science and Technology, Handan Hebei 056046, China
5. Experimental Training and Equipment Management Center, Jiamusi University, Jiamusi Heilongjiang 154007, China

Received: 2025-07-02 Revised: 2025-08-28 Accepted: 2025-09-02 Online: 2025-09-12 Published: 2026-06-10
Contact: Hongjun PAN
About author: HE Lili, born in 1979, native of Jiamusi, Heilongjiang, Ph. D., professor, CCF member. Her research interests include privacy protection and information security.
CAO Meng, born in 2001, native of Daxing'anling, Heilongjiang, M. S. candidate. His research interests include natural language processing and computer vision.
ZHANG Lei, born in 1982, native of Suihua, Heilongjiang, Ph. D., professor, CCF member. His research interests include information security and privacy protection.
LIU Yi, born in 1979, native of Wangkui, Heilongjiang, M. S., associate professor, CCF member. His research interests include privacy protection and image processing.
SUN Chengxin, born in 1980, native of Jiamusi, Heilongjiang, M. S. Her research interests include information construction and social management.
First contact (corresponding author): PAN Hongjun, born in 1973, native of Daming, Hebei, lecturer. His research interests include natural language processing and image processing.
Supported by:
Scientific Research Project of Fundamental Research Funds for the Heilongjiang Provincial Higher Education Institutions (18KYYWF0941); Research Special Project on Theoretical Course Teaching Reform of Ideological and Political Courses in Colleges and Universities, Heilongjiang Education and Teaching Reform Project (SJGSX2024008); Heilongjiang Provincial Undergraduate College Outstanding Young Teachers Basic Research Support Program (YQJH2024239); Joint Fund Cultivation Project of the Natural Science Foundation of Heilongjiang (PL2024F002); Excellent Innovation Team Construction Project of Fundamental Research Funds for the Heilongjiang Provincial Higher Education Institutions (2022-KYYWF-0654); Key Research Project on Economic and Social Development of Heilongjiang Province (WY2025012); Teaching Reform Project of Jiamusi University (2023JY6-36); "East Pole" Academic Team of Jiamusi University (DJXSTD202417)

Abstract:

To address blurry outputs, loss of detail, and uneven feature distribution caused by insufficient local information extraction in existing sign language generation models, a sign language generation model based on the Kolmogorov-Arnold Network (KAN) and diffusion Transformer, named KDT, was proposed. First, the nonlinear approximation capability of KAN was used to fit complex data distributions, enhancing detail representation and motion fluency between video frames and thereby addressing the blurriness of videos generated by traditional Multilayer Perceptron (MLP) models. Then, Contrast Normalization (ContraNorm) replaced the original normalization to calibrate differences in feature scale, addressing the uneven feature distribution and keeping the model stable under poor data quality and interference. Finally, a diffusion Transformer was employed to refine random noise into the target sequence through multi-step iterative optimization, addressing the detail loss of traditional models. Experimental results on the validation set of the RWTH-Phoenix-2014T continuous sign language dataset show that, compared with the Sign-IDD (Sign-Iconicity Disentangled Diffusion) model, the proposed model improves the BLEU-1 (Bilingual Evaluation Understudy 1-gram) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics by 8.1% and 5.9%, respectively, and reduces the Word Error Rate (WER) by 4.5%. These results verify the effectiveness of the model in enhancing the richness of video details and the fluency of sign language movements.
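The contrast the abstract draws between KAN and MLP layers can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `KANLayerSketch`, the dimensions, and the use of Gaussian radial basis functions in place of the B-splines of the original KAN formulation are all assumptions made for brevity. The point it shows is only that each input-output edge carries its own learnable univariate function, rather than a fixed activation applied after a linear map.

```python
import numpy as np


class KANLayerSketch:
    """Toy KAN-style layer: y_j = sum_i phi_ij(x_i).

    Each edge function phi_ij is a learnable combination of fixed basis
    functions: phi_ij(x) = sum_k coef[i, j, k] * basis_k(x).
    Assumption: Gaussian bumps stand in for the B-spline basis.
    """

    def __init__(self, in_dim, out_dim, num_basis=8, x_min=-1.0, x_max=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Evenly spaced basis centers over the expected input range.
        self.centers = np.linspace(x_min, x_max, num_basis)
        self.width = (x_max - x_min) / (num_basis - 1)
        # One coefficient vector per (input, output) edge.
        self.coef = rng.normal(0.0, 0.1, size=(in_dim, out_dim, num_basis))

    def basis(self, x):
        # x: (batch, in_dim) -> basis activations: (batch, in_dim, num_basis)
        z = (x[..., None] - self.centers) / self.width
        return np.exp(-0.5 * z ** 2)

    def __call__(self, x):
        # Evaluate every edge function and sum over the inputs of each output.
        return np.einsum("bik,iok->bo", self.basis(x), self.coef)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical input: 4 frames, 6 normalized pose coordinates each.
    frames = rng.uniform(-1.0, 1.0, size=(4, 6))
    layer = KANLayerSketch(in_dim=6, out_dim=3)
    print(layer(frames).shape)
```

In a trainable version the coefficients in `coef` would be optimized by gradient descent, which is what lets the shape of each edge function adapt to the data.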

Key words: sign language video generation, machine translation, deep learning, Transformer, sequence modeling

CLC Number: TP391.4

Lili HE, Meng CAO, Lei ZHANG, Hongjun PAN, Yi LIU, Chengxin SUN. Sign language generation model based on Kolmogorov-Arnold network and diffusion Transformer[J]. Journal of Computer Applications, 2026, 46(6): 1801-1810.

Figures/Tables 14

References 34

[1]	ZHANG L, WANG Z Y, LIAN S S, et al. Deep learning-based sign language translation: past, present, and future[J]. Application Research of Computers, 2025, 42(8): 2241-2254. (in Chinese)
[2]	GUO D, TANG S G, HONG R C, et al. Review of sign language recognition, translation and generation[J]. Computer Science, 2021, 48(3): 60-70. (in Chinese)
[3]	YANG X W, ZHANG Z C, KUANG L Q, et al. Key technologies of human-computer interaction based on virtual hand[J]. Journal of Computer Applications, 2015, 35(10): 2945-2949. (in Chinese)
[4]	XUE Y, ZHANG Y X. Survey on deep neural architecture search[J]. Netinfo Security, 2023, 23(9): 58-74. (in Chinese)
[5]	LONG G Y, CHEN Y Q, XING Y B. Text correction and completion method in continuous sign language recognition[J]. Journal of Computer Applications, 2021, 41(3): 694-698. (in Chinese)
[6]	LUO Y, LI D, ZHANG Y. Chinese sign language recognition based on spatial-temporal attention network[J]. Semiconductor Optoelectronics, 2020, 41(3): 414-419. (in Chinese)
[7]	GLAUERT J R W, ELLIOTT R, COX S J, et al. VANESSA: a system for communication between deaf and hearing people[J]. Technology and Disability, 2006, 18(4): 207-216.
[8]	WANG Z Q, GAO W. A method to synthesize Chinese sign language based on virtual human technologies[J]. Journal of Software, 2002, 13(10): 2051-2056. (in Chinese)
[9]	FANG S, CHEN C, WANG L, et al. SignLLM: sign language production large language models[C]// Proceedings of the 2025 IEEE/CVF International Conference on Computer Vision Workshops. Piscataway: IEEE, 2025: 6681-6693.
[10]	SAUNDERS B, CAMGOZ N C, BOWDEN R. Progressive Transformers for end-to-end sign language production[C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12356. Cham: Springer, 2020: 687-705.
[11]	MA X, JIN R, WANG J, et al. Attentional bias for hands: cascade dual-decoder Transformer for sign language production[J]. IET Computer Vision, 2024, 18(5): 696-708.
[12]	XIE P, PENG T, DU Y, et al. Sign language production with latent motion Transformer[C]// Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2024: 3012-3022.
[13]	LIU Z, WANG Y, VAIDYA S, et al. KAN: Kolmogorov-Arnold networks[EB/OL]. [2025-05-09].
[14]	LIU C F, SUN H, DONG H. Molecular amplification time series prediction research combining Transformer with Kolmogorov-Arnold network[J]. Journal of Graphics, 2024, 45(6): 1256-1265. (in Chinese)
[15]	YANG X, WANG X. Kolmogorov-Arnold Transformer[EB/OL]. [2024-09-16].
[16]	GUO X, WANG Y, DU T, et al. ContraNorm: a contrastive learning perspective on oversmoothing and beyond[EB/OL]. [2023-05-02].
[17]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[18]	ZHANG Y, MA C M, LIU S D, et al. Multi-scale feature enhanced Transformer network for efficient semantic segmentation[J]. Opto-Electronic Engineering, 2024, 51(12): No.240237. (in Chinese)
[19]	XING C Y, WANG Z P, ZHANG G M, et al. IoT device identification method based on pre-trained Transformers[J]. Netinfo Security, 2024, 24(8): 1277-1290. (in Chinese)
[20]	KAPOOR P, MUKHOPADHYAY R, HEGDE S B, et al. Towards automatic speech to sign language generation[C]// Proceedings of the INTERSPEECH 2021. [S.l.]: International Speech Communication Association, 2021: 3700-3704.
[21]	HWANG E J, LEE H, PARK J C. A gloss-free approach with discrete representations[C]// Proceedings of the IEEE 18th International Conference on Automatic Face and Gesture Recognition. Piscataway: IEEE, 2024: 1-6.
[22]	YIN A, LI H, SHEN K, et al. T2S-GPT: dynamic vector quantization for autoregressive sign language production from text[C]// Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2024: 3345-3356.
[23]	SAUNDERS B, CAMGOZ N C, BOWDEN R. Mixed signals: sign language production via a mixture of motion primitives[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 1899-1909.
[24]	XIE P, ZHANG Q, PENG T, et al. G2P-DDM: generating sign pose sequence from gloss sequence with discrete diffusion model[C]// Proceedings of the 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 6234-6242.
[25]	MUGHAL M H, DABRAL R, HABIBIE I, et al. ConvoFusion: multi-modal conversational diffusion for co-speech gesture synthesis[C]// Proceedings of the 2024 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2024: 1388-1398.
[26]	CHEN J, LIU Y, WANG J, et al. DiffSHEG: a diffusion-based approach for real-time speech-driven holistic 3D expression and gesture generation[C]// Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2024: 7352-7361.
[27]	TANG S, HE J, GUO D, et al. Sign-IDD: iconicity disentangled diffusion for sign language production[C]// Proceedings of the 39th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2025: 7266-7274.
[28]	MA J, WANG W, YANG Y, et al. MS2SL: multimodal spoken data-driven continuous sign language production[C]// Findings of the Association for Computational Linguistics: ACL 2024. Stroudsburg: ACL, 2024: 7241-7254.
[29]	FANG S, SUI C, ZHOU Y, et al. SignDiff: diffusion model for American sign language production[C]// Proceedings of the IEEE 19th International Conference on Automatic Face and Gesture Recognition. Piscataway: IEEE, 2025: 1-11.
[30]	CAMGOZ N C, HADFIELD S, KOLLER O, et al. Neural sign language translation[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7784-7793.
[31]	DREUW P, RYBACH D, DESELAERS T, et al. Speech recognition techniques for a sign language recognition system[C]// Proceedings of the INTERSPEECH 2007. [S.l.]: International Speech Communication Association, 2007: 2513-2516.
[32]	ZHOU H, ZHOU W, QI W, et al. Improving sign language translation with monolingual data by sign back-translation[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 1316-1325.
[33]	TANG S, XUE F, WU J, et al. Gloss-driven conditional diffusion models for sign language production[J]. ACM Transactions on Multimedia Computing Communications and Applications, 2025, 21(4): No.105.
[34]	SUN J W, ZHANG B, SI N W, et al. Lightweight malicious traffic detection method based on knowledge distillation[J]. Netinfo Security, 2025, 25(6): 859-871. (in Chinese)

Dataset	Evaluation metric scores
Dataset	BLEU-1	BLEU-4	ROUGE	FID	WER
RWTH-Phoenix-2014T validation set	26.14	8.65	26.34	2.05	74.24
RWTH-Phoenix-2014T test set	25.98	8.79	26.91	2.29	75.04
CSL-Daily validation set	51.41	23.41	48.14	1.35	42.51
CSL-Daily test set	50.18	21.83	45.72	1.29	42.12

Model	Scores on the RWTH-BOSTON-104 dataset
Model	BLEU-1	BLEU-4	ROUGE	FID	WER
PT [10]	5.53	2.17	4.74	0.72	32.13
KDT	13.53	4.76	12.51	0.64	25.62

Model	Validation set scores					Test set scores
Model	BLEU-1	BLEU-4	ROUGE	FID	WER	BLEU-1	BLEU-4	ROUGE	FID	WER
PT (Baseline)	11.62	3.76	11.74	2.85	98.53	12.12	3.54	10.16	3.22	98.36
PT+KAN	14.42	4.38	15.35	2.78	86.42	13.91	4.45	14.87	2.69	83.62
KDT	26.14	8.65	26.34	2.05	74.24	25.98	8.79	26.91	2.29	75.04
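
As a reading aid for the ablation rows above, the following Python sketch computes the absolute and relative change of each validation-set metric between two models. The scores are copied from the table; the helper names `relative_change` and `compare` are hypothetical. These baseline-versus-KDT changes are a separate comparison from the Sign-IDD figures quoted in the abstract.

```python
# Validation-set scores copied from the ablation table.
METRICS = ("BLEU-1", "BLEU-4", "ROUGE", "FID", "WER")
VALIDATION = {
    "PT (Baseline)": (11.62, 3.76, 11.74, 2.85, 98.53),
    "PT+KAN": (14.42, 4.38, 15.35, 2.78, 86.42),
    "KDT": (26.14, 8.65, 26.34, 2.05, 74.24),
}


def relative_change(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100.0


def compare(base, other):
    """Map each metric to (absolute change, relative change in percent)."""
    return {
        name: (new - old, relative_change(old, new))
        for name, old, new in zip(METRICS, VALIDATION[base], VALIDATION[other])
    }


if __name__ == "__main__":
    # Positive is better for BLEU and ROUGE; negative is better for FID and WER.
    for name, (delta, pct) in compare("PT (Baseline)", "KDT").items():
        print(f"{name}: {delta:+.2f} ({pct:+.1f}%)")
```

Reporting both forms matters here because a change in points and a change in percent of the baseline can differ widely when the baseline score is small.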
