Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (4): 1113-1119.DOI: 10.11772/j.issn.1001-9081.2024040550
• Artificial intelligence •
					
Haitao SUN1, Jiayu LIN2, Zuhong LIANG1,3, Jie GUO1
Received: 2024-04-30
Revised: 2024-08-14
Accepted: 2024-08-16
Online: 2025-04-08
Published: 2025-04-10
Contact: Jiayu LIN
About author: SUN Haitao, born in 1999 in Changde, Hunan, M. S. candidate, CCF member. His research interests include data augmentation and data mining.
Haitao SUN, Jiayu LIN, Zuhong LIANG, Jie GUO. Data augmentation technique incorporating label confusion for Chinese text classification[J]. Journal of Computer Applications, 2025, 45(4): 1113-1119.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024040550
| Sample type | Sample content |
|---|---|
| Original sample | 两天价网站背后重重迷雾:做个网站究竟要多少钱 |
| Augmented sample 1 | 两:天价网站背后重重迷雾,,做个网站究竟要多少钱。 |
| Augmented sample 2 | 两!天价网站背后重!重迷雾:做个网站: 究竟要多少钱? |
| Augmented sample 3 | 两天价网。站背后重重迷,雾,;做个网站究竟要多少。 钱? |
| Augmented sample 4 | 两。天价网站背后重重迷雾:,做个?站究, 竟要多少钱? |
Tab. 1 Examples after text augmentation
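The augmented samples in Tab. 1 are produced by randomly inserting punctuation marks into the original sentence, in the style of AEDA. A minimal sketch, assuming insertion happens after each character with probability `INSERT_PROB`; the exact punctuation pool used in the paper is an assumption here:

```python
import random

# Assumed Chinese punctuation pool; the paper's exact set may differ.
PUNCTUATIONS = ["。", "，", "：", "；", "？", "！"]

def augment(text, insert_prob=0.2, seed=None):
    """After each character, insert a random punctuation mark with
    probability insert_prob (0.2 was the best value in Tab. 5)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < insert_prob:
            out.append(rng.choice(PUNCTUATIONS))
    return "".join(out)

print(augment("两天价网站背后重重迷雾:做个网站究竟要多少钱"))
```

Deleting the inserted punctuation always recovers the original sentence, so the augmentation preserves the label by construction.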
| Misclassified sample | Correct class | Predicted class |
|---|---|---|
| 三联书店建起书香巷 | 科技 (technology) | 教育 (education) |
| Google多项功能前晚集中“瘫痪” | 科技 (technology) | 社会 (society) |
| 借款纠纷牵出房产商伪造公文开发楼盘案 | 社会 (society) | 房地产 (real estate) |
Tab. 2 Examples of misclassified samples
| Dataset | Training samples | Validation samples | Test samples | Label classes |
|---|---|---|---|---|
| THUCNews | 180 000 | 10 000 | 10 000 | 10 |
| 50-THU | 500 | 10 000 | 10 000 | 10 |
| 200-THU | 2 000 | 10 000 | 10 000 | 10 |
| 500-THU | 5 000 | 10 000 | 10 000 | 10 |
| Toutiao | 130 000 | 10 000 | 10 000 | 13 |
| 50-Toutiao | 650 | 10 000 | 10 000 | 13 |
| 200-Toutiao | 2 600 | 10 000 | 10 000 | 13 |
| 500-Toutiao | 6 500 | 10 000 | 10 000 | 13 |
Tab. 3 Datasets used in experiments
| Ground truth | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
Tab. 4 Confusion matrix of classification results
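The counts in Tab. 4 define the standard classification metrics. A small helper (not from the paper) computing accuracy, precision, recall, and F1 from the four counts:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

print(metrics(80, 10, 20, 90))
```

The paper reports accuracy; the other three follow from the same table for completeness.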
| INSERT_PROB | Accuracy/% | INSERT_PROB | Accuracy/% |
|---|---|---|---|
| 0.1 | 83.99 | 0.4 | 82.69 |
| 0.2 | 84.26 | 0.5 | 83.11 |
| 0.3 | 83.69 | | |
Tab. 5 Experimental results of different INSERT_PROB values
| Augmented sentences per sample | Accuracy on 50-THU/% | Accuracy on 200-THU/% | Accuracy on 500-THU/% |
|---|---|---|---|
| 0 | 62.30 | 73.12 | 80.27 |
| 1 | 66.47 | 76.48 | 81.63 |
| 2 | 67.28 | 77.87 | 81.20 |
| 4 | 69.17 | 77.48 | 80.64 |
Tab. 6 Experimental results of different augmentation scales
| Smoothing state | Label vector |
|---|---|
| Before label smoothing | [1,0,0,0,0,0,0,0,0,0] |
| After label smoothing | [0.91,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01] |
Tab. 7 Labels before and after applying label smoothing technique
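The smoothed vector in Tab. 7 is ordinary uniform label smoothing with ε = 0.1 over K = 10 classes: the true class receives 1 − ε + ε/K = 0.91 and every other class receives ε/K = 0.01. A minimal sketch:

```python
def smooth_labels(onehot, epsilon=0.1):
    """Uniform label smoothing: y_smooth = (1 - eps) * y_onehot + eps / K."""
    k = len(onehot)
    return [(1 - epsilon) * y + epsilon / k for y in onehot]

out = smooth_labels([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(out)  # first entry ≈ 0.91, remaining nine ≈ 0.01, matching Tab. 7
```

The smoothed vector still sums to 1, so it remains a valid target distribution for cross-entropy training.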
| Model | Accuracy on 50-THU/% | Accuracy on 200-THU/% | Accuracy on 500-THU/% |
|---|---|---|---|
| BERT | 81.05 | 87.21 | 88.91 |
| BERT + text augmentation + LS | 83.27 | 87.74 | 89.28 |
| BERT + LCDA | 84.26 | 88.61 | 89.81 |
Tab. 8 Comparison experimental results of label confusion and label smoothing
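Unlike the fixed smoothing of Tab. 7, the label-confusion part of LCDA follows Guo et al.'s label confusion model: the similarity between the instance representation and learned label embeddings gives a confusion distribution, which is mixed with the scaled one-hot label and renormalized into a simulated label distribution. A toy sketch with plain lists; in the real model the label embeddings are learned jointly with the classifier, and the dot-product similarity and `alpha` value here are assumptions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def simulated_label_distribution(text_repr, label_embs, true_idx, alpha=4.0):
    """Confusion distribution from instance-label similarity, mixed with
    the one-hot label scaled by alpha, then renormalized."""
    sims = [sum(t * e for t, e in zip(text_repr, emb)) for emb in label_embs]
    confusion = softmax(sims)
    mixed = [alpha * (i == true_idx) + c for i, c in enumerate(confusion)]
    return softmax(mixed)
```

The resulting target keeps most mass on the true class but assigns more probability to labels whose embeddings are close to the instance, which is how confusable classes such as those in Tab. 2 get softened targets.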
| Model | 50-THU | 200-THU | 500-THU | 50-Toutiao | 200-Toutiao | 500-Toutiao |
|---|---|---|---|---|---|---|
| TextCNN | 72.08 | 80.49 | 82.57 | 54.18 | 68.03 | 73.89 |
| TextCNN+AEDA | 72.55 | 79.99 | 82.32 | 55.24 | 68.23 | 74.25 |
| TextCNN+softEDA | 72.49 | 80.51 | 82.73 | 56.06 | 69.24 | 74.33 |
| TextCNN+LCDA | 73.27 | 80.89 | 82.84 | 57.13 | 69.22 | 74.47 |
| TextRNN | 62.30 | 73.12 | 80.27 | 41.76 | 60.95 | 68.15 |
| TextRNN+AEDA | 64.88 | 74.02 | 79.57 | 48.18 | 62.28 | 69.78 |
| TextRNN+softEDA | 61.55 | 74.45 | 78.72 | 44.58 | 61.72 | 70.22 |
| TextRNN+LCDA | 69.17 | 77.48 | 80.64 | 50.32 | 65.03 | 70.64 |
| BERT | 81.05 | 87.18 | 88.91 | 76.08 | 79.97 | 83.11 |
| BERT+AEDA | 81.31 | 87.21 | 87.52 | 76.69 | 80.35 | 82.43 |
| BERT+softEDA | 82.51 | 88.48 | 89.63 | 77.12 | 81.93 | 82.91 |
| BERT+LCDA | 84.26 | 88.61 | 89.81 | 78.98 | 83.15 | 84.33 |
| RoBERTa-CNN | 84.82 | 87.80 | 90.12 | 78.52 | 81.33 | 82.75 |
| RoBERTa-CNN+AEDA | 82.06 | 86.36 | 90.29 | 77.28 | 81.32 | 82.60 |
| RoBERTa-CNN+softEDA | 86.43 | 88.55 | 90.72 | 80.69 | 83.21 | 84.00 |
| RoBERTa-CNN+LCDA | 87.71 | 89.19 | 91.32 | 81.15 | 83.38 | 84.47 |
Tab. 9 Comparison experimental results of different data augmentation methods (accuracy/%)
| Model | Accuracy/% | Model | Accuracy/% |
|---|---|---|---|
| TextRNN | 62.30 | BERT | 81.05 |
| TextRNN + text augmentation | 65.88 | BERT + text augmentation | 83.25 |
| TextRNN + label confusion | 68.87 | BERT + label confusion | 83.17 |
| TextRNN + LCDA | 69.17 | BERT + LCDA | 84.26 |
Tab. 10 Ablation experimental results on 50-THU dataset