Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (4): 1113-1119. DOI: 10.11772/j.issn.1001-9081.2024040550
• Artificial intelligence •
Data augmentation technique incorporating label confusion for Chinese text classification
Haitao SUN1, Jiayu LIN2, Zuhong LIANG1,3, Jie GUO1
Received: 2024-04-30
Revised: 2024-08-14
Accepted: 2024-08-16
Online: 2025-04-08
Published: 2025-04-10
Contact: Jiayu LIN
About author: SUN Haitao, born in 1999 in Changde, Hunan, M. S. candidate and CCF member. His research interests include data augmentation and data mining.
Haitao SUN, Jiayu LIN, Zuhong LIANG, Jie GUO. Data augmentation technique incorporating label confusion for Chinese text classification[J]. Journal of Computer Applications, 2025, 45(4): 1113-1119.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024040550
| Sample type | Sample content |
|---|---|
| Original sample | 两天价网站背后重重迷雾:做个网站究竟要多少钱 |
| Augmented sample 1 | 两:天价网站背后重重迷雾,,做个网站究竟要多少钱。 |
| Augmented sample 2 | 两!天价网站背后重!重迷雾:做个网站: 究竟要多少钱? |
| Augmented sample 3 | 两天价网。站背后重重迷,雾,;做个网站究竟要多少。 钱? |
| Augmented sample 4 | 两。天价网站背后重重迷雾:,做个?站究, 竟要多少钱? |

Tab. 1 Examples after text augmentation (the Chinese sample sentences are kept verbatim, since the augmentation operates on them directly)
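The augmented samples in Tab. 1 are produced by randomly inserting punctuation marks into the original sentence. Below is a minimal sketch of this kind of augmentation, assuming AEDA-style per-position insertion governed by the INSERT_PROB hyperparameter tuned in Tab. 5; the punctuation set and function name are illustrative, not the authors' exact implementation.

```python
import random

# Full-width punctuation marks assumed for Chinese text; the exact set
# used in the paper is not reproduced here.
PUNCTUATIONS = ["。", ",", "!", "?", ":", ";"]

def augment(sentence: str, insert_prob: float = 0.2) -> str:
    """Randomly insert punctuation between characters.

    Each position independently receives a random mark with probability
    `insert_prob` (the INSERT_PROB hyperparameter of Tab. 5), mimicking
    the augmented samples shown in Tab. 1.
    """
    out = []
    for ch in sentence:
        out.append(ch)
        if random.random() < insert_prob:
            out.append(random.choice(PUNCTUATIONS))
    return "".join(out)

if __name__ == "__main__":
    random.seed(42)
    print(augment("两天价网站背后重重迷雾:做个网站究竟要多少钱"))
```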
| Misclassified sample | Correct class | Wrong class |
|---|---|---|
| 三联书店建起书香巷 | Technology | Education |
| Google多项功能前晚集中“瘫痪” | Technology | Society |
| 借款纠纷牵出房产商伪造公文开发楼盘案 | Society | Real estate |

Tab. 2 Examples of misclassified samples
| Dataset | Training samples | Validation samples | Test samples | Label classes |
|---|---|---|---|---|
| THUCNews | 180 000 | 10 000 | 10 000 | 10 |
| 50-THU | 500 | 10 000 | 10 000 | 10 |
| 200-THU | 2 000 | 10 000 | 10 000 | 10 |
| 500-THU | 5 000 | 10 000 | 10 000 | 10 |
| Toutiao | 130 000 | 10 000 | 10 000 | 13 |
| 50-Toutiao | 650 | 10 000 | 10 000 | 13 |
| 200-Toutiao | 2 600 | 10 000 | 10 000 | 13 |
| 500-Toutiao | 6 500 | 10 000 | 10 000 | 13 |

Tab. 3 Datasets used in experiments
| Ground truth | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |

Tab. 4 Confusion matrix of classification results
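The accuracy figures reported in the following tables follow directly from this matrix; a small helper with the standard definition (the function name is chosen here for illustration):

```python
def accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Accuracy = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)
```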
| INSERT_PROB | Accuracy/% | INSERT_PROB | Accuracy/% |
|---|---|---|---|
| 0.1 | 83.99 | 0.4 | 82.69 |
| 0.2 | 84.26 | 0.5 | 83.11 |
| 0.3 | 83.69 | | |

Tab. 5 Experimental results of different INSERT_PROB values
| Augmented sentences per sample | 50-THU | 200-THU | 500-THU |
|---|---|---|---|
| 0 | 62.30 | 73.12 | 80.27 |
| 1 | 66.47 | 76.48 | 81.63 |
| 2 | 67.28 | 77.87 | 81.20 |
| 4 | 69.17 | 77.48 | 80.64 |

Tab. 6 Experimental results of different augmentation scales (accuracy on THUCNews subsets/%)
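As a sketch of how the augmentation scale in Tab. 6 could be applied, the snippet below adds k augmented copies of every sample, reusing the hypothetical `augment` helper from the sketch after Tab. 1; the function name and dataset layout are illustrative assumptions.

```python
def expand(samples: list[tuple[str, int]], k: int) -> list[tuple[str, int]]:
    """Return the dataset plus k augmented copies of each (text, label) pair.

    k = 0 leaves the dataset unchanged (first row of Tab. 6); `augment` is
    the hypothetical punctuation-insertion helper sketched earlier.
    """
    out = list(samples)
    for text, label in samples:
        out.extend((augment(text), label) for _ in range(k))
    return out
```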
| Smoothing state | Label vector |
|---|---|
| Before label smoothing | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| After label smoothing | [0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01] |

Tab. 7 Labels before and after applying label smoothing technique
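Tab. 7 matches the standard label-smoothing formula y' = (1 − ε)·y + ε/K with ε = 0.1 and K = 10 classes: the true class receives 0.9 + 0.01 = 0.91, every other class 0.01. A minimal sketch, with ε inferred from the table:

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Label smoothing: y' = (1 - eps) * y + eps / K over K classes."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

# With K = 10 and eps = 0.1 this reproduces the vectors in Tab. 7:
print(smooth_labels(np.eye(10)[0]))  # [0.91 0.01 0.01 ... 0.01]
```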
| Model | 50-THU | 200-THU | 500-THU |
|---|---|---|---|
| BERT | 81.05 | 87.21 | 88.91 |
| BERT+text augmentation+LS | 83.27 | 87.74 | 89.28 |
| BERT+LCDA | 84.26 | 88.61 | 89.81 |

Tab. 8 Comparison experimental results of label confusion and label smoothing (accuracy on THUCNews subsets/%)
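The label-confusion component that LCDA builds on replaces the one-hot (or smoothed) target with a simulated label distribution that depends on the input. Below is a minimal sketch in the spirit of the Label Confusion Model of Guo et al. [19]; the dot-product similarity, the mixing weight `alpha`, and all names are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def simulated_label(instance_repr: np.ndarray,
                    label_embeds: np.ndarray,
                    one_hot: np.ndarray,
                    alpha: float = 4.0) -> np.ndarray:
    """Blend the one-hot target with an input-dependent confusion distribution.

    label_embeds has shape (K, d), instance_repr shape (d,). Labels whose
    embeddings are similar to the input receive probability mass, so
    confusable classes (cf. Tab. 2) are no longer penalized as equally wrong.
    """
    confusion = softmax(label_embeds @ instance_repr)  # (K,) label confusion distribution
    return softmax(alpha * one_hot + confusion)        # simulated label distribution
```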
| Model | 50-THU | 200-THU | 500-THU | 50-Toutiao | 200-Toutiao | 500-Toutiao |
|---|---|---|---|---|---|---|
| TextCNN | 72.08 | 80.49 | 82.57 | 54.18 | 68.03 | 73.89 |
| TextCNN+AEDA | 72.55 | 79.99 | 82.32 | 55.24 | 68.23 | 74.25 |
| TextCNN+softEDA | 72.49 | 80.51 | 82.73 | 56.06 | 69.24 | 74.33 |
| TextCNN+LCDA | 73.27 | 80.89 | 82.84 | 57.13 | 69.22 | 74.47 |
| TextRNN | 62.30 | 73.12 | 80.27 | 41.76 | 60.95 | 68.15 |
| TextRNN+AEDA | 64.88 | 74.02 | 79.57 | 48.18 | 62.28 | 69.78 |
| TextRNN+softEDA | 61.55 | 74.45 | 78.72 | 44.58 | 61.72 | 70.22 |
| TextRNN+LCDA | 69.17 | 77.48 | 80.64 | 50.32 | 65.03 | 70.64 |
| BERT | 81.05 | 87.18 | 88.91 | 76.08 | 79.97 | 83.11 |
| BERT+AEDA | 81.31 | 87.21 | 87.52 | 76.69 | 80.35 | 82.43 |
| BERT+softEDA | 82.51 | 88.48 | 89.63 | 77.12 | 81.93 | 82.91 |
| BERT+LCDA | 84.26 | 88.61 | 89.81 | 78.98 | 83.15 | 84.33 |
| RoBERTa-CNN | 84.82 | 87.80 | 90.12 | 78.52 | 81.33 | 82.75 |
| RoBERTa-CNN+AEDA | 82.06 | 86.36 | 90.29 | 77.28 | 81.32 | 82.60 |
| RoBERTa-CNN+softEDA | 86.43 | 88.55 | 90.72 | 80.69 | 83.21 | 84.00 |
| RoBERTa-CNN+LCDA | 87.71 | 89.19 | 91.32 | 81.15 | 83.38 | 84.47 |

Tab. 9 Comparison experimental results of different data augmentation methods (accuracy/%)
| Model | Accuracy/% | Model | Accuracy/% |
|---|---|---|---|
| TextRNN | 62.30 | BERT | 81.05 |
| TextRNN+text augmentation | 65.88 | BERT+text augmentation | 83.25 |
| TextRNN+label confusion | 68.87 | BERT+label confusion | 83.17 |
| TextRNN+LCDA | 69.17 | BERT+LCDA | 84.26 |

Tab. 10 Ablation experimental results on 50-THU dataset
[1] TANG D, QIN B, LIU T. Document modeling with gated recurrent neural network for sentiment classification[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2015: 1422-1432.
[2] DING B, LIU L, BING L, et al. DAGA: data augmentation with a generation approach for low-resource tagging tasks[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2020: 6045-6057.
[3] KOBAYASHI S. Contextual augmentation: data augmentation by words with paradigmatic relations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Stroudsburg: ACL, 2018: 452-457.
[4] CHEN H, HAN W, YANG D, et al. DoubleMix: simple interpolation-based data augmentation for text classification[C]// Proceedings of the 29th International Conference on Computational Linguistics. [S.l.]: International Committee on Computational Linguistics, 2022: 4622-4632.
[5] YU X Y, ZENG C, WANG Q, et al. Few-shot news topic classification method based on knowledge enhancement and prompt learning[J]. Journal of Computer Applications, 2024, 44(6): 1767-1774.
[6] SHORTEN C, KHOSHGOFTAAR T M, FURHT B. Text data augmentation for deep learning[J]. Journal of Big Data, 2021, 8: No.101.
[7] MÜLLER R, KORNBLITH S, HINTON G E. When does label smoothing help?[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 4694-4703.
[8] JOHNSON R, ZHANG T. Deep pyramid convolutional neural networks for text categorization[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2017: 562-570.
[9] JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of tricks for efficient text classification[C]// Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Stroudsburg: ACL, 2017: 427-431.
[10] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186.
[11] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]// Proceedings of the 34th Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 1877-1901.
[12] LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. [2023-11-12].
[13] YAO X, QIN Z Z, YANG J. Generative label adversarial text classification model[J]. Journal of Computer Applications, 2024, 44(6): 1781-1785.
[14] ZHANG H F, ZENG C, PAN L, et al. News topic text classification method based on BERT and feature projection network[J]. Journal of Computer Applications, 2022, 42(4): 1116-1124.
[15] ZOPH B, VASUDEVAN V, SHLENS J, et al. Learning transferable architectures for scalable image recognition[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8697-8710.
[16] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[17] SONG Y, WANG J, JIANG T, et al. Targeted sentiment classification with attentional encoder network[C]// Proceedings of the 2019 International Conference on Artificial Neural Networks, LNCS 11730. Cham: Springer, 2019: 93-103.
[18] LUKASIK M, BHOJANAPALLI S, MENON A K, et al. Does label smoothing mitigate label noise?[C]// Proceedings of the 37th International Conference on Machine Learning. New York: JMLR.org, 2020: 6448-6458.
[19] GUO B, HAN S, HAN X, et al. Label confusion learning to enhance text classification models[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 12929-12936.
[20] WEI J, ZOU K. EDA: easy data augmentation techniques for boosting performance on text classification tasks[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2019: 6382-6388.
[21] KARIMI A, ROSSI L, PRATI A. AEDA: an easier data augmentation technique for text classification[C]// Findings of the Association for Computational Linguistics: EMNLP 2021. Stroudsburg: ACL, 2021: 2748-2754.
[22] WU X, GAO C, LIN M, et al. Text smoothing: enhance various data augmentation methods on text classification tasks[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg: ACL, 2022: 871-875.
[23] LIU P, QIU X, HUANG X. Recurrent neural network for text classification with multi-task learning[C]// Proceedings of the 25th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2016: 2873-2879.
[24] KIM Y. Convolutional neural networks for sentence classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1746-1751.
[25] PUTRA D T, SETIAWAN E B. Sentiment analysis on social media with GloVe using combination CNN and RoBERTa[J]. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), 2023, 7(3): 457-563.
[26] SEMARY N A, AHMED W, AMIN K, et al. Improving sentiment classification using a RoBERTa-based hybrid model[J]. Frontiers in Human Neuroscience, 2023, 17: No.1292010.
[27] CHOI J, JIN K, LEE J, et al. softEDA: rethinking rule-based data augmentation with soft labels[EB/OL]. [2023-11-12].