Data augmentation technique incorporating label confusion for Chinese text classification

Journal of Computer Applications (official website), 2025, Vol. 45, Issue (4): 1113-1119. DOI: 10.11772/j.issn.1001-9081.2024040550

Haitao SUN¹, Jiayu LIN², Zuhong LIANG¹,³, Jie GUO¹

1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou Guangdong 510006, China
2. Library, Guangdong University of Technology, Guangzhou Guangdong 510006, China
3. Experimental Teaching Department, Guangdong University of Technology, Guangzhou Guangdong 510006, China

Received: 2024-04-30 Revised: 2024-08-14 Accepted: 2024-08-16 Online: 2025-04-08 Published: 2025-04-10
Contact: Jiayu LIN
About the authors: SUN Haitao, born in 1999 in Changde, Hunan, M.S. candidate, CCF member. His research interests include data augmentation and data mining.
LIANG Zuhong, born in 1980 in Huizhou, Guangdong, Ph.D., professor-level senior engineer. His research interests include machine learning and intelligent computing.
GUO Jie, born in 1998 in Changde, Hunan, M.S. candidate. Her research interests include recommendation systems and data mining.
Supported by: Ministry of Education University-Industry Cooperation Collaborative Education Project (220901229305933)

Abstract

Traditional data augmentation techniques, such as synonym substitution, random insertion, and random deletion, may change the original semantics of a text and even cause critical information to be lost. Moreover, data in text classification tasks typically consist of a text part and a label part, yet traditional data augmentation methods address only the text part. To address these issues, a Label Confusion incorporated Data Augmentation (LCDA) technique is proposed that enhances the data from both the text and the label aspects. On the text side, punctuation marks are randomly inserted and replaced and end-of-sentence punctuation is completed, which increases textual diversity while preserving all of the original text content and its order. On the label side, a label confusion approach generates a simulated label distribution that replaces the traditional one-hot label distribution, so as to better reflect the relationships between instances and labels as well as among labels. Experiments were conducted on few-shot datasets constructed from two Chinese news datasets, THUCNews (TsingHua University Chinese News) and Toutiao, with the proposed technique combined with the TextCNN, TextRNN, BERT (Bidirectional Encoder Representations from Transformers), and RoBERTa-CNN (Robustly optimized BERT approach Convolutional Neural Network) text classification models. The results indicate that all four models improve significantly compared with their performance before augmentation. Specifically, on 50-THU, a dataset constructed from THUCNews, the accuracies of the four models combined with LCDA are 1.19, 6.87, 3.21, and 2.89 percentage points higher, respectively, than before augmentation, and 0.78, 7.62, 1.75, and 1.28 percentage points higher, respectively, than those of the same models combined with the softEDA (Easy Data Augmentation with soft labels) method. The results of processing both the text and the label dimensions show that LCDA significantly improves model accuracy, particularly in application scenarios with limited data.

Keywords: data augmentation, text classification, label confusion, Chinese news topic, pre-trained model

CLC number: TP391.1

Cite this article: Haitao SUN, Jiayu LIN, Zuhong LIANG, Jie GUO. Data augmentation technique incorporating label confusion for Chinese text classification[J]. Journal of Computer Applications, 2025, 45(4): 1113-1119.

Figures/Tables: 11

Sample type	Sample content
Original sample	两天价网站背后重重迷雾：做个网站究竟要多少钱
Augmented sample 1	两：天价网站背后重重迷雾，，做个网站究竟要多少钱。
Augmented sample 2	两！天价网站背后重！重迷雾：做个网站：究竟要多少钱？
Augmented sample 3	两天价网。站背后重重迷，雾，；做个网站究竟要多少。钱？
Augmented sample 4	两。天价网站背后重重迷雾：，做个？站究，竟要多少钱？
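The text-side augmentation illustrated by these samples can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the punctuation inventories, the `replace_prob` parameter, the function names, and the per-character reading of the insertion probability are all assumptions. Only the three operations (random insertion, random replacement, end-of-sentence completion) and the existence of an INSERT_PROB setting come from the source.

```python
import random

# Assumed punctuation inventories; the exact sets used in the paper are not given here.
PUNCTUATION = ["，", "。", "！", "？", "：", "；"]
END_PUNCTUATION = ["。", "！", "？"]


def strip_punctuation(text):
    """Return the text with every mark in PUNCTUATION removed."""
    return "".join(ch for ch in text if ch not in PUNCTUATION)


def augment_punctuation(text, insert_prob=0.2, replace_prob=0.5, rng=None):
    """Hypothetical sketch of punctuation-only augmentation.

    - An existing punctuation mark is swapped for a random one
      with probability replace_prob (assumed parameter).
    - After each non-punctuation character, a random mark is inserted
      with probability insert_prob (assumed reading of INSERT_PROB).
    - If the result does not end with sentence-final punctuation,
      a random end mark is appended.
    Non-punctuation characters and their order are never changed.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in PUNCTUATION:
            if rng.random() < replace_prob:
                ch = rng.choice(PUNCTUATION)
            out.append(ch)
        else:
            out.append(ch)
            if rng.random() < insert_prob:
                out.append(rng.choice(PUNCTUATION))
    if not out or out[-1] not in END_PUNCTUATION:
        out.append(rng.choice(END_PUNCTUATION))
    return "".join(out)


if __name__ == "__main__":
    source = "两天价网站背后重重迷雾：做个网站究竟要多少钱"
    for _ in range(4):
        print(augment_punctuation(source))
```

Because only punctuation is touched, removing all punctuation from an augmented sample gives back the characters of the original in the same order, which is the property the abstract emphasizes.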

Misclassified sample	Correct class	Predicted (incorrect) class
三联书店建起书香巷	Technology	Education
Google多项功能前晚集中“瘫痪”	Technology	Society
借款纠纷牵出房产商伪造公文开发楼盘案	Society	Real estate

Dataset	Training samples	Validation samples	Test samples	Label classes
THUCNews	180 000	10 000	10 000	10
50-THU	500	10 000	10 000	10
200-THU	2 000	10 000	10 000	10
500-THU	5 000	10 000	10 000	10
Toutiao	130 000	10 000	10 000	13
50-Toutiao	650	10 000	10 000	13
200-Toutiao	2 600	10 000	10 000	13
500-Toutiao	6 500	10 000	10 000	13
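The training-set sizes above are consistent with drawing a fixed number of samples per class (for example, 500 = 50 × 10 for 50-THU and 650 = 50 × 13 for 50-Toutiao). A minimal sketch of such a per-class subsampling step is given below. The function name, the `(text, label)` data layout, and the use of seeded uniform random sampling are assumptions for illustration; the page does not describe the actual selection procedure.

```python
import random
from collections import defaultdict


def sample_k_per_class(examples, k, seed=0):
    """Draw k examples per label from a list of (text, label) pairs.

    Hypothetical helper: uniform sampling without replacement,
    with a fixed seed so the subset is reproducible.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < k:
            raise ValueError(f"label {label!r} has only {len(items)} examples")
        subset.extend(rng.sample(items, k))
    rng.shuffle(subset)  # avoid class-ordered batches during training
    return subset


if __name__ == "__main__":
    # Toy stand-in for a 10-class training set.
    full = [(f"text-{c}-{i}", c) for c in range(10) for i in range(100)]
    few_shot = sample_k_per_class(full, k=50)
    print(len(few_shot))
```

Under this reading, only the training split is reduced; the validation and test sets keep their 10 000 samples, as the table shows.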

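Actual class	Predicted positive	Predicted negative
Actual positive	True positive (TP)	False negative (FN)
Actual negative	False positive (FP)	True negative (TN)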
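For reference, the standard metrics derived from these four confusion-matrix counts can be computed as in the sketch below. The function name and the zero-division convention (returning 0.0) are illustrative choices; the result tables on this page report accuracy only.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with an empty denominator are reported as 0.0 (a convention
    chosen for this sketch).
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(binary_metrics(tp=40, fn=10, fp=5, tn=45))
```

In the multi-class setting used here, accuracy reduces to the number of correctly classified test samples divided by the total number of test samples.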

Accuracy (%) on the THUCNews-derived datasets by number of augmented sentences
Number of augmented sentences	50-THU	200-THU	500-THU
0	62.30	73.12	80.27
1	66.47	76.48	81.63
2	67.28	77.87	81.20
4	69.17	77.48	80.64

Accuracy (%) on the THUCNews-derived datasets
Model	50-THU	200-THU	500-THU
BERT	81.05	87.21	88.91
BERT+text augmentation+LS	83.27	87.74	89.28
BERT+LCDA	84.26	88.61	89.81
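To make the difference between the two soft-label variants compared above more concrete, the sketch below contrasts uniform label smoothing (LS) with a simulated label distribution in the general style of label confusion learning. It is an assumption-laden illustration: the dot-product similarity, the mixing weight `alpha`, the toy vectors, and all function names are hypothetical, and the learned layers that a real model would place around the similarity step are omitted. The exact formulation used in the paper may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def label_smoothing(true_idx, num_classes, eps=0.1):
    """Uniform label smoothing: every wrong class gets the same mass."""
    off = eps / num_classes
    return [(1.0 - eps) + off if i == true_idx else off
            for i in range(num_classes)]


def simulated_label_distribution(instance_vec, label_vecs, true_idx, alpha=4.0):
    """Label-confusion-style target (hypothetical simplification).

    1. Score each label by its dot product with the instance vector.
    2. Softmax the scores into a confusion distribution.
    3. Add alpha times the one-hot label and softmax again, so the true
       class stays dominant while similar labels keep more mass.
    """
    sims = [sum(a * b for a, b in zip(instance_vec, lv)) for lv in label_vecs]
    confusion = softmax(sims)
    mixed = [alpha * (1.0 if i == true_idx else 0.0) + c
             for i, c in enumerate(confusion)]
    return softmax(mixed)


if __name__ == "__main__":
    # Toy 2-d label embeddings: label 1 is close to label 0, label 2 is not.
    labels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    instance = [1.0, 0.0]
    print(label_smoothing(0, 3))
    print(simulated_label_distribution(instance, labels, true_idx=0))
```

With label smoothing the two wrong classes receive identical probability, whereas the simulated distribution gives the label that is closer to the instance a larger share, which is the instance-label and label-label relationship the abstract refers to.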

Accuracy (%) by model on the THUCNews-derived and Toutiao-derived datasets
Model	50-THU	200-THU	500-THU	50-Toutiao	200-Toutiao	500-Toutiao
TextCNN	72.08	80.49	82.57	54.18	68.03	73.89
TextCNN+AEDA	72.55	79.99	82.32	55.24	68.23	74.25
TextCNN+softEDA	72.49	80.51	82.73	56.06	69.24	74.33
TextCNN+LCDA	73.27	80.89	82.84	57.13	69.22	74.47
TextRNN	62.30	73.12	80.27	41.76	60.95	68.15
TextRNN+AEDA	64.88	74.02	79.57	48.18	62.28	69.78
TextRNN+softEDA	61.55	74.45	78.72	44.58	61.72	70.22
TextRNN+LCDA	69.17	77.48	80.64	50.32	65.03	70.64
BERT	81.05	87.18	88.91	76.08	79.97	83.11
BERT+AEDA	81.31	87.21	87.52	76.69	80.35	82.43
BERT+softEDA	82.51	88.48	89.63	77.12	81.93	82.91
BERT+LCDA	84.26	88.61	89.81	78.98	83.15	84.33
RoBERTa-CNN	84.82	87.80	90.12	78.52	81.33	82.75
RoBERTa-CNN+AEDA	82.06	86.36	90.29	77.28	81.32	82.60
RoBERTa-CNN+softEDA	86.43	88.55	90.72	80.69	83.21	84.00
RoBERTa-CNN+LCDA	87.71	89.19	91.32	81.15	83.38	84.47

INSERT_PROB value	Accuracy/%
0.1	83.99
0.2	84.26
0.3	83.69
0.4	82.69
0.5	83.11
