Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (4): 1113-1119. DOI: 10.11772/j.issn.1001-9081.2024040550
• Artificial intelligence •
Data augmentation technique incorporating label confusion for Chinese text classification
Haitao SUN1, Jiayu LIN2, Zuhong LIANG1,3, Jie GUO1
Received: 2024-04-30
Revised: 2024-08-14
Accepted: 2024-08-16
Online: 2025-04-08
Published: 2025-04-10
Contact: Jiayu LIN
About author: SUN Haitao, born in 1999 in Changde, Hunan, M. S. candidate and CCF member. His research interests include data augmentation and data mining.
Haitao SUN, Jiayu LIN, Zuhong LIANG, Jie GUO. Data augmentation technique incorporating label confusion for Chinese text classification[J]. Journal of Computer Applications, 2025, 45(4): 1113-1119.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024040550
| Sample type | Sample content |
|---|---|
| Original sample | 两天价网站背后重重迷雾:做个网站究竟要多少钱 |
| Augmented sample 1 | 两:天价网站背后重重迷雾,,做个网站究竟要多少钱。 |
| Augmented sample 2 | 两!天价网站背后重!重迷雾:做个网站: 究竟要多少钱? |
| Augmented sample 3 | 两天价网。站背后重重迷,雾,;做个网站究竟要多少。 钱? |
| Augmented sample 4 | 两。天价网站背后重重迷雾:,做个?站究, 竟要多少钱? |

Tab. 1 Examples after text augmentation (the Chinese sample sentences are kept verbatim, since the augmentation operates on them directly)
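The augmented samples in Tab. 1 are produced by randomly inserting punctuation marks into the original sentence. Below is a minimal sketch of this kind of augmentation, assuming AEDA-style per-position insertion governed by the INSERT_PROB hyperparameter tuned in Tab. 5; the punctuation set and function name are illustrative, not the authors' exact implementation.

```python
import random

# Full-width punctuation marks assumed for Chinese text; the exact set
# used in the paper is not reproduced here.
PUNCTUATIONS = ["。", ",", "!", "?", ":", ";"]

def augment(sentence: str, insert_prob: float = 0.2) -> str:
    """Randomly insert punctuation between characters.

    Each position independently receives a random mark with probability
    `insert_prob` (the INSERT_PROB hyperparameter of Tab. 5), mimicking
    the augmented samples shown in Tab. 1.
    """
    out = []
    for ch in sentence:
        out.append(ch)
        if random.random() < insert_prob:
            out.append(random.choice(PUNCTUATIONS))
    return "".join(out)

if __name__ == "__main__":
    random.seed(42)
    print(augment("两天价网站背后重重迷雾:做个网站究竟要多少钱"))
```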
| Misclassified sample | Correct class | Wrong class |
|---|---|---|
| 三联书店建起书香巷 | Technology | Education |
| Google多项功能前晚集中“瘫痪” | Technology | Society |
| 借款纠纷牵出房产商伪造公文开发楼盘案 | Society | Real estate |

Tab. 2 Examples of misclassified samples
| Dataset | Training samples | Validation samples | Test samples | Label classes |
|---|---|---|---|---|
| THUCNews | 180 000 | 10 000 | 10 000 | 10 |
| 50-THU | 500 | 10 000 | 10 000 | 10 |
| 200-THU | 2 000 | 10 000 | 10 000 | 10 |
| 500-THU | 5 000 | 10 000 | 10 000 | 10 |
| Toutiao | 130 000 | 10 000 | 10 000 | 13 |
| 50-Toutiao | 650 | 10 000 | 10 000 | 13 |
| 200-Toutiao | 2 600 | 10 000 | 10 000 | 13 |
| 500-Toutiao | 6 500 | 10 000 | 10 000 | 13 |

Tab. 3 Datasets used in experiments
| Ground truth | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |

Tab. 4 Confusion matrix of classification results
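The accuracy figures reported in the following tables follow directly from this matrix; a small helper with the standard definition (the function name is chosen here for illustration):

```python
def accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Accuracy = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)
```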
| INSERT_PROB | Accuracy/% | INSERT_PROB | Accuracy/% |
|---|---|---|---|
| 0.1 | 83.99 | 0.4 | 82.69 |
| 0.2 | 84.26 | 0.5 | 83.11 |
| 0.3 | 83.69 | | |

Tab. 5 Experimental results of different INSERT_PROB values
| Augmented sentences per sample | 50-THU | 200-THU | 500-THU |
|---|---|---|---|
| 0 | 62.30 | 73.12 | 80.27 |
| 1 | 66.47 | 76.48 | 81.63 |
| 2 | 67.28 | 77.87 | 81.20 |
| 4 | 69.17 | 77.48 | 80.64 |

Tab. 6 Experimental results of different augmentation scales (accuracy on THUCNews subsets/%)
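As a sketch of how the augmentation scale in Tab. 6 could be applied, the snippet below adds k augmented copies of every sample, reusing the hypothetical `augment` helper from the sketch after Tab. 1; the function name and dataset layout are illustrative assumptions.

```python
def expand(samples: list[tuple[str, int]], k: int) -> list[tuple[str, int]]:
    """Return the dataset plus k augmented copies of each (text, label) pair.

    k = 0 leaves the dataset unchanged (first row of Tab. 6); `augment` is
    the hypothetical punctuation-insertion helper sketched earlier.
    """
    out = list(samples)
    for text, label in samples:
        out.extend((augment(text), label) for _ in range(k))
    return out
```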
| Smoothing state | Label vector |
|---|---|
| Before label smoothing | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| After label smoothing | [0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01] |

Tab. 7 Labels before and after applying label smoothing technique
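Tab. 7 matches the standard label-smoothing formula y' = (1 − ε)·y + ε/K with ε = 0.1 and K = 10 classes: the true class receives 0.9 + 0.01 = 0.91, every other class 0.01. A minimal sketch, with ε inferred from the table:

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Label smoothing: y' = (1 - eps) * y + eps / K over K classes."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

# With K = 10 and eps = 0.1 this reproduces the vectors in Tab. 7:
print(smooth_labels(np.eye(10)[0]))  # [0.91 0.01 0.01 ... 0.01]
```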
| Model | 50-THU | 200-THU | 500-THU |
|---|---|---|---|
| BERT | 81.05 | 87.21 | 88.91 |
| BERT+text augmentation+LS | 83.27 | 87.74 | 89.28 |
| BERT+LCDA | 84.26 | 88.61 | 89.81 |

Tab. 8 Comparison experimental results of label confusion and label smoothing (accuracy on THUCNews subsets/%)
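The label-confusion component that LCDA builds on replaces the one-hot (or smoothed) target with a simulated label distribution that depends on the input. Below is a minimal sketch in the spirit of the Label Confusion Model of Guo et al. [19]; the dot-product similarity, the mixing weight `alpha`, and all names are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def simulated_label(instance_repr: np.ndarray,
                    label_embeds: np.ndarray,
                    one_hot: np.ndarray,
                    alpha: float = 4.0) -> np.ndarray:
    """Blend the one-hot target with an input-dependent confusion distribution.

    label_embeds has shape (K, d), instance_repr shape (d,). Labels whose
    embeddings are similar to the input receive probability mass, so
    confusable classes (cf. Tab. 2) are no longer penalized as equally wrong.
    """
    confusion = softmax(label_embeds @ instance_repr)  # (K,) label confusion distribution
    return softmax(alpha * one_hot + confusion)        # simulated label distribution
```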
| Model | 50-THU | 200-THU | 500-THU | 50-Toutiao | 200-Toutiao | 500-Toutiao |
|---|---|---|---|---|---|---|
| TextCNN | 72.08 | 80.49 | 82.57 | 54.18 | 68.03 | 73.89 |
| TextCNN+AEDA | 72.55 | 79.99 | 82.32 | 55.24 | 68.23 | 74.25 |
| TextCNN+softEDA | 72.49 | 80.51 | 82.73 | 56.06 | 69.24 | 74.33 |
| TextCNN+LCDA | 73.27 | 80.89 | 82.84 | 57.13 | 69.22 | 74.47 |
| TextRNN | 62.30 | 73.12 | 80.27 | 41.76 | 60.95 | 68.15 |
| TextRNN+AEDA | 64.88 | 74.02 | 79.57 | 48.18 | 62.28 | 69.78 |
| TextRNN+softEDA | 61.55 | 74.45 | 78.72 | 44.58 | 61.72 | 70.22 |
| TextRNN+LCDA | 69.17 | 77.48 | 80.64 | 50.32 | 65.03 | 70.64 |
| BERT | 81.05 | 87.18 | 88.91 | 76.08 | 79.97 | 83.11 |
| BERT+AEDA | 81.31 | 87.21 | 87.52 | 76.69 | 80.35 | 82.43 |
| BERT+softEDA | 82.51 | 88.48 | 89.63 | 77.12 | 81.93 | 82.91 |
| BERT+LCDA | 84.26 | 88.61 | 89.81 | 78.98 | 83.15 | 84.33 |
| RoBERTa-CNN | 84.82 | 87.80 | 90.12 | 78.52 | 81.33 | 82.75 |
| RoBERTa-CNN+AEDA | 82.06 | 86.36 | 90.29 | 77.28 | 81.32 | 82.60 |
| RoBERTa-CNN+softEDA | 86.43 | 88.55 | 90.72 | 80.69 | 83.21 | 84.00 |
| RoBERTa-CNN+LCDA | 87.71 | 89.19 | 91.32 | 81.15 | 83.38 | 84.47 |

Tab. 9 Comparison experimental results of different data augmentation methods (accuracy/%)
| Model | Accuracy/% | Model | Accuracy/% |
|---|---|---|---|
| TextRNN | 62.30 | BERT | 81.05 |
| TextRNN+text augmentation | 65.88 | BERT+text augmentation | 83.25 |
| TextRNN+label confusion | 68.87 | BERT+label confusion | 83.17 |
| TextRNN+LCDA | 69.17 | BERT+LCDA | 84.26 |

Tab. 10 Ablation experimental results on 50-THU dataset
[1] TANG D, QIN B, LIU T. Document modeling with gated recurrent neural network for sentiment classification[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2015: 1422-1432.
[2] DING B, LIU L, BING L, et al. DAGA: data augmentation with a generation approach for low-resource tagging tasks[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2020: 6045-6057.
[3] KOBAYASHI S. Contextual augmentation: data augmentation by words with paradigmatic relations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Stroudsburg: ACL, 2018: 452-457.
[4] CHEN H, HAN W, YANG D, et al. DoubleMix: simple interpolation-based data augmentation for text classification[C]// Proceedings of the 29th International Conference on Computational Linguistics. [S.l.]: International Committee on Computational Linguistics, 2022: 4622-4632.
[5] YU X Y, ZENG C, WANG Q, et al. Few-shot news topic classification method based on knowledge enhancement and prompt learning[J]. Journal of Computer Applications, 2024, 44(6): 1767-1774.
[6] SHORTEN C, KHOSHGOFTAAR T M, FURHT B. Text data augmentation for deep learning[J]. Journal of Big Data, 2021, 8: No.101.
[7] MÜLLER R, KORNBLITH S, HINTON G E. When does label smoothing help?[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 4694-4703.
[8] JOHNSON R, ZHANG T. Deep pyramid convolutional neural networks for text categorization[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2017: 562-570.
[9] JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of tricks for efficient text classification[C]// Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Stroudsburg: ACL, 2017: 427-431.
[10] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186.
[11] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]// Proceedings of the 34th Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 1877-1901.
[12] LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. [2023-11-12].
[13] YAO X, QIN Z Z, YANG J. Generative label adversarial text classification model[J]. Journal of Computer Applications, 2024, 44(6): 1781-1785.
[14] ZHANG H F, ZENG C, PAN L, et al. News topic text classification method based on BERT and feature projection network[J]. Journal of Computer Applications, 2022, 42(4): 1116-1124.
[15] ZOPH B, VASUDEVAN V, SHLENS J, et al. Learning transferable architectures for scalable image recognition[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8697-8710.
[16] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[17] SONG Y, WANG J, JIANG T, et al. Targeted sentiment classification with attentional encoder network[C]// Proceedings of the 2019 International Conference on Artificial Neural Networks, LNCS 11730. Cham: Springer, 2019: 93-103.
[18] LUKASIK M, BHOJANAPALLI S, MENON A K, et al. Does label smoothing mitigate label noise?[C]// Proceedings of the 37th International Conference on Machine Learning. New York: JMLR.org, 2020: 6448-6458.
[19] GUO B, HAN S, HAN X, et al. Label confusion learning to enhance text classification models[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 12929-12936.
[20] WEI J, ZOU K. EDA: easy data augmentation techniques for boosting performance on text classification tasks[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2019: 6382-6388.
[21] KARIMI A, ROSSI L, PRATI A. AEDA: an easier data augmentation technique for text classification[C]// Findings of the Association for Computational Linguistics: EMNLP 2021. Stroudsburg: ACL, 2021: 2748-2754.
[22] WU X, GAO C, LIN M, et al. Text smoothing: enhance various data augmentation methods on text classification tasks[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg: ACL, 2022: 871-875.
[23] LIU P, QIU X, HUANG X. Recurrent neural network for text classification with multi-task learning[C]// Proceedings of the 25th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2016: 2873-2879.
[24] KIM Y. Convolutional neural networks for sentence classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1746-1751.
[25] PUTRA D T, SETIAWAN E B. Sentiment analysis on social media with GloVe using combination CNN and RoBERTa[J]. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), 2023, 7(3): 457-563.
[26] SEMARY N A, AHMED W, AMIN K, et al. Improving sentiment classification using a RoBERTa-based hybrid model[J]. Frontiers in Human Neuroscience, 2023, 17: No.1292010.
[27] CHOI J, JIN K, LEE J, et al. softEDA: rethinking rule-based data augmentation with soft labels[EB/OL]. [2023-11-12].