Journal of Computer Applications (sole official website) ›› 2025, Vol. 45 ›› Issue (4): 1113-1119. DOI: 10.11772/j.issn.1001-9081.2024040550

• Artificial Intelligence •

Data augmentation technique incorporating label confusion for Chinese text classification

Haitao SUN1, Jiayu LIN2, Zuhong LIANG1,3, Jie GUO1

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou Guangdong 510006, China
  2. Library, Guangdong University of Technology, Guangzhou Guangdong 510006, China
  3. Experimental Teaching Department, Guangdong University of Technology, Guangzhou Guangdong 510006, China
  • Received: 2024-04-30 Revised: 2024-08-14 Accepted: 2024-08-16 Online: 2025-04-08 Published: 2025-04-10
  • Contact: Jiayu LIN
  • About the authors: SUN Haitao, born in 1999, a native of Changde, Hunan, M.S. candidate, CCF member. His research interests include data augmentation and data mining.
    LIANG Zuhong, born in 1980, a native of Huizhou, Guangdong, Ph.D., professor-level senior engineer. His research interests include machine learning and intelligent computing.
    GUO Jie, born in 1998, a native of Changde, Hunan, M.S. candidate. Her research interests include recommendation systems and data mining.
  • Supported by:
    Ministry of Education University-Industry Cooperation Collaborative Education Project (220901229305933)

Abstract:

Traditional data augmentation techniques, such as synonym substitution, random insertion, and random deletion, may change the original semantics of a text and even cause the loss of critical information. Moreover, data in text classification tasks typically consist of a text part and a label part, yet traditional data augmentation methods address only the text part. To address these issues, a Label Confusion incorporated Data Augmentation (LCDA) technique was proposed to enhance data comprehensively from both the text and label aspects. On the text side, random insertion and replacement of punctuation marks, together with completion of end-of-sentence punctuation, were applied to increase textual diversity while preserving all textual information and its order. On the label side, a label confusion approach was used to generate a simulated label distribution that replaces the traditional one-hot label distribution, so as to better reflect the relationships between instances and labels as well as among labels. Experiments were conducted on few-shot datasets constructed from two Chinese news datasets, THUCNews (TsingHua University Chinese News) and Toutiao, with the technique combined with the TextCNN, TextRNN, BERT (Bidirectional Encoder Representations from Transformers), and RoBERTa-CNN (Robustly optimized BERT approach Convolutional Neural Network) text classification models. The results indicate that all models improve significantly compared with their performance before augmentation. Specifically, on 50-THU, a dataset constructed from THUCNews, the accuracies of the four models combined with LCDA are improved by 1.19, 6.87, 3.21, and 2.89 percentage points, respectively, compared with those before augmentation, and by 0.78, 7.62, 1.75, and 1.28 percentage points, respectively, compared with those of the same models augmented with the softEDA (Easy Data Augmentation with soft labels) method. The results of processing both the text and the labels show that LCDA significantly improves model accuracy, especially in application scenarios with little data.

Key words: data augmentation, text classification, label confusion, Chinese news topic, pre-trained model
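
The text-side augmentation named in the abstract (random punctuation insertion, punctuation replacement, and end-of-sentence completion) can be illustrated with a minimal Python sketch. The abstract does not give the punctuation inventory, the probabilities, or the exact procedure, so `PUNCTUATION`, `insert_rate`, `replace_rate`, and both function names below are assumptions made for illustration only, not the authors' implementation.

```python
import random

# Hypothetical punctuation inventory; the set used in the paper is not given in the abstract.
PUNCTUATION = [",", "。", "!", "?", ";", ":", "、"]
SENTENCE_END = ("。", "!", "?")


def augment_punctuation(text, insert_rate=0.1, replace_rate=0.3, rng=None):
    """Return a variant of `text` that differs from it only in punctuation."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in PUNCTUATION:
            # Replacement: sometimes swap an existing mark for a randomly chosen one.
            if rng.random() < replace_rate:
                ch = rng.choice(PUNCTUATION)
            out.append(ch)
        else:
            out.append(ch)
            # Insertion: sometimes add a mark after a non-punctuation character.
            if rng.random() < insert_rate:
                out.append(rng.choice(PUNCTUATION))
    # Completion: make sure the result ends with a sentence-final mark.
    if out and out[-1] in PUNCTUATION:
        if out[-1] not in SENTENCE_END:
            out[-1] = "。"
    else:
        out.append("。")
    return "".join(out)


def strip_punctuation(text):
    """Remove every mark in PUNCTUATION; used to check that the content is preserved."""
    return "".join(ch for ch in text if ch not in PUNCTUATION)


if __name__ == "__main__":
    source = "今天股市大涨,投资者情绪高涨"
    print(augment_punctuation(source, rng=random.Random(0)))
```

Because only punctuation is touched, removing all marks from the source and from the augmented text yields the same string, which is the sense in which such augmentation keeps the textual information and its order intact.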

CLC number:
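
On the label side, the abstract states that a simulated label distribution replaces the one-hot target. One common way to build such a distribution, sketched below, is to blend the one-hot vector with the softmax-normalised similarity between the instance representation and each label embedding. The abstract does not specify the formulation used in LCDA, so the dot-product similarity, the weight `alpha`, and the function names are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def simulated_label_distribution(instance_vec, label_vecs, true_index, alpha=4.0):
    """Blend the one-hot target with instance-to-label similarities.

    instance_vec : text representation from some encoder (assumed to be given)
    label_vecs   : one embedding per class (assumed to be given)
    alpha        : hypothetical weight controlling how strongly the true label dominates
    """
    # Similarity of the instance to every label: dot products turned into probabilities.
    sims = [sum(x * y for x, y in zip(instance_vec, lv)) for lv in label_vecs]
    confusion = softmax(sims)
    # Add the weighted one-hot vector, then renormalise into a probability distribution.
    mixed = [
        alpha * (1.0 if i == true_index else 0.0) + c
        for i, c in enumerate(confusion)
    ]
    return softmax(mixed)


if __name__ == "__main__":
    # Toy embeddings: class 1 lies close to class 0, class 2 is unrelated.
    labels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(simulated_label_distribution([1.0, 0.0], labels, true_index=0))
```

In this toy case the true class keeps the largest probability, while the similar class receives more mass than the unrelated one, which is how a soft target can encode relationships among labels that a one-hot vector discards.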