Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 335-342. DOI: 10.11772/j.issn.1001-9081.2021122221

• Artificial Intelligence •

Weakly-supervised text classification with label semantic enhancement

Chengyu LIN1,2, Lei WANG1, Cong XUE1

  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
    2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2022-01-06 Revised: 2022-03-22 Accepted: 2022-04-13 Online: 2023-02-08 Published: 2023-02-10
  • Contact: Cong XUE
  • About the authors: LIN Chengyu, born in 1997, M. S. candidate. His research interests include natural language processing.
    WANG Lei, born in 1985, Ph. D., senior engineer. His research interests include cryptographic engineering and applications, identity management and network trust, and intelligent big data processing.
  • Supported by:
    Key Program of the National Natural Science Foundation of China (U1636220)

Abstract:

To address the problems of category-vocabulary noise and label noise in weakly-supervised text classification, a weakly-supervised text classification model with label semantic enhancement was proposed. First, the category vocabulary was denoised on the basis of the contextual semantic representations of words, so as to construct a highly accurate category vocabulary. Then, a word-category prediction task based on the MASK mechanism was constructed to fine-tune the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, so that the relationships between words and categories were learned. Finally, a self-training module incorporating label semantics was used to make full use of all the data and reduce the influence of label noise, converting word-level semantics into sentence-level semantics and thereby accurately predicting the categories of text sequences. Experimental results show that, compared with LOTClass (Label-name-Only Text Classification), the state-of-the-art weakly-supervised text classification model, the proposed method improves classification accuracy by 5.29, 1.41 and 1.86 percentage points on the public datasets THUCNews, AG News and IMDB, respectively.
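
The masked word-category prediction step described above can be illustrated with a short sketch. This is a minimal illustration, not the authors' released code: it assumes the HuggingFace transformers library, the bert-base-uncased checkpoint, and two hypothetical categories with hand-picked label words; the paper's actual fine-tuning objective, category-vocabulary denoising, and label-semantic self-training module are not reproduced.

import torch
from transformers import BertForMaskedLM, BertTokenizer

# Hypothetical (already denoised) category vocabulary: label words per class.
CATEGORY_VOCAB = {
    "sports": ["sports", "game", "team"],
    "business": ["business", "market", "economy"],
}

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def word_category_scores(text: str) -> dict:
    """Mask each token in turn, read the MLM probability that BERT assigns
    to every category's label words at that position, then average the
    word-level evidence into a sentence-level score per category."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    scores = {c: 0.0 for c in CATEGORY_VOCAB}
    positions = range(1, input_ids.size(0) - 1)  # skip [CLS] and [SEP]
    for pos in positions:
        masked = input_ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        probs = logits.softmax(dim=-1)
        for cat, words in CATEGORY_VOCAB.items():
            ids = tokenizer.convert_tokens_to_ids(words)
            scores[cat] += probs[ids].sum().item()
    return {c: s / max(len(positions), 1) for c, s in scores.items()}

print(word_category_scores("the home team won the final game"))

In the full model, word-level category distributions of this kind would serve as weak supervision for fine-tuning, and the self-training module that injects label semantics would then convert such word-level evidence into sentence-level predictions while down-weighting noisy labels.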

Key words: weakly-supervised text classification, BERT (Bidirectional Encoder Representations from Transformers), MASK mechanism, label semantics, label noise, self-training

CLC number: