Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (3): 709-714. DOI: 10.11772/j.issn.1001-9081.2023030340

• Artificial Intelligence •


Text classification based on pre-training model and label fusion

Hang YU1, Yanling ZHOU1, Mengxin ZHAI1, Han LIU2

  1. School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China
    2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Received:2023-04-10 Revised:2023-06-30 Accepted:2023-07-04 Online:2023-09-07 Published:2024-03-10
  • Contact: Yanling ZHOU
  • About author: YU Hang, born in 1998 in Huanggang, Hubei, M. S. candidate. His research interests include natural language processing.
    ZHAI Mengxin, born in 2000 in Zhoukou, Henan, M. S. candidate. Her research interests include natural language processing.
    LIU Han, born in 1987 in Renqiu, Hebei, Ph. D., assistant professor. His research interests include natural language processing.
  • Supported by:
    Science and Technology Research Project of Hubei Provincial Department of Education(D20221006)

Abstract:

Accurate classification of massive user text comment data has important economic and social benefits. Most current text classification methods feed the text encoding directly into various classifiers while ignoring the prompt information contained in the label text. To address this problem, a Text and Label Information Fusion Classification model based on RoBERTa (Robustly optimized BERT pretraining approach), namely TLIFC-RoBERTa, was proposed. Firstly, the RoBERTa pre-training model was used to obtain word vectors. Then, a Siamese network structure was used to train the text and label vectors separately, and the label information was mapped onto the text through interactive attention, thereby integrating the label information into the model. Finally, an adaptive fusion layer was set to tightly fuse the text representation with the label representation for classification. Experimental results on the Toutiao and THUCNews datasets show that, compared with mainstream deep learning models such as RA-Labelatt (replacing the static word vectors in the Label-based attention improved model with word vectors trained by RoBERTa-wwm) and LEMC-RoBERTa (RoBERTa combined with Label-Embedding-based Multi-scale Convolution for text classification), TLIFC-RoBERTa achieves the highest accuracy and the best classification performance on user comment datasets.
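The abstract describes the pipeline only at a high level, so the following PyTorch sketch illustrates one plausible reading of it: a shared (Siamese) RoBERTa encoder for text and labels, multi-head attention as the interactive-attention step, and a sigmoid gate as the adaptive fusion layer. The class name TLIFCSketch, the checkpoint hfl/chinese-roberta-wwm-ext, the mean pooling, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

# A minimal, runnable sketch of the TLIFC-RoBERTa idea; every design detail
# not stated in the abstract (pooling, head count, gate form) is assumed.
import torch
import torch.nn as nn
from transformers import BertModel  # Chinese RoBERTa-wwm checkpoints load via the BERT classes

class TLIFCSketch(nn.Module):
    def __init__(self, pretrained="hfl/chinese-roberta-wwm-ext", num_classes=15, hidden=768):
        super().__init__()
        # Siamese structure: one shared encoder produces both text and label vectors
        self.encoder = BertModel.from_pretrained(pretrained)
        # Interactive attention: text tokens attend to the class-label vectors
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        # Adaptive fusion: a learned gate mixes the text and label-aware views
        self.gate = nn.Linear(hidden * 2, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, text_ids, text_mask, label_ids, label_mask):
        text_h = self.encoder(text_ids, attention_mask=text_mask).last_hidden_state     # (B, Lt, H)
        label_h = self.encoder(label_ids, attention_mask=label_mask).last_hidden_state  # (C, Ll, H)
        label_vec = label_h.mean(dim=1)                           # (C, H): one vector per class label
        label_mem = label_vec.unsqueeze(0).expand(text_h.size(0), -1, -1)               # (B, C, H)
        label_aware, _ = self.attn(query=text_h, key=label_mem, value=label_mem)        # (B, Lt, H)
        t = text_h.mean(dim=1)                                    # pooled text representation
        l = label_aware.mean(dim=1)                               # pooled label-aware representation
        g = torch.sigmoid(self.gate(torch.cat([t, l], dim=-1)))  # adaptive fusion weights
        fused = g * t + (1 - g) * l
        return self.classifier(fused)                             # logits over the classes

Here label_ids/label_mask would hold the tokenized label texts of all classes (one sequence per category name), so the same batch of label vectors is reused for every input text; the gated sum is only one common choice for an adaptive fusion layer.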

Key words: text classification, pre-training model, interactive attention, label embedding, RoBERTa (Robustly optimized BERT pretraining approach)
