Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (3): 709-714. DOI: 10.11772/j.issn.1001-9081.2023030340

• Artificial Intelligence •


Text classification based on pre-training model and label fusion

Hang YU1, Yanling ZHOU1, Mengxin ZHAI1, Han LIU2

  1. School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China
    2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Received:2023-04-10 Revised:2023-06-30 Accepted:2023-07-04 Online:2023-09-07 Published:2024-03-10
  • Contact: Yanling ZHOU
  • About author: YU Hang, born in 1998 in Huanggang, Hubei, M. S. candidate. His research interests include natural language processing.
    ZHAI Mengxin, born in 2000 in Zhoukou, Henan, M. S. candidate. Her research interests include natural language processing.
    LIU Han, born in 1987 in Renqiu, Hebei, Ph. D., assistant professor. His research interests include natural language processing.
  • Supported by:
    Science and Technology Research Project of Hubei Provincial Department of Education(D20221006)

Abstract:

Accurate classification of massive user text comment data has important economic and social benefits. Most current text classification methods feed the text encoding directly into various classifiers while ignoring the prompt information contained in the label text. To address this problem, a Text and Label Information Fusion Classification model based on RoBERTa (Robustly optimized BERT pretraining approach), namely TLIFC-RoBERTa, was proposed. Firstly, the RoBERTa pre-training model was used to obtain word vectors. Then, a Siamese network structure was used to train the text and label vectors separately, and the label information was mapped onto the text through interactive attention, thereby integrating the label information into the model. Finally, an adaptive fusion layer was set to tightly fuse the text representation with the label representation for classification. Experimental results on the Toutiao and THUCNews datasets show that, compared with mainstream deep learning models such as RA-Labelatt (replacing the static word vectors in the Label-based attention improved model with word vectors trained by RoBERTa-wwm) and LEMC-RoBERTa (RoBERTa combined with Label-Embedding-based Multi-scale Convolution for text classification), TLIFC-RoBERTa achieves the highest accuracy and the best classification performance on user comment datasets.
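The abstract describes the pipeline only at a high level, so the following PyTorch sketch illustrates one plausible reading of it: a shared (Siamese) RoBERTa encoder for text and labels, multi-head attention as the interactive-attention step, and a sigmoid gate as the adaptive fusion layer. The class name TLIFCSketch, the checkpoint hfl/chinese-roberta-wwm-ext, the mean pooling, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

# A minimal, runnable sketch of the TLIFC-RoBERTa idea; every design detail
# not stated in the abstract (pooling, head count, gate form) is assumed.
import torch
import torch.nn as nn
from transformers import BertModel  # Chinese RoBERTa-wwm checkpoints load via the BERT classes

class TLIFCSketch(nn.Module):
    def __init__(self, pretrained="hfl/chinese-roberta-wwm-ext", num_classes=15, hidden=768):
        super().__init__()
        # Siamese structure: one shared encoder produces both text and label vectors
        self.encoder = BertModel.from_pretrained(pretrained)
        # Interactive attention: text tokens attend to the class-label vectors
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        # Adaptive fusion: a learned gate mixes the text and label-aware views
        self.gate = nn.Linear(hidden * 2, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, text_ids, text_mask, label_ids, label_mask):
        text_h = self.encoder(text_ids, attention_mask=text_mask).last_hidden_state     # (B, Lt, H)
        label_h = self.encoder(label_ids, attention_mask=label_mask).last_hidden_state  # (C, Ll, H)
        label_vec = label_h.mean(dim=1)                           # (C, H): one vector per class label
        label_mem = label_vec.unsqueeze(0).expand(text_h.size(0), -1, -1)               # (B, C, H)
        label_aware, _ = self.attn(query=text_h, key=label_mem, value=label_mem)        # (B, Lt, H)
        t = text_h.mean(dim=1)                                    # pooled text representation
        l = label_aware.mean(dim=1)                               # pooled label-aware representation
        g = torch.sigmoid(self.gate(torch.cat([t, l], dim=-1)))  # adaptive fusion weights
        fused = g * t + (1 - g) * l
        return self.classifier(fused)                             # logits over the classes

Here label_ids/label_mask would hold the tokenized label texts of all classes (one sequence per category name), so the same batch of label vectors is reused for every input text; the gated sum is only one common choice for an adaptive fusion layer.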

Key words: text classification, pre-training model, interactive attention, label embedding, RoBERTa (Robustly optimized BERT pretraining approach)
