Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 335-342. DOI: 10.11772/j.issn.1001-9081.2021122221

• Artificial Intelligence •

Weakly-supervised text classification with label semantic enhancement

Chengyu LIN1,2, Lei WANG1, Cong XUE1

  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
    2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2022-01-06 Revised: 2022-03-22 Accepted: 2022-04-13 Online: 2023-02-08 Published: 2023-02-10
  • Contact: Cong XUE
  • About the authors: LIN Chengyu, born in 1997, M. S. candidate. His research interests include natural language processing.
    WANG Lei, born in 1985, Ph. D., senior engineer. His research interests include cryptographic engineering and applications, identity management and network trust, and intelligent big data processing.
  • Supported by:
    Key Program of the National Natural Science Foundation of China (U1636220)

Abstract:

To address the problems of category-vocabulary noise and label noise in weakly-supervised text classification, a weakly-supervised text classification model with label semantic enhancement was proposed. First, the category vocabulary was denoised on the basis of the contextual semantic representations of words, so as to construct a highly accurate category vocabulary. Then, a word-category prediction task based on the MASK mechanism was constructed to fine-tune the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, so that the relationships between words and categories were learned. Finally, a self-training module incorporating label semantics was used to make full use of all the data and reduce the influence of label noise, converting word-level semantics into sentence-level semantics and thereby accurately predicting the categories of text sequences. Experimental results show that, compared with LOTClass (Label-name-Only Text Classification), the state-of-the-art weakly-supervised text classification model, the proposed method improves classification accuracy by 5.29, 1.41 and 1.86 percentage points on the public datasets THUCNews, AG News and IMDB, respectively.
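
The masked word-category prediction step described above can be illustrated with a short sketch. This is a minimal illustration, not the authors' released code: it assumes the HuggingFace transformers library, the bert-base-uncased checkpoint, and two hypothetical categories with hand-picked label words; the paper's actual fine-tuning objective, category-vocabulary denoising, and label-semantic self-training module are not reproduced.

import torch
from transformers import BertForMaskedLM, BertTokenizer

# Hypothetical (already denoised) category vocabulary: label words per class.
CATEGORY_VOCAB = {
    "sports": ["sports", "game", "team"],
    "business": ["business", "market", "economy"],
}

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def word_category_scores(text: str) -> dict:
    """Mask each token in turn, read the MLM probability that BERT assigns
    to every category's label words at that position, then average the
    word-level evidence into a sentence-level score per category."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    scores = {c: 0.0 for c in CATEGORY_VOCAB}
    positions = range(1, input_ids.size(0) - 1)  # skip [CLS] and [SEP]
    for pos in positions:
        masked = input_ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        probs = logits.softmax(dim=-1)
        for cat, words in CATEGORY_VOCAB.items():
            ids = tokenizer.convert_tokens_to_ids(words)
            scores[cat] += probs[ids].sum().item()
    return {c: s / max(len(positions), 1) for c, s in scores.items()}

print(word_category_scores("the home team won the final game"))

In the full model, word-level category distributions of this kind would serve as weak supervision for fine-tuning, and the self-training module that injects label semantics would then convert such word-level evidence into sentence-level predictions while down-weighting noisy labels.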

Key words: weakly-supervised text classification, BERT (Bidirectional Encoder Representations from Transformers), MASK mechanism, label semantics, label noise, self-training

CLC number: