Weakly-supervised text classification with label semantic enhancement

Journal of Computer Applications, 2023, Vol. 43, Issue 2: 335-342. DOI: 10.11772/j.issn.1001-9081.2021122221

Special Issue: Artificial intelligence

Chengyu LIN¹,², Lei WANG¹, Cong XUE¹

1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2022-01-06; Revised: 2022-03-22; Accepted: 2022-04-13; Online: 2023-02-08; Published: 2023-02-10
Contact: Cong XUE
About the authors: LIN Chengyu, born in 1997, from Ningbo, Zhejiang, M.S. candidate. His research interests include natural language processing.
WANG Lei, born in 1985, from Baotou, Inner Mongolia, Ph.D., senior engineer. His research interests include cryptographic engineering and applications, identity management and network trust, and intelligent big data processing.
Supported by: National Natural Science Foundation of China, Key Program (U1636220)

Abstract

To address category vocabulary noise and label noise in weakly-supervised text classification, a weakly-supervised text classification model with label semantic enhancement is proposed. First, the category vocabulary is denoised using the contextual semantic representations of its words, yielding a highly accurate category vocabulary. Then, a word category prediction task based on the MASK mechanism is constructed to fine-tune the pre-trained model BERT (Bidirectional Encoder Representations from Transformers), so that the model learns the relationship between words and categories. Finally, a self-training module that incorporates label semantics makes full use of all the data and reduces the impact of label noise, converting word-level semantics into sentence-level semantics and thereby accurately predicting the categories of text sequences. Experimental results show that, compared with the state-of-the-art weakly-supervised text classification model LOTClass (Label-name-Only Text Classification), the proposed method improves classification accuracy by 5.29, 1.41 and 1.86 percentage points on the public datasets THUCNews, AG News and IMDB, respectively.

Key words: weakly-supervised text classification, BERT (Bidirectional Encoder Representations from Transformers), MASK mechanism, label semantics, label noise, self-training
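
The first step in the abstract, denoising the category vocabulary from contextual word representations, can be pictured with a minimal sketch. The abstract does not state the paper's actual filtering criterion, so everything here is an assumption made for illustration: the function names `cosine` and `denoise_vocabulary`, the `margin` parameter, the rule of keeping a word only when it is closer to its own category than to any other, and the toy 3-dimensional vectors that stand in for averaged BERT contextual embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def denoise_vocabulary(candidates, category_vectors, own_category, margin=0.0):
    """Keep candidate words that are more similar to their own category
    than to every other category by more than `margin`.

    This rule is a hypothetical stand-in for the paper's denoising step.
    """
    kept = []
    for word, vec in candidates.items():
        sims = {c: cosine(vec, cv) for c, cv in category_vectors.items()}
        best_other = max(s for c, s in sims.items() if c != own_category)
        if sims[own_category] - best_other > margin:
            kept.append(word)
    return kept


# Toy vectors standing in for averaged contextual embeddings (assumed values).
category_vectors = {
    "technology": [1.0, 0.1, 0.0],
    "finance": [0.0, 1.0, 0.1],
}
candidates = {
    "aerospace": [0.9, 0.2, 0.0],
    "stock": [0.1, 0.9, 0.2],
    "innovation": [0.8, 0.3, 0.1],
}

kept = denoise_vocabulary(candidates, category_vectors, "technology")
print(kept)
```

In this toy setup the word "stock" is dropped from the "technology" vocabulary because its vector lies closer to "finance". A real pipeline would obtain the vectors from a pre-trained encoder rather than hard-coding them.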

CLC Number: TP391.1

Chengyu LIN, Lei WANG, Cong XUE. Weakly-supervised text classification with label semantic enhancement[J]. Journal of Computer Applications, 2023, 43(2): 335-342.

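Example	Noise type
Sentence 1: China's aerospace science and technology has achieved a major breakthrough. (我国航天科技取得重大突破。)	No label noise
Sentence 2: Jinko Technology's historic sell-off traps numerous retail investors. (晶科科技历史性抛盘套牢众多散户。)	Misidentification noise
Sentence 3: Press conference on the Shenzhou-10 crewed flight mission. (神舟十号载人飞行任务新闻发布会。)	Unidentified noise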

Dataset	Total samples	Training set	Test set	Number of classes	Average words per sample
THUCNews	200 000	170 000	30 000	10	362
AG News	127 600	120 000	7 600	4	223
IMDB	50 000	25 000	25 000	2	292

Model	THUCNews	AG News	IMDB
TextCNN	91.22	87.26	86.73
BiLSTM	91.12	82.58	87.56
BERT	94.83	92.27	93.87
UDA	88.54	86.90	88.61
LOTClass	55.53	86.44	86.62
BERT w. simple match	48.46	75.21	68.74
LSETClass	60.82	87.85	88.48
LSETClass-LE	58.06	87.08	87.53
LSETClass-WD	58.27	87.22	87.36
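
As a quick consistency check, the improvements quoted in the abstract can be recomputed from the LOTClass and LSETClass rows of the table above. This short sketch assumes that the table values are classification accuracies in percent and that LSETClass is the proposed model. The page does not state either explicitly, but the differences do match the abstract's figures.

```python
# Accuracy values copied from the LOTClass and LSETClass rows of the table.
accuracy = {
    "LOTClass": {"THUCNews": 55.53, "AG News": 86.44, "IMDB": 86.62},
    "LSETClass": {"THUCNews": 60.82, "AG News": 87.85, "IMDB": 88.48},
}

# Gain in percentage points per dataset, rounded to two decimals
# to suppress floating-point noise.
gains = {
    dataset: round(accuracy["LSETClass"][dataset] - accuracy["LOTClass"][dataset], 2)
    for dataset in accuracy["LOTClass"]
}

for dataset, gain in gains.items():
    print(f"{dataset}: +{gain:.2f} percentage points")
```

The differences are 5.29, 1.41 and 1.86 percentage points for THUCNews, AG News and IMDB, in agreement with the abstract.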

