Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 335-342.DOI: 10.11772/j.issn.1001-9081.2021122221
Special Issue: 人工智能
• Artificial intelligence • Previous Articles Next Articles
					
						                                                                                                                                                                                                                    Chengyu LIN1,2, Lei WANG1, Cong XUE1(
)
												  
						
						
						
					
				
Received:2022-01-06
															
							
																	Revised:2022-03-22
															
							
																	Accepted:2022-04-13
															
							
							
																	Online:2023-02-08
															
							
																	Published:2023-02-10
															
							
						Contact:
								Cong XUE   
													About author:LIN Chengyu, born in 1997, M. S. candidate. His research interests include natural language processing.Supported by:通讯作者:
					薛聪
							作者简介:林呈宇(1997—),男,浙江宁波人,硕士研究生,主要研究方向:自然语言处理基金资助:CLC Number:
Chengyu LIN, Lei WANG, Cong XUE. Weakly-supervised text classification with label semantic enhancement[J]. Journal of Computer Applications, 2023, 43(2): 335-342.
林呈宇, 王雷, 薛聪. 标签语义增强的弱监督文本分类模型[J]. 《计算机应用》唯一官方网站, 2023, 43(2): 335-342.
Add to citation manager EndNote|Ris|BibTeX
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021122221
| 示例 | 噪声类型 | 
|---|---|
| 句子1:我国航天科技取得重大突破。 | 无标签噪声 | 
| 句子2:晶科科技历史性抛盘套牢众多散户。 | 误识别噪声 | 
| 句子3:神舟十号载人飞行任务新闻发布会。 | 未识别噪声 | 
Tab.1 Noise instances in weakly-supervised text classification
| 示例 | 噪声类型 | 
|---|---|
| 句子1:我国航天科技取得重大突破。 | 无标签噪声 | 
| 句子2:晶科科技历史性抛盘套牢众多散户。 | 误识别噪声 | 
| 句子3:神舟十号载人飞行任务新闻发布会。 | 未识别噪声 | 
| 数据集 | 数据总量 | 样本数 | 类别数 | 平均每条数据词数 | |
|---|---|---|---|---|---|
| 训练集 | 测试集 | ||||
| THUCNews | 200 000 | 170 000 | 30 000 | 10 | 362 | 
| AG News | 127 600 | 120 000 | 7 600 | 4 | 223 | 
| IMDB | 50 000 | 25 000 | 25 000 | 2 | 292 | 
Tab.2 Dataset introduction
| 数据集 | 数据总量 | 样本数 | 类别数 | 平均每条数据词数 | |
|---|---|---|---|---|---|
| 训练集 | 测试集 | ||||
| THUCNews | 200 000 | 170 000 | 30 000 | 10 | 362 | 
| AG News | 127 600 | 120 000 | 7 600 | 4 | 223 | 
| IMDB | 50 000 | 25 000 | 25 000 | 2 | 292 | 
| 模型方法 | 数据集 | ||
|---|---|---|---|
| THUCNews | AG News | IMDB | |
| TextCNN | 91.22 | 87.26 | 86.73 | 
| BiLSTM | 91.12 | 82.58 | 87.56 | 
| BERT | 94.83 | 92.27 | 93.87 | 
| UDA | 88.54 | 86.90 | 88.61 | 
| LOTClass | 55.53 | 86.44 | 86.62 | 
| BERT w.simple match | 48.46 | 75.21 | 68.74 | 
| LSETClass | 60.82 | 87.85 | 88.48 | 
| LSETClass-LE | 58.06 | 87.08 | 87.53 | 
| LSETClass-WD | 58.27 | 87.22 | 87.36 | 
Tab.3 Accuracy comparison of experimental results on different text datasets
| 模型方法 | 数据集 | ||
|---|---|---|---|
| THUCNews | AG News | IMDB | |
| TextCNN | 91.22 | 87.26 | 86.73 | 
| BiLSTM | 91.12 | 82.58 | 87.56 | 
| BERT | 94.83 | 92.27 | 93.87 | 
| UDA | 88.54 | 86.90 | 88.61 | 
| LOTClass | 55.53 | 86.44 | 86.62 | 
| BERT w.simple match | 48.46 | 75.21 | 68.74 | 
| LSETClass | 60.82 | 87.85 | 88.48 | 
| LSETClass-LE | 58.06 | 87.08 | 87.53 | 
| LSETClass-WD | 58.27 | 87.22 | 87.36 | 
| 类别名 | LOTClass | LSETClass | 
|---|---|---|
| 体育 | 体育, 体操,运动,体检,文体,体能,体质,体制,团体, 实体,群体,体力,体格,足球,体重,总体,整体,身体, 乒乓球,体裁,全体,个体,肉体,形体,体系,人体,立体, 篮球,sports,sport,本体,羽毛球,母体,体表,奥运,大体, 机体,肢体,载体,体外,体温,体型,运动会,运动员,体面,环球,物体,奥林匹克,体液,集体,一体,一体化,具体  | |
| 房产 | 房产,房地产,地产,房屋,住房,产业,购房,房子,新房,商品房,房价, 楼房,产权,产区,产地,书房,房间,产出,投产,买房,特产,产能,国产,矿产,药房,房门,客房,家产,高产,牢房,海产,厨房,产业园,房舍, 水产,原产,出产,产生,产物,评论,房县,房中,物产,house,论坛, 再生产,产值,破产,厂房,产量,生产,遗产,认可,停产,住宅,第三产业,知识产权,固定资产,信息网,子房,生产力,小区,乳房,产业资本,年产,病房,产妇,商品生产,盛产,产于,生产者,上房,第二产业,增产,住所,农产品,土特产,产业化,年产量,房基,物业,总产量,生产资料,生产量  | 房产,房地产,地产,住房,房屋, 房子,新房,商品房, 房价,买房,楼房,购房,住宅, 产权,书房,房间,产出, 投产,房中,房门,客房,家产, 房舍, 牢房,厨房,物业, 原产,出产,产物,房县,物产,house,再生产,产值,矿产, 破产,产量,药房,生产,遗产,停产,厂房,年产量, 知识产权,固定资产,信息网,小区,产业资本,年产, 商品生产,上房,增产,住所,产业化,房基,生产力  | 
| 政治 | 政治,政治学,政治经济, 政治经济学, 宪政, 治国, 政治家,内政,军政,党政, 政治局, 国政,从政,政体, 为政,政局,政法,政府, 中国政府, 市政,行政,施政, 财政,政事,政权,政党, 政策, 美国政府,政务,执政, 民政,政制,政客,参政,邮政, 政协,治理,政工,政变, 朝政,党政军,专政,法治, 朝政, 政制,为政,政工, 专政,政局,整治,国政, 治安, 民政,市政,参政,政权, 政区,执政党,政绩,政事,廉政,中央政治局,政治委员, 政客,摄政,综合治理,主治,议政,政协,选举  | 
Tab.4 Comparison results of category vocabularies on THUCNews dataset
| 类别名 | LOTClass | LSETClass | 
|---|---|---|
| 体育 | 体育, 体操,运动,体检,文体,体能,体质,体制,团体, 实体,群体,体力,体格,足球,体重,总体,整体,身体, 乒乓球,体裁,全体,个体,肉体,形体,体系,人体,立体, 篮球,sports,sport,本体,羽毛球,母体,体表,奥运,大体, 机体,肢体,载体,体外,体温,体型,运动会,运动员,体面,环球,物体,奥林匹克,体液,集体,一体,一体化,具体  | |
| 房产 | 房产,房地产,地产,房屋,住房,产业,购房,房子,新房,商品房,房价, 楼房,产权,产区,产地,书房,房间,产出,投产,买房,特产,产能,国产,矿产,药房,房门,客房,家产,高产,牢房,海产,厨房,产业园,房舍, 水产,原产,出产,产生,产物,评论,房县,房中,物产,house,论坛, 再生产,产值,破产,厂房,产量,生产,遗产,认可,停产,住宅,第三产业,知识产权,固定资产,信息网,子房,生产力,小区,乳房,产业资本,年产,病房,产妇,商品生产,盛产,产于,生产者,上房,第二产业,增产,住所,农产品,土特产,产业化,年产量,房基,物业,总产量,生产资料,生产量  | 房产,房地产,地产,住房,房屋, 房子,新房,商品房, 房价,买房,楼房,购房,住宅, 产权,书房,房间,产出, 投产,房中,房门,客房,家产, 房舍, 牢房,厨房,物业, 原产,出产,产物,房县,物产,house,再生产,产值,矿产, 破产,产量,药房,生产,遗产,停产,厂房,年产量, 知识产权,固定资产,信息网,小区,产业资本,年产, 商品生产,上房,增产,住所,产业化,房基,生产力  | 
| 政治 | 政治,政治学,政治经济, 政治经济学, 宪政, 治国, 政治家,内政,军政,党政, 政治局, 国政,从政,政体, 为政,政局,政法,政府, 中国政府, 市政,行政,施政, 财政,政事,政权,政党, 政策, 美国政府,政务,执政, 民政,政制,政客,参政,邮政, 政协,治理,政工,政变, 朝政,党政军,专政,法治, 朝政, 政制,为政,政工, 专政,政局,整治,国政, 治安, 民政,市政,参政,政权, 政区,执政党,政绩,政事,廉政,中央政治局,政治委员, 政客,摄政,综合治理,主治,议政,政协,选举  | 
| 1 | YU Y, ZUO S M, JIANG H M, et al. Fine-tuning pre-trained language model with weak supervision: a contrastive-regularized self-training approach[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 1063-1077. 10.18653/v1/2021.naacl-main.84 | 
| 2 | MEKALA D, SHANG J B. Contextualized weak supervision for text classification[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 323-333. 10.18653/v1/2020.acl-main.30 | 
| 3 | MENG Y, SHEN J M, ZHANG C, et al. Weakly-supervised neural text classification[C]// Proceedings of the 27th ACM International Conference on Information and Knowledge Management. New York: ACM, 2018: 983-992. 10.1145/3269206.3271737 | 
| 4 | WANG Z H, MEKALA D, SHANG J B. X-Class: text classification with extremely weak supervision[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 3043-3053. 10.18653/v1/2021.naacl-main.242 | 
| 5 | AWASTHI A, GHOSH S, GOYAL R, et al. Learning from rules generalizing labeled exemplars[EB/OL]. (2020-05-15) [2021-11-07].. | 
| 6 | SHEN T, GENG X B, LONG G D, et al. Effective search of logical forms for weakly supervised knowledge-based question answering[C]// Proceedings of the 29th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2020: 2227-2233. 10.24963/ijcai.2020/308 | 
| 7 | TAN B W, QIN L H, XING E P, et al. Summarizing text on any aspects: a knowledge-informed weakly-supervised approach[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 6301-6309. 10.18653/v1/2020.emnlp-main.510 | 
| 8 | LI C L, XING J, SUN A X, et al. Effective document labeling with very few seed words: a topic model approach[C]// Proceedings of the 25th ACM International Conference on Information and Knowledge Management. New York: ACM, 2016: 85-94. 10.1145/2983323.2983721 | 
| 9 | MENG Y, SHEN J M, ZHANG C, et al. Weakly-supervised hierarchical text classification[C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2019:6826-6833. 10.1609/aaai.v33i01.33016826 | 
| 10 | KARAMANOLAKIS G, MUKHERJEE S, ZHENG G Q, et al. Self-training with weak supervision[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 845-863. 10.18653/v1/2021.naacl-main.66 | 
| 11 | REN W D, LI Y H, SU H T, et al. Denoising multi-source weak supervision for neural text classification[C]// Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg, PA: ACL, 2020: 3739-3754. 10.18653/v1/2020.findings-emnlp.334 | 
| 12 | MENG Y, ZHANG Y Y, HUANG J X, et al. Text classification using label names only: a language model self-training approach[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 9006-9017. 10.18653/v1/2020.emnlp-main.724 | 
| 13 | JINDAL I, PRESSEL D, LESTER B, et al. An effective label noise model for DNN text classification[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: ACL, 2019: 3246-3256. 10.18653/v1/n19-1328 | 
| 14 | POWERS D M W. Applications and explanations of Zipf’s law[C]// Proceedings of the 1998 Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning. Somerset, NJ: ACL, 1998: 151-160. 10.3115/1603899.1603924 | 
| 15 | GABRILOVICH E, MARKOVITCH S. Computing semantic relatedness using Wikipedia-based explicit semantic analysis[C]// Proceedings of the 20th International Joint Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2007: 1606-1611. | 
| 16 | CHEN X Y, XIA Y Q, JIN P, et al. Dataless text classification with descriptive LDA[C]// Proceedings of the 29th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2015: 2224-2231. 10.1609/aaai.v29i1.9506 | 
| 17 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: ACL, 2019: 4171-4186. 10.18653/v1/n18-2 | 
| 18 | YANG Z L, DAI Z H, YANG Y M, et al. XLNet: generalized autoregressive pretraining for language understanding[C/OL]// Proceedings of the 33rd Conference on Neural Information Processing Systems. [2021-11-07].. | 
| 19 | ZHANG L, DING J D, XU Y, et al. Weakly-supervised text classification based on keyword graph[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 2803-2813. 10.18653/v1/2021.emnlp-main.222 | 
| 20 | JIN Y P, BHATIA A, WANVARIE D. Seed word selection for weakly-supervised text classification with unsupervised error estimation[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. Stroudsburg, PA: ACL, 2021: 112-118. 10.18653/v1/2021.naacl-srw.14 | 
| 21 | XIAO H R, LIU X, SONG Y Q. Efficient path prediction for semi-supervised and weakly supervised hierarchical text classification[C]// Proceedings of the 2019 World Wide Web Conference. New York: ACM, 2019: 3370-3376. 10.1145/3308558.3313658 | 
| 22 | LEE D H. Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks[C/OL]// Proceedings of the ICML 2013 Workshop on Challenges in Representation Learning. [2021-11-07].. | 
| 23 | XIE J Y, GIRSHICK R, FARHADI A. Unsupervised deep embedding for clustering analysis[C]// Proceedings of the 33rd International Conference on Machine Learning. New York: JMLR.org, 2016: 478-487. | 
| 24 | GUO B Y, HAN S Q, HAN X, et al. Label confusion learning to enhance text classification models[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2021: 12929-12936. 10.1609/aaai.v35i14.17529 | 
| 25 | KIM Y. Convolutional neural networks for sentence classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2014: 1746-1751. 10.3115/v1/d14-1181 | 
| 26 | LIU P F, QIU X P, HUANG X J. Recurrent neural network for text classification with multi-task learning[C]// Proceedings of the 25th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2016: 2873-2879. 10.24963/ijcai.2017/473 | 
| 27 | XIE Q Z, DAI Z H, HOVY E, et al. Unsupervised data augmentation for consistency training[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2020: 6256-6268. | 
| 28 | EDUNOV S, OTT M, AULI M, et al. Understanding back-translation at scale[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2018: 489-500. 10.18653/v1/d18-1045 | 
| 29 | SU J L. WoBERT: Word-based Chinese BERT model — ZhuiyiAI[EB/OL]. [2021-11-07].. 10.1145/3468920.3468936 | 
| 30 | WEI J, ZOU K. EDA: easy data augmentation techniques for boosting performance on text classification tasks[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA: ACL, 2019: 6382-6388. 10.18653/v1/d19-1670 | 
| [1] | Yifei SONG, Yi LIU. Fast adversarial training method based on data augmentation and label noise [J]. Journal of Computer Applications, 2024, 44(12): 3798-3807. | 
| [2] | Yuxin TUO, Tao XUE. Joint triple extraction model combining pointer network and relational embedding [J]. Journal of Computer Applications, 2023, 43(7): 2116-2124. | 
| [3] | Boyi FU, Yuncong PENG, Xin LAN, Xiaolin QIN. Survey of label noise learning algorithms based on deep learning [J]. Journal of Computer Applications, 2023, 43(3): 674-684. | 
| [4] | Qihong SONG, Jianxun LIU, Haize HU, Xiangping ZHANG. Code search model based on collaborative fusion network [J]. Journal of Computer Applications, 2023, 43(12): 3896-3902. | 
| [5] | Yuhang LI, Yuli YANG, Yao MA, Dan YU, Yongle CHEN. Text adversarial example generation method based on BERT model [J]. Journal of Computer Applications, 2023, 43(10): 3093-3098. | 
| [6] | Rui XIAO, Mingyi LIU, Zhiying TU, Zhongjie WANG. Personal event detection method based on text mining in social media [J]. Journal of Computer Applications, 2022, 42(11): 3513-3519. | 
| [7] | Zenghui ZHANG, Gaoxia JIANG, Wenjian WANG. Label noise filtering method based on dynamic probability sampling [J]. Journal of Computer Applications, 2021, 41(12): 3485-3491. | 
| [8] | LUO Jun, CHEN Lifei. Sentiment classification of incomplete data based on bidirectional encoder representations from transformers [J]. Journal of Computer Applications, 2021, 41(1): 139-144. | 
| [9] | ZHANG Zenghui, JIANG Gaoxia, WANG Wenjian. Label noise filtering method based on local probability sampling [J]. Journal of Computer Applications, 2021, 41(1): 67-73. | 
| [10] | LI Tingting, LYU Jia, FAN Weiya. Semi-supervised self-training positive and unlabeled learning based on new spy technology [J]. Journal of Computer Applications, 2019, 39(10): 2822-2828. | 
| [11] | LYU Jia, LI Junnan. Self-training method based on semi-supervised clustering and data editing [J]. Journal of Computer Applications, 2018, 38(1): 110-115. | 
| [12] | Zhi-sheng LI Yue-heng SUN Pi-lian HE Yue-xian HOU. Chinese question answering pattern learning based on self-training mechanism and Web [J]. Journal of Computer Applications, 2008, 28(6): 1575-1577. | 
| Viewed | ||||||
| 
										Full text | 
									
										 | 
								|||||
| 
										Abstract | 
									
										 | 
								|||||