Journal of Computer Applications
Special Issue: Artificial Intelligence
Received: 2023-06-05
Revised: 2023-11-15
Accepted: 2023-12-05
Online: 2026-02-05
Published: 2024-06-10
YU Xinyan, ZENG Cheng, WANG Qian, HE Peng, DING Xiaoyu
Corresponding author: ZENG Cheng
YU Xinyan, ZENG Cheng, WANG Qian, HE Peng, DING Xiaoyu. Few-shot news topic classification method based on knowledge enhancement and prompt learning [J]. Journal of Computer Applications. DOI: 10.11772/j.issn.1001-9081.2023050709.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023050709
| Dataset | Label | Label word set |
|---|---|---|
| THUCNews | 房地产 | 房地产,房产,房地产业 |
| THUCNews | 金融 | 金银,金融业,金融市场 |
| THUCNews | 教育 | 高考,考生,高中,文综 |
| Toutiao | 电竞 | 网络游戏,竞技,玩家 |
| Toutiao | 农业 | 第一产业,农林牧副渔,农林 |
| Toutiao | 证券 | 出游,旅行,出行 |
| SHNews | 科技 | 高科技,高新技术,技术 |
| SHNews | 文化 | 文明,人文,文风 |
| SHNews | 旅游 | 出游,旅行,出行 |
Tab.1 Examples of label word sets for different datasets
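The label word sets in Tab.1 act as a knowledge-enhanced verbalizer: the masked language model's probabilities over every word in a label's set are aggregated at the [MASK] position, and the label with the highest aggregate score wins. A minimal sketch of that aggregation step, assuming THUCNews label words from Tab.1 (the `mask_probs` dictionary stands in for real MLM output and is purely illustrative):

```python
# Knowledge-enhanced verbalizer: each label maps to an expanded word set
# (THUCNews examples from Tab.1). A prediction is scored by averaging the
# MLM's [MASK]-position probabilities over each label's word set.
verbalizer = {
    "房地产": ["房地产", "房产", "房地产业"],
    "金融": ["金银", "金融业", "金融市场"],
    "教育": ["高考", "考生", "高中", "文综"],
}

def predict_label(mask_probs):
    """mask_probs: probability the MLM assigns each candidate word at [MASK]."""
    scores = {
        label: sum(mask_probs.get(w, 0.0) for w in words) / len(words)
        for label, words in verbalizer.items()
    }
    return max(scores, key=scores.get)

# Illustrative (made-up) MLM output for a finance headline:
mock_probs = {"金融市场": 0.30, "金融业": 0.20, "房产": 0.05, "高考": 0.01}
print(predict_label(mock_probs))  # → 金融
```

Averaging rather than summing keeps label sets of different sizes (3 vs. 4 words above) comparable; the paper's exact aggregation may differ.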
| Dataset | Training samples | Validation samples | Test samples | Label classes |
|---|---|---|---|---|
| THUCNews | 180 000 | 10 000 | 10 000 | 10 |
| Toutiao | 267 877 | 57 401 | 57 401 | 15 |
| SHNews | 22 699 | 5 764 | 5 755 | 12 |
Tab.2 Statistics of experiment datasets
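The 1/5/10/20-shot settings reported in Tab.4 draw k training examples per label class from the full training splits in Tab.2. A hedged sketch of such a per-class sampler (the `(text, label)` pair format and seed value are assumptions, not the paper's protocol):

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed=42):
    """Draw k examples per label from a list of (text, label) pairs."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)  # fixed seed for a reproducible few-shot split
    subset = []
    for label, items in sorted(by_label.items()):
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset

# Toy corpus with two classes; a 1-shot draw keeps one example per class.
corpus = [("股市大涨", "金融"), ("央行降息", "金融"), ("高考开考", "教育")]
print(len(sample_k_shot(corpus, k=1)))  # → 2
```

Sorting the labels before sampling makes the subset deterministic for a given seed, which matters when comparing models under identical few-shot splits.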
| Dataset | Template |
|---|---|
| THUCNews | 这是一条[MASK]新闻:x |
| THUCNews | [MASK]新闻:x |
| THUCNews | x是[MASK]新闻 |
| Toutiao | 这是一条[MASK]新闻:x |
| Toutiao | [MASK]新闻:x |
| Toutiao | 分类:[MASK]x |
| SHNews | 这是一条[MASK]新闻:x |
| SHNews | [MASK]新闻:x |
| SHNews | 主题:[MASK]x |
Tab.3 Chinese templates for experiment datasets
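Each template in Tab.3 wraps a news text x into a cloze sentence whose [MASK] slot the model fills with a label word. A minimal string-level illustration of the wrapping (tokenization and the soft-prompt tokens used by P-tuning are omitted; `{x}` marks the insertion point):

```python
# Templates from Tab.3; "{x}" marks where the news text x is inserted.
TEMPLATES = {
    "THUCNews": ["这是一条[MASK]新闻:{x}", "[MASK]新闻:{x}", "{x}是[MASK]新闻"],
    "Toutiao": ["这是一条[MASK]新闻:{x}", "[MASK]新闻:{x}", "分类:[MASK]{x}"],
    "SHNews": ["这是一条[MASK]新闻:{x}", "[MASK]新闻:{x}", "主题:[MASK]{x}"],
}

def wrap(dataset, text, template_id=0):
    """Render one cloze prompt for a given dataset and news text."""
    return TEMPLATES[dataset][template_id].format(x=text)

print(wrap("THUCNews", "央行宣布下调存款准备金率"))
# → 这是一条[MASK]新闻:央行宣布下调存款准备金率
```

Square brackets are inert in `str.format`, so `[MASK]` survives formatting verbatim; only the `{x}` placeholder is substituted.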
| k-shot | Model | THUCNews Acc | THUCNews Macro_F1 | Toutiao Acc | Toutiao Macro_F1 | SHNews Acc | SHNews Macro_F1 |
|---|---|---|---|---|---|---|---|
| 1 | FT | 48.27 | 45.90 | 28.86 | 28.71 | 28.78 | 26.79 |
| 1 | Soft-verb | 67.29 | 66.62 | 63.91 | 58.44 | 55.98 | 54.31 |
| 1 | Auto-verb | 34.38 | 31.98 | 37.11 | 31.27 | 30.36 | 27.50 |
| 1 | PET | 69.53 | 68.92 | 59.84 | 55.07 | 56.28 | 54.48 |
| 1 | Soft-prompt | 65.00 | 64.22 | 61.00 | 55.66 | 47.69 | 46.41 |
| 1 | KPL | 77.12 | 76.94 | 67.01 | 61.68 | 58.39 | 56.46 |
| 5 | FT | 78.72 | 78.67 | 67.94 | 68.08 | 57.72 | 58.28 |
| 5 | Soft-verb | 81.97 | 81.77 | 73.04 | 67.03 | 65.67 | 65.34 |
| 5 | Auto-verb | 76.47 | 75.52 | 68.58 | 62.84 | 58.78 | 58.45 |
| 5 | PET | 82.23 | 82.06 | 72.58 | 66.87 | 65.48 | 65.27 |
| 5 | Soft-prompt | 80.71 | 80.26 | 73.06 | 67.20 | 65.13 | 64.78 |
| 5 | KPL | 82.95 | 82.78 | 74.02 | 68.04 | 66.08 | 65.80 |
| 10 | FT | 81.53 | 81.43 | 72.16 | 72.74 | 64.29 | 64.44 |
| 10 | Soft-verb | 84.35 | 84.30 | 74.97 | 69.09 | 67.51 | 67.37 |
| 10 | Auto-verb | 80.72 | 79.72 | 73.84 | 67.68 | 64.98 | 64.46 |
| 10 | PET | 84.16 | 84.07 | 74.58 | 68.95 | 67.85 | 67.66 |
| 10 | Soft-prompt | 84.95 | 84.92 | 75.00 | 68.87 | 66.36 | 65.88 |
| 10 | KPL | 85.50 | 85.38 | 75.21 | 69.38 | 68.60 | 68.54 |
| 20 | FT | 83.88 | 83.89 | 74.88 | 75.15 | 66.36 | 66.53 |
| 20 | Soft-verb | 86.19 | 86.12 | 76.55 | 70.57 | 69.30 | 69.26 |
| 20 | Auto-verb | 81.78 | 79.95 | 76.50 | 70.19 | 68.15 | 67.81 |
| 20 | PET | 86.67 | 86.60 | 75.70 | 69.72 | 68.82 | 68.91 |
| 20 | Soft-prompt | 86.48 | 86.45 | 76.73 | 71.09 | 69.59 | 69.47 |
| 20 | KPL | 87.04 | 86.96 | 77.21 | 71.24 | 70.31 | 70.16 |
Tab.4 Experiment results of 1/5/10/20-shot text classification on different datasets
| k-shot | P-tuning | Know | Acc | Macro_F1 |
|---|---|---|---|---|
| 1 | × | × | 72.19 | 71.84 |
| 1 | √ | × | 73.56 | 73.07 |
| 1 | × | √ | 76.91 | 76.75 |
| 1 | √ | √ | 77.12 | 76.94 |
| 5 | × | × | 81.29 | 81.01 |
| 5 | √ | × | 82.36 | 82.33 |
| 5 | × | √ | 82.33 | 82.14 |
| 5 | √ | √ | 82.95 | 82.78 |
| 10 | × | × | 84.58 | 84.56 |
| 10 | √ | × | 85.61 | 85.53 |
| 10 | × | √ | 85.56 | 85.50 |
| 10 | √ | √ | 85.50 | 85.38 |
| 20 | × | × | 86.63 | 86.61 |
| 20 | √ | × | 86.70 | 86.65 |
| 20 | × | √ | 86.99 | 86.94 |
| 20 | √ | √ | 87.04 | 86.96 |
Tab.5 Ablation experiment results on THUCNews