Journal of Computer Applications, 2022, Vol. 42, Issue (6): 1789-1795. DOI: 10.11772/j.issn.1001-9081.2021091638

• The 18th CCF Conference on Web Information Systems and Applications •

Integrating posterior probability calibration training into a text classification algorithm

Jing JIANG1, Yu CHEN2, Jieping SUN1, Shenggen JU1

  1. College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
    2. College of Science and Technology, Sichuan Minzu College, Kangding, Sichuan 626001, China
  • Received: 2021-09-27 Revised: 2021-11-15 Accepted: 2021-11-17 Online: 2022-04-15 Published: 2022-06-10
  • Contact: Shenggen JU
  • About authors: JIANG Jing, born in 1996 in Chongqing, M.S. candidate. Her research interests include natural language processing and knowledge graphs.
    CHEN Yu, born in 1974 in Yilong, Sichuan, M.S., professor. His research interests include natural language processing and human-computer interaction.
    SUN Jieping, born in 1962 in Chengdu, Sichuan, M.S., associate professor. His research interests include intelligent information processing and intelligent education.
  • Supported by:
    National Natural Science Foundation of China (61972270); Key Research and Development Project in Sichuan Province (2019YFG0521)

Abstract:

Pre-trained language models used for text representation achieve high accuracy on various text classification tasks, but two problems remain. On the one hand, after computing the posterior probabilities over all categories, the model selects the category with the largest posterior probability as its final classification result; in many scenarios, however, the quality of the posterior probabilities themselves provides more reliable information than the final classification result. On the other hand, the classifier of a pre-trained language model degrades when assigning different labels to texts with similar semantics. To address these two problems, a model named PosCal-negative was proposed, combining posterior probability calibration with negative supervision. During training, the difference between the predicted probabilities and the empirical posterior probabilities was dynamically penalized in an end-to-end way, and texts with different labels were used to apply negative supervision to the encoder, so that different feature vector representations were generated for different categories. Experimental results show that the classification accuracies of the proposed model on two Chinese maternal and infant care text classification datasets, MATINF-C-AGE and MATINF-C-TOPIC, reach 91.55% and 69.19% respectively, which are 1.13 percentage points and 2.53 percentage points higher than those of the Enhanced Representation through kNowledge IntEgration (ERNIE) model.
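To make the training objective described above concrete, the following minimal PyTorch-style sketch combines a standard cross-entropy term with a calibration penalty and an in-batch negative-supervision penalty. This is an illustrative assumption, not the paper's released code: the names poscal_negative_loss, emp_post, lambda_cal, and lambda_neg are hypothetical, and the precomputed per-class target matrix emp_post is a simplified stand-in for however the empirical posterior probabilities are actually estimated during training.

```python
import torch
import torch.nn.functional as F

def poscal_negative_loss(logits, labels, features, emp_post,
                         lambda_cal=1.0, lambda_neg=1.0):
    # Hypothetical sketch of a PosCal-negative-style training loss.
    # logits  : (batch, num_classes) raw classifier scores
    # labels  : (batch,) gold class indices
    # features: (batch, hidden) encoder representations (e.g. [CLS] vectors)
    # emp_post: (num_classes, num_classes) assumed per-class empirical
    #           posterior targets; row i is the target distribution for class i

    # Standard cross-entropy classification term.
    ce = F.cross_entropy(logits, labels)

    # Calibration term: penalize the gap between the predicted
    # distribution and the empirical posterior target of the gold class.
    probs = F.softmax(logits, dim=-1)
    target = emp_post[labels]                        # (batch, num_classes)
    cal = ((probs - target) ** 2).sum(dim=-1).mean()

    # Negative supervision: push apart encoder representations of
    # in-batch examples that carry different labels.
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t()                          # pairwise cosine similarity
    diff_label = labels.unsqueeze(0) != labels.unsqueeze(1)
    if diff_label.any():
        neg = sim[diff_label].clamp(min=0).mean()
    else:
        neg = sim.new_zeros(())

    return ce + lambda_cal * cal + lambda_neg * neg
```

In this sketch the negative term discourages high cosine similarity between encoder representations of differently labeled texts, matching the abstract's goal of generating distinct feature vector representations for different categories; the calibration term penalizes the gap between predicted and empirical posterior probabilities end to end, as the abstract describes.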

Key words: text classification, posterior probability calibration, pre-trained language model, negative supervision, deep learning
