Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3815-3822. DOI: 10.11772/j.issn.1001-9081.2023121719

• Artificial intelligence

Prompt learning method for ancient text sentence segmentation and punctuation based on span-extracted prototypical network

Yingjie GAO1, Min LIN1, Siriguleng2,3, Bin LI1, Shujun ZHANG1,2

    1.College of Computer Science and Technology,Inner Mongolia Normal University,Hohhot Inner Mongolia 010022,China
    2.School of Chinese Language and Literature,Inner Mongolia Normal University,Hohhot Inner Mongolia 010022,China
    3.College of Computer Science and Technology,Inner Mongolia Minzu University,Tongliao Inner Mongolia 028000,China
  • Received:2023-12-15 Revised:2024-02-15 Accepted:2024-02-26 Online:2024-03-11 Published:2024-12-10
  • Contact: Min LIN
  • About author:GAO Yingjie, born in 1999, M. S. candidate. Her research interests include natural language processing.
    Siriguleng, born in 1991, Ph. D. candidate. Her research interests include computational linguistics, natural language processing.
    LI Bin, born in 1998, M. S. candidate. His research interests include entity relation extraction.
    ZHANG Shujun, born in 1979, Ph. D. candidate, senior experimentalist. His research interests include natural language processing, applied linguistics.
  • Supported by:
    National Natural Science Foundation of China(62266033);Natural Science Foundation of Inner Mongolia(2021LHMS06010);Science and Technology Program of Inner Mongolia Autonomous Region(2021GG0218);Open Project of Key Laboratory of Inner Mongolia Autonomous Region, Ministry of Education(2023KFZD03);Graduate Student Scientific Research Innovation Project in Inner Mongolia Autonomous Region(S20231076Z);Fundamental Research Fund for Inner Mongolia Normal University(2022JBXC018)

基于片段抽取原型网络的古籍文本断句标点提示学习方法

高颖杰1, 林民1, 斯日古楞2,3, 李斌1, 张树钧1,2

    1.内蒙古师范大学 计算机科学技术学院,呼和浩特 010022
    2.内蒙古师范大学 文学院,呼和浩特 010022
    3.内蒙古民族大学 计算机科学与技术学院,内蒙古 通辽 028000
  • 通讯作者: 林民
  • 作者简介:高颖杰(1999—),女,内蒙古锡林郭勒人,硕士研究生,主要研究方向:自然语言处理
    斯日古楞(1991—),女(蒙古族),内蒙古通辽人,博士研究生,主要研究方向:计算语言学、自然语言处理
    李斌(1998—),男,内蒙古乌兰察布人,硕士研究生,主要研究方向:实体关系抽取
    张树钧(1979—),男,内蒙古呼和浩特人,高级实验师,博士研究生,主要研究方向:自然语言处理、语言文字应用。
  • 基金资助:
    国家自然科学基金资助项目(62266033);内蒙古自然科学基金资助项目(2021LHMS06010);内蒙古自治区科技计划项目(2021GG0218);内蒙古自治区级教育部重点实验室开放课题(2023KFZD03);内蒙古自治区硕士研究生科研创新项目(S20231076Z);内蒙古师范大学基本科研业务费专项(2022JBXC018)

Abstract:

Automatic sentence segmentation and punctuation in ancient book information processing relies on large-scale annotated corpora, yet high-quality, large-scale training samples are expensive and difficult to obtain. To address this, a prompt learning method for ancient text sentence segmentation and punctuation based on a span-extracted prototypical network was proposed. Firstly, structured prompt information was incorporated into the support set to form an effective prompt template, improving the model's learning efficiency. Then, a punctuation position extractor was combined with a prototypical network classifier to reduce the impact of misjudgments and the interference from non-punctuation labels found in traditional sequence labeling methods. Experimental results show that on the Records of the Grand Historian dataset, the F1 score of the proposed method is 2.47 percentage points higher than that of the Siku-BERT-BiGRU-CRF (Siku - Bidirectional Encoder Representations from Transformers - Bidirectional Gated Recurrent Unit - Conditional Random Field) method. In addition, on the public multi-domain ancient text dataset CCLUE, the precision and F1 score of the proposed method reach 91.60% and 93.12% respectively, indicating that the method can segment and punctuate multi-domain ancient text effectively and automatically using a small number of training samples. The proposed method therefore offers a new approach for further research on automatic sentence segmentation and punctuation of multi-domain ancient text and for improving model learning efficiency.

Key words: intelligent collation of ancient books, span-extracted prototypical network, prompt learning, automatic sentence segmentation and punctuation, deep learning
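
The abstract does not describe how the prototypical network classifier is built, so the following is only a minimal sketch of the generic prototypical-network idea: each label's prototype is the mean embedding of its support examples, and a query is assigned to the nearest prototype. The function names, the 2-D toy embeddings, and the `COMMA`/`PERIOD` labels are illustrative assumptions and are not taken from the paper; in the proposed method the queries would be positions already selected by the punctuation position extractor.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(support):
    """Map each label to the mean embedding of its support examples."""
    grouped = {}
    for embedding, label in support:
        grouped.setdefault(label, []).append(embedding)
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Toy support set: (embedding, punctuation label). The values are made up
# for illustration; a real system would take embeddings from an encoder.
support = [
    ([0.9, 0.1], "COMMA"),
    ([1.1, -0.1], "COMMA"),
    ([-1.0, 1.0], "PERIOD"),
    ([-0.8, 1.2], "PERIOD"),
]

prototypes = build_prototypes(support)

# Hypothetical embedding of one candidate punctuation position.
predicted = classify([0.7, 0.2], prototypes)
print(predicted)
```

Because only candidate positions are classified, non-punctuation labels never compete with punctuation labels at this stage, which is one plausible reading of how the method reduces their interference.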

摘要:

针对古籍信息处理中自动断句及标点任务依赖大规模标注语料的现象,在考虑高质量、大规模样本的训练成本昂贵且难以获取的背景下,提出一种基于片段抽取原型网络的古籍文本断句标点提示学习方法。首先,通过对支持集加入结构化提示信息形成有效的提示模板,从而提高模型的学习效率;其次,结合标点位置提取器和原型网络分类器,有效减少传统序列标注方法中的误判影响及非标点标签的干扰。实验结果表明,与Siku-BERT-BiGRU-CRF(Siku-Bidirectional Encoder Representation from Transformer-Bidirectional Gated Recurrent Unit-Conditional Random Field)方法相比,在《史记》数据集上所提方法的F1值提升了2.47个百分点。此外,在公开的多领域古籍数据集CCLUE上,所提方法的精确率和F1值分别达到了91.60%和93.12%,说明所提方法利用少量训练样本就能对多领域古籍进行有效的自动断句标点。因此,所提方法为多领域古籍文本的自动断句及标点任务的深入研究以及提高模型的学习效率提供了新的思路和方法。

关键词: 古籍智能整理, 片段抽取原型网络, 提示学习, 自动断句标点, 深度学习

CLC Number: