Journal of Computer Applications (《计算机应用》): sole official website


基于片段抽取原型网络的古籍文本断句标点提示学习方法

高颖杰1,林民2,斯日古楞1,李斌1,张树钧3   

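  1. Inner Mongolia Normal University
  2. College of Information Engineering and Computer Science, Inner Mongolia Normal University
  3. College of Computer and Information Engineering, Inner Mongolia Normal University, Saihan District, Hohhot, Inner Mongolia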
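  • Received: 2023-12-12 Revised: 2024-02-15 Online: 2024-03-11 Published: 2024-03-11
  • Corresponding author: 林民
  • Supported by:
    National Natural Science Foundation of China; Natural Science Foundation of Inner Mongolia; Science and Technology Program of Inner Mongolia Autonomous Region; Open Project of a Key Laboratory of the Ministry of Education at the Inner Mongolia Autonomous Region level; Scientific Research Innovation Project for Master's Students of Inner Mongolia Autonomous Region; Fundamental Research Funds of Inner Mongolia Normal University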

Prompt learning method for ancient text sentence segmentation and punctuation based on span-extracted prototypical network

  • Received: 2023-12-12 Revised: 2024-02-15 Online: 2024-03-11 Published: 2024-03-11

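Abstract: Automatic sentence segmentation and punctuation in ancient text information processing depends on large-scale annotated corpora, while high-quality, large-scale training samples are expensive and hard to obtain. To address this, a prompt learning method for sentence segmentation and punctuation of ancient texts based on a span-extraction prototypical network is proposed. First, structured prompt information is added to the support set to form an effective prompt template and improve the model's learning efficiency. Second, a punctuation position extractor is combined with a prototypical network classifier, which effectively reduces the impact of misjudgments and the interference of non-punctuation labels found in traditional sequence labeling methods. Experimental results show that, compared with the BERT-BiGRU-CRF method, the F1 score on the Shiji (《史记》) dataset improves by 2.5 percentage points. In addition, on the public multi-domain ancient text dataset CCLUE, the precision and F1 score of the method reach 91.60% and 93.12% respectively, which indicates that the method can segment and punctuate multi-domain ancient texts effectively using only a small number of training samples. The study therefore offers a new approach to further research on automatic sentence segmentation and punctuation of multi-domain ancient texts and to improving model learning efficiency.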

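Keywords: intelligent collation of ancient books, span-extraction prototypical network, prompt learning, automatic sentence segmentation and punctuation, deep learning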

Abstract: Automatic sentence segmentation and punctuation tasks in ancient book information processing rely on large-scale annotated corpora, yet high-quality, large-scale training samples are expensive and difficult to obtain. To address this, a prompt learning method for ancient text sentence segmentation and punctuation based on a span-extracted prototypical network was proposed. Firstly, structured prompt information was incorporated into the support set, forming an effective prompt template to improve the model's learning efficiency. Then, a punctuation position extractor was combined with a prototypical network classifier, which effectively reduced the impact of misjudgments and the interference from non-punctuation labels in traditional sequence labeling methods. Experimental results show that on the "Shiji" dataset, the F1 score of the proposed method is 2.5 percentage points higher than that of the BERT-BiGRU-CRF (Bidirectional Encoder Representations from Transformers-Bidirectional Gated Recurrent Unit-Conditional Random Field) method. In addition, on the public multi-domain ancient text dataset CCLUE, the precision and F1 score of the method reach 91.60% and 93.12% respectively. This indicates that the method can effectively and automatically segment and punctuate multi-domain ancient texts using a small number of training samples. Therefore, the proposed method offers a new approach to in-depth research on automatic sentence segmentation and punctuation of multi-domain ancient texts, as well as to improving the model's learning efficiency.

Key words: intelligent collation of ancient books, span-extracted prototypical network, prompt learning, automatic sentence segmentation and punctuation, deep learning
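
The abstract names a prototypical network classifier but gives no implementation details. As a rough illustration of the general technique only, and not the authors' model, the sketch below averages the support-set embeddings of each punctuation label into a prototype and assigns a query position to the label of the nearest prototype. The function names `build_prototypes` and `classify`, the two-dimensional embeddings, and the choice of Euclidean distance are all assumptions made for this example; in the paper's setting the embeddings would come from an encoder, and the candidate positions from the punctuation position extractor.

```python
import math
from collections import defaultdict


def build_prototypes(support):
    """Average the embeddings of each label in the support set.

    support: list of (embedding, label) pairs, where embedding is a list of floats.
    Returns a dict mapping each label to its mean embedding (the prototype).
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in support:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Made-up 2-d embeddings for already-extracted punctuation positions (illustrative only).
support = [
    ([0.9, 0.1], ","),
    ([1.1, -0.1], ","),
    ([0.0, 1.0], "。"),
    ([0.2, 0.8], "。"),
]
prototypes = build_prototypes(support)
predicted = classify([0.1, 0.7], prototypes)
print(predicted)
```

Because only extracted positions are classified in this sketch, no "non-punctuation" label is needed among the prototypes, which is one way to read the abstract's remark about reducing interference from non-punctuation labels.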
