Journal of Computer Applications (《计算机应用》), 2025, Vol. 45, Issue (4): 1199-1204. DOI: 10.11772/j.issn.1001-9081.2024040442

• Artificial Intelligence •

基于预训练模型标记器重构的藏文分词系统

杨杰1,2, 尼玛扎西1,2, 仁青东主1,2, 祁晋东1, 才让东知1,2

  1. 西藏大学 信息科学技术学院,拉萨 850000
  2. 藏文信息技术教育部工程研究中心(西藏大学),拉萨 850000
  • 收稿日期:2024-04-12 修回日期:2024-06-28 接受日期:2024-06-28 发布日期:2025-04-08 出版日期:2025-04-10
  • 通讯作者: 尼玛扎西
  • 作者简介:杨杰(1997—),男,甘肃天水人,硕士研究生,主要研究方向:自然语言处理
    仁青东主(1991—),男(藏族),甘肃卓尼人,讲师,博士,主要研究方向:藏语自然语言处理
    祁晋东(1997—),男,山西运城人,硕士研究生,主要研究方向:信息处理、计算机视觉
    才让东知(1992—),男(藏族),甘肃玛曲人,硕士研究生,主要研究方向:藏语自然语言处理。
  • 基金资助:
    新一代人工智能国家科技重大专项(2022ZD0116100);省部共建藏语智能信息处理及应用国家重点实验室开放课题项目(2023-Z-006)

Tibetan word segmentation system based on pre-trained model tokenization reconstruction

Jie YANG1,2, Tashi NYIMA1,2, Dongrub RINCHEN1,2, Jindong QI1, Dondrub TSHERING1,2

  1. School of Information Science and Technology, Tibet University, Lhasa, Xizang 850000, China
  2. Engineering Research Center of the Ministry of Education for Tibetan Information Technology (Tibet University), Lhasa, Xizang 850000, China
  • Received: 2024-04-12 Revised: 2024-06-28 Accepted: 2024-06-28 Online: 2025-04-08 Published: 2025-04-10
  • Contact: Tashi NYIMA
  • About the authors: YANG Jie, born in 1997, M.S. candidate. His research interests include natural language processing.
    RINCHEN Dongrub, born in 1991, Ph.D., lecturer. His research interests include Tibetan natural language processing.
    QI Jindong, born in 1997, M.S. candidate. His research interests include information processing and computer vision.
    TSHERING Dondrub, born in 1992, M.S. candidate. His research interests include Tibetan natural language processing.
  • Supported by:
    National Next-Generation Artificial Intelligence Science and Technology Major Project (2022ZD0116100); Open Topic Project of the State Key Laboratory of Tibetan Intelligent Information Processing and Application (2023-Z-006)

摘要:

针对现有的预训练模型在藏文分词任务中表现不佳的问题,提出一种建立重构标记器规范约束文本,随后重构藏文预训练模型的标记器以进行藏文分词任务的方法。首先,对原始文本进行规范化操作,以解决因语言混用等导致的错误切分的问题;其次,对预训练模型进行音节粒度的标记器重构,使得切分单元与标注单元平行;最后,在利用改进的滑动窗口还原法完成黏着切分后,利用“词首、词中、词尾、孤立”(BMES)四元标注法建立Re-TiBERT-BiLSTM-CRF模型,从而得到藏文分词系统。实验结果表明,重构标记器后的预训练模型在分词任务中明显优于原始预训练模型,而得到的系统拥有较高的藏文分词精确率,F1值最高可达97.15%,能够较好地完成藏文分词任务。

关键词: 藏语信息处理, 藏文分词模型, 预训练模型, 自然语言处理, 标记器重构

Abstract:

To address the poor performance of existing pre-trained models on Tibetan word segmentation, a method was proposed that first defines tokenizer-reconstruction norms to constrain the input text and then reconstructs the tokenizer of a Tibetan pre-trained model for the segmentation task. Firstly, the original text was normalized to eliminate incorrect segmentation caused by factors such as mixed-language text. Secondly, the tokenizer of the pre-trained model was reconstructed at syllable granularity so that the tokenization units align with the annotation units. Finally, after agglutinative forms were split using an improved sliding-window restoration method, the Re-TiBERT-BiLSTM-CRF model was built with the four-tag “Begin, Middle, End, Single” (BMES) annotation scheme, yielding the Tibetan word segmentation system. Experimental results show that the pre-trained model with the reconstructed tokenizer significantly outperforms the original pre-trained model on segmentation, and that the resulting system achieves high Tibetan word segmentation precision, with an F1 score of up to 97.15%.
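As an illustration of why syllable-level units and BMES labels fit together, the following is a minimal Python sketch, not the authors' implementation. It assumes syllables are delimited by the tsheg mark (U+0F0B) and that the shad (U+0F0D) forms its own unit; the function names `split_syllables`, `bmes_tags`, and `decode` are hypothetical. It covers neither the text normalization step, the sliding-window handling of agglutinative forms, nor the Re-TiBERT-BiLSTM-CRF model itself.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)
SHAD = "\u0f0d"   # Tibetan clause/sentence-final mark (shad)


def split_syllables(text):
    """Split text into syllable units on tsheg (assumption: shad is its own unit)."""
    units, current = [], ""
    for ch in text:
        if ch == TSHEG or ch == SHAD or ch.isspace():
            if current:
                units.append(current)
                current = ""
            if ch == SHAD:
                units.append(ch)
        else:
            current += ch
    if current:
        units.append(current)
    return units


def bmes_tags(words):
    """Give one BMES tag per syllable; `words` is a list of syllable lists."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def decode(syllables, tags):
    """Rebuild words from a syllable sequence and its BMES tags."""
    words, current = [], []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    syllables = split_syllables("བོད་ཡིག།")
    tags = bmes_tags([syllables[:2], syllables[2:]])
    print(list(zip(syllables, tags)))
    print(decode(syllables, tags))
```

Because each syllable receives exactly one tag, a tokenizer that emits one token per syllable keeps the model's input sequence the same length as the label sequence, which is the alignment the abstract describes.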

Key words: Tibetan language information processing, Tibetan word segmentation model, pre-trained model, natural language processing, tokenization reconstruction

中图分类号: