Tibetan word segmentation system based on pre-trained model tokenization reconstruction

Jie YANG, Tashi NYIMA, Dongrub RINCHEN, Jindong QI, Dondrub TSHERING

Journal of Computer Applications 2025, 45 (4): 1199-1204. DOI: 10.11772/j.issn.1001-9081.2024040442


To address the poor performance of existing pre-trained models on Tibetan word segmentation, a method was proposed that establishes a tokenization reconstruction standard to regulate and constrain the text, and then reconstructs the tokenization of a Tibetan pre-trained model for the segmentation task. Firstly, the original text was standardized to eliminate incorrect cuts caused by language mixing and similar issues. Secondly, the tokenization of the pre-trained model was reconstructed at syllable granularity so that the tokenization units align with the labeled units. Finally, after the sticky cuts were completed with an improved sliding window restoration method, the Re-TiBERT-BiLSTM-CRF model was built using the “Begin, Middle, End and Single” (BMES) four-tag annotation scheme, yielding the Tibetan word segmentation system. Experimental results show that the pre-trained model with reconstructed tokenization performs significantly better than the original pre-trained model on segmentation tasks. The resulting system achieves high Tibetan word segmentation precision, with an F1 value of up to 97.15%.
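The BMES annotation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each word is already given as a list of syllables; the function names `words_to_bmes` and `bmes_to_words` and the placeholder syllables are hypothetical and are not taken from the paper, whose actual pipeline (Re-TiBERT-BiLSTM-CRF) predicts these tags with a neural model.

```python
from typing import List, Tuple


def words_to_bmes(words: List[List[str]]) -> List[Tuple[str, str]]:
    """Label each syllable with B (begin), M (middle), E (end) or S (single)."""
    tagged: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # ignore empty words rather than fail
        if len(word) == 1:
            tagged.append((word[0], "S"))
            continue
        tagged.append((word[0], "B"))
        for syllable in word[1:-1]:
            tagged.append((syllable, "M"))
        tagged.append((word[-1], "E"))
    return tagged


def bmes_to_words(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Rebuild words from (syllable, tag) pairs; a word closes on E or S."""
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in tagged:
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # keep a trailing unfinished word instead of dropping it
        words.append(current)
    return words


# Placeholder syllables stand in for Tibetan syllables (illustration only).
example = [["a"], ["b", "c"], ["d", "e", "f"]]
tagged_example = words_to_bmes(example)
```

Because the labels are assigned per syllable, the model's tokenization units need to be syllables as well, which is the alignment that the syllable-granularity reconstruction step aims for.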
