Tibetan word segmentation system based on pre-trained model tokenization reconstruction
Jie YANG, Tashi NYIMA, Dongrub RINCHEN, Jindong QI, Dondrub TSHERING
Journal of Computer Applications    2025, 45 (4): 1199-1204.   DOI: 10.11772/j.issn.1001-9081.2024040442

To address the poor performance of existing pre-trained models on Tibetan word segmentation, a method was proposed that establishes a tokenization reconstruction standard to regularize and constrain the text, and then reconstructs the tokenization of a Tibetan pre-trained model so that it can perform segmentation. First, the original text was standardized to eliminate incorrect cuts caused by language mixing and similar problems. Second, the tokenization of the pre-trained model was reconstructed at syllable granularity so that the units it produces align with the labeled units. Finally, after sticky cuts were resolved with an improved sliding window restoration method, the Re-TiBERT-BiLSTM-CRF model was built using the four-tag "Begin, Middle, End, Single" (BMES) annotation scheme, yielding the Tibetan word segmentation system. Experimental results show that the pre-trained model with reconstructed tokenization performs significantly better than the original pre-trained model on segmentation tasks. The resulting system segments Tibetan words with high precision, reaching an F1 score of up to 97.15%.
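
The BMES scheme at syllable granularity can be illustrated with a minimal sketch. It assumes that syllables are delimited by the Tibetan tsheg mark (U+0F0B) and that gold-standard words are already available as a list. The function names `split_syllables`, `bmes_tags`, and `decode` are hypothetical and are not taken from the paper. The sketch does not cover the standardization step, the sliding window restoration of sticky cuts, or the Re-TiBERT-BiLSTM-CRF model itself.

```python
# Illustrative BMES labeling at syllable granularity (not the authors' code).
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark, assumed to delimit syllables


def split_syllables(word):
    """Split one word into its syllables at the tsheg mark."""
    return [s for s in word.split(TSHEG) if s]


def bmes_tags(words):
    """Turn a list of segmented words into (syllable, tag) pairs."""
    pairs = []
    for word in words:
        syllables = split_syllables(word)
        if not syllables:
            continue  # skip empty strings
        if len(syllables) == 1:
            pairs.append((syllables[0], "S"))  # single-syllable word
        else:
            # Begin, zero or more Middle, End
            tags = ["B"] + ["M"] * (len(syllables) - 2) + ["E"]
            pairs.extend(zip(syllables, tags))
    return pairs


def decode(pairs):
    """Rebuild words from (syllable, tag) pairs; a word ends at E or S."""
    words, current = [], []
    for syllable, tag in pairs:
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(TSHEG.join(current))
            current = []
    if current:  # tolerate a sequence that ends without E or S
        words.append(TSHEG.join(current))
    return words
```

In such a setup, a sequence labeler predicts one tag per syllable and `decode` turns the predicted tags back into words. This only works if the model's tokens line up one-to-one with the labeled syllables, which is the alignment the tokenization reconstruction is meant to provide.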
