Journal of Computer Applications, 2025, Vol. 45, Issue (8): 2522-2529. DOI: 10.11772/j.issn.1001-9081.2024071036

• Artificial intelligence

Sequence labeling optimization method combined with entity boundary offset

Jing YU1,2,3, Yanping CHEN1,2,3, Ying HU1,2,3, Ruizhang HUANG1,2,3, Yongbin QIN1,2,3

  1. Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence, Guizhou University, Guiyang Guizhou 550025, China
    2. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
    3. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China
  • Received: 2024-07-23 Revised: 2024-10-12 Accepted: 2024-10-16 Online: 2024-11-19 Published: 2025-08-10
  • Contact: Yanping CHEN
  • About the authors: YU Jing, born in 1999, M. S. candidate. Her research interests include natural language processing and named entity recognition.
    HU Ying, born in 1996, Ph. D. candidate. His research interests include natural language processing.
    HUANG Ruizhang, born in 1979, Ph. D., professor. Her research interests include data fusion analysis, text mining, network mining, and knowledge discovery.
    QIN Yongbin, born in 1980, Ph. D., professor. His research interests include big data management and application, and multi-source data fusion.
  • Supported by:
    Major Science and Technology Foundation of Guizhou Province ([2024]003); National Key Research and Development Program of China (2023YFC3304500); National Natural Science Foundation of China (62166007)

结合实体边界偏移的序列标注优化方法

余婧1,2,3, 陈艳平1,2,3, 扈应1,2,3, 黄瑞章1,2,3, 秦永彬1,2,3

  1. 贵州大学 文本计算与认知智能教育部工程研究中心,贵阳 550025
    2.公共大数据国家重点实验室(贵州大学),贵阳 550025
    3.贵州大学 计算机科学与技术学院,贵阳 550025
  • 通讯作者: 陈艳平
  • 作者简介:余婧(1999—),女,贵州贵阳人,硕士研究生,CCF会员,主要研究方向:自然语言处理、命名实体识别
    扈应(1996—),男,重庆人,博士研究生,主要研究方向:自然语言处理
    黄瑞章(1979—),女,天津人,教授,博士,CCF会员,主要研究方向:数据融合分析、文本挖掘、网络挖掘、知识发现
    秦永彬(1980—),男,山东烟台人,教授,博士,CCF高级会员,主要研究方向:大数据管理与应用、多源数据融合。
  • 基金资助:
    贵州省科学技术基金重点资助项目([2024]003);国家重点研发计划项目(2023YFC3304500);国家自然科学基金资助项目(62166007)

Abstract:

To address the positional deviation between predicted and true entity boundaries in sequence labeling models for Named Entity Recognition (NER), a sequence labeling optimization method combined with entity boundary offset was proposed. Firstly, the concept of boundary offset was introduced to quantify the positional relationship between each word and the entity boundaries: the relative offset between each word and its nearest entity boundary was calculated, and these offsets were used to generate candidate spans for the entity boundaries. Secondly, Intersection-over-Union (IoU) was used as the criterion for filtering out low-quality candidate spans, retaining the spans most likely to represent the entity boundaries. Finally, a boundary adjustment module was used to update the positions of the entity boundaries in the label sequence based on the candidate spans, thereby optimizing the entity boundaries across the entire label sequence and improving entity recognition performance. Experimental results show that the proposed method achieves F1-scores of 80.48%, 96.42%, and 94.80% on the CLUENER2020, Resume-zh, and MSRA datasets, respectively, validating its effectiveness on the NER task.
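
The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: spans are inclusive token-index pairs, each token supplies one offset to the entity start and one to the entity end, IoU is measured against the span decoded from the label sequence, the surviving candidates are combined by majority vote, and labels follow a BIO scheme. The names span_iou, candidate_spans, and adjust_boundary and the threshold of 0.5 are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices (assumed convention)


def span_iou(a: Span, b: Span) -> float:
    """Intersection-over-Union of two inclusive 1-D token spans."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def candidate_spans(start_offsets: List[int], end_offsets: List[int]) -> List[Span]:
    """Token i proposes the span (i + start_offsets[i], i + end_offsets[i])."""
    n = len(start_offsets)
    spans = []
    for i in range(n):
        s, e = i + start_offsets[i], i + end_offsets[i]
        if 0 <= s <= e < n:  # discard proposals that fall outside the sentence
            spans.append((s, e))
    return spans


def adjust_boundary(labels: List[str], predicted: Span, etype: str,
                    start_offsets: List[int], end_offsets: List[int],
                    threshold: float = 0.5) -> List[str]:
    """Filter candidates by IoU with the predicted span, then relabel.

    Majority voting over the surviving candidates is an assumption of this
    sketch, not a detail taken from the paper.
    """
    kept = [c for c in candidate_spans(start_offsets, end_offsets)
            if span_iou(c, predicted) >= threshold]
    if not kept:
        return list(labels)  # no reliable candidate: keep the original labels
    (start, end), _ = Counter(kept).most_common(1)[0]
    new_labels = list(labels)
    for i in range(predicted[0], predicted[1] + 1):
        new_labels[i] = "O"  # clear the old entity before writing the new one
    new_labels[start] = f"B-{etype}"
    for i in range(start + 1, end + 1):
        new_labels[i] = f"I-{etype}"
    return new_labels


# Toy example: the labeler predicted tokens 2..4, but most offsets point to 2..5.
labels = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O"]
start_offsets = [0, 1, 0, -1, -2, -3, -4]
end_offsets = [0, 4, 3, 2, 1, 0, -1]
adjusted = adjust_boundary(labels, (2, 4), "ORG", start_offsets, end_offsets)
print(adjusted)
```

In this toy input, token 0 proposes the span (0, 0), which does not overlap the predicted span and is filtered out, while the remaining tokens agree on (2, 5), so the end boundary moves one token to the right.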

Key words: Named Entity Recognition (NER), sequence labeling, boundary offset, Intersection-over-Union (IoU), boundary adjustment

摘要:

针对序列标注模型在命名实体识别(NER)任务中出现的识别的实体边界与真实的实体边界之间存在位置偏差的问题,提出一种结合实体边界偏移的序列标注优化方法。首先,引入边界偏移量的概念量化每个词与实体边界之间的位置关系,计算每个词与最近实体边界的相对偏移量,再利用这些偏移量生成实体边界的候选跨度;其次,利用交并比(IoU)作为筛选标准过滤低质量的候选跨度,以保留最有可能代表实体边界的候选跨度;最后,通过边界调整模块,根据候选跨度更新标签序列中实体边界的位置,从而优化整个标签序列的实体边界,并提升实体识别的性能。实验结果表明,所提方法在数据集CLUENER2020、Resume-zh和MSRA上的F1值分别达到了80.48%、96.42%和94.80%,验证了该方法对NER任务的有效性。

关键词: 命名实体识别, 序列标注, 边界偏移, 交并比, 边界调整
