Journal of Computer Applications

• Artificial Intelligence and Simulation •

Construction of Named Entity Corpus for Highway Bridge Inspection Domain

MO Tianjin (莫天金), LI Ren (李韧), YANG Jianxi (杨建喜), LI Tong (李童), JIANG Shixin (蒋仕新), LI Dong (李东)

  1. School of Information Science and Engineering, Chongqing Jiaotong University
  • Received: 2019-11-25 Revised: 2019-12-18 Online: 2019-12-18 Published: 2020-05-09
  • Corresponding author: LI Ren

Abstract: No Chinese named entity corpus currently fits the textual characteristics of highway bridge inspection in China, and named entity recognition approaches for this domain have not been studied effectively. To address this gap, a relatively large-scale named entity corpus with high labeling quality was constructed for periodic highway bridge inspection. Based on an analysis of the domain characteristics of this type of text, six target named entity categories, including bridge entity, bridge structure entity and structural damage entity, were defined together with their labeling specifications. Data preparation and preprocessing were completed for 1,245 real bridge inspection reports and more than 14 million characters of related webpage text. Of these reports, 150 were selected as the labeling corpus and labeled over multiple iterative rounds, giving more than 320,000 labeled characters in total. The final labeling consistency across entity types ranged from 85.2% (lowest) to 98.5% (highest). Mainstream named entity recognition algorithms and general-domain pre-trained models were selected for preliminary experiments on the labeled corpus; the results show that recognition performance leaves considerable room for improvement. The proposed corpus provides definitions of recognition targets for subsequent research and lays a foundation of data and evaluation.
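The abstract reports labeling consistency per entity type but does not say how it was computed. The following is a minimal sketch of one common way to measure it, assuming an F1-style exact-match agreement between two annotators' entity spans. This metric is an assumption, not the paper's stated method, and the label names BRIDGE, STRUCTURE and DAMAGE are hypothetical stand-ins for three of the six categories.

```python
from collections import defaultdict


def agreement_by_type(ann_a, ann_b):
    """Return {entity_type: agreement} for two annotators.

    Each annotation is a (start, end, entity_type) tuple over the same text.
    Agreement is 2 * matched / (count_a + count_b), where a span counts as
    matched only if both annotators gave it identical boundaries and type.
    (Assumed metric; the source does not specify its consistency formula.)
    """
    a_by_type = defaultdict(set)
    b_by_type = defaultdict(set)
    for start, end, etype in ann_a:
        a_by_type[etype].add((start, end))
    for start, end, etype in ann_b:
        b_by_type[etype].add((start, end))

    scores = {}
    # Every type in the union has at least one span, so the divisor is > 0.
    for etype in sorted(set(a_by_type) | set(b_by_type)):
        a_spans, b_spans = a_by_type[etype], b_by_type[etype]
        matched = len(a_spans & b_spans)
        scores[etype] = 2 * matched / (len(a_spans) + len(b_spans))
    return scores


# Toy annotations with hypothetical labels and character offsets.
annotator_a = [(0, 3, "BRIDGE"), (5, 7, "STRUCTURE"), (9, 11, "DAMAGE")]
annotator_b = [(0, 3, "BRIDGE"), (5, 7, "STRUCTURE"), (9, 12, "DAMAGE")]

for etype, score in agreement_by_type(annotator_a, annotator_b).items():
    print(f"{etype}: {score:.1%}")
```

Exact-match scoring is strict: in the toy data the two annotators disagree on the end boundary of the DAMAGE span, so that type scores zero even though the spans overlap. A relaxed overlap criterion would give a higher figure, which is why the choice of metric matters when comparing reported consistency values.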
