Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (2): 535-540.DOI: 10.11772/j.issn.1001-9081.2019101717

• CCF Bigdata 2019 •

Alarm text named entity recognition based on BERT

Yue WANG1,2, Mengxuan WANG1,2, Sheng ZHANG1,2, Wen DU1,2

  1. DS Information Technology Company Limited, Shanghai 200032, China
  2. The First Research Institute of Telecommunications Technology, Shanghai 200032, China
  • Received: 2019-08-20  Revised: 2019-10-21  Accepted: 2019-10-24  Online: 2019-10-31  Published: 2020-02-10
  • Contact: Wen DU
  • About authors: WANG Yue, born in 1994 in Lianyungang, Jiangsu, M. S. candidate, CCF member. His research interests include machine learning and natural language processing.
    WANG Mengxuan, born in 1993 in Yinchuan, Ningxia, M. S. candidate, CCF member. His research interests include text classification and sentiment analysis.
    ZHANG Sheng, born in 1994 in Wuhan, Hubei, M. S. candidate, CCF member. His research interests include deep learning and public opinion analysis.
  • Supported by:
    the Shanghai Informatization Development (Big Data Development) Special Fund Project(201901043);the Shanghai Industrial Transformation and Upgrading Special Fund (Industrial Technology Innovation) Project(JJ-YJCX-01-18-3418)


Abstract:

Aiming at the problem that key entity information in the police-alarm domain is difficult to recognize, a neural network model based on BERT (Bidirectional Encoder Representations from Transformers), namely BERT-BiLSTM-Attention-CRF, was proposed to recognize and extract the related named entities, and corresponding entity annotation specifications were designed for different case types. In this model, BERT pre-trained word vectors were used in place of the static word vectors trained by traditional methods such as Skip-gram and Continuous Bag of Words (CBOW), improving the representation ability of the word vectors and solving the word-boundary division problem that arises when Chinese corpora are trained with character vectors. An attention mechanism was also used to improve the architecture of the classical Named Entity Recognition (NER) model BiLSTM-CRF. The BERT-BiLSTM-Attention-CRF model achieves an accuracy of 91% on the test set, which is 7 percentage points higher than the CRF++ baseline and above the 86% accuracy of the BiLSTM-CRF model; the F1 scores of entities such as person names, loss amounts, and handling methods are all higher than 0.87.
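The annotation side of such a pipeline can be illustrated with a minimal character-level BIO tagging sketch. This is a pure-Python illustration: the entity types (`PER`, `AMT`) and the example sentence are hypothetical, not the paper's actual tag set or corpus.

```python
# Character-level BIO tagging sketch for alarm-text NER.
# Entity types (PER = person name, AMT = loss amount) and the example
# sentence are illustrative assumptions, not the paper's tag set.

def bio_tag(text, entities):
    """Convert (start, end, label) character spans into per-character BIO tags."""
    tags = ["O"] * len(text)
    for start, end, label in entities:
        tags[start] = "B-" + label          # first character of the entity
        for i in range(start + 1, end):     # remaining characters
            tags[i] = "I-" + label
    return tags

# "张三被盗5000元" -- "Zhang San had 5000 yuan stolen"
text = "张三被盗5000元"
spans = [(0, 2, "PER"), (4, 9, "AMT")]      # "张三" and "5000元"
for ch, tag in zip(text, bio_tag(text, spans)):
    print(ch, tag)
```

Tagging at the character level sidesteps Chinese word segmentation entirely, which is why the word-boundary problem mentioned in the abstract matters for character-vector training.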

Key words: alarm text, Named Entity Recognition (NER), pre-trained language model, annotation specification, word vector
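The CRF layer at the top of a BiLSTM-CRF style model picks a globally best tag sequence rather than per-token maxima. A minimal pure-Python Viterbi decoding sketch shows the idea; all scores below are made-up toy values, not the paper's learned parameters.

```python
# Viterbi decoding sketch for the CRF output layer of a BiLSTM-CRF tagger.
# All emission and transition scores are toy values for illustration.

def viterbi(emissions, transitions, tags):
    """emissions: list of {tag: score} per token;
    transitions: {(prev_tag, cur_tag): score}.
    Returns the highest-scoring tag path."""
    # Initialize with the first token's emission scores.
    best = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        new = {}
        for cur in tags:
            # Choose the previous tag maximizing score-so-far + transition.
            prev, (score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + transitions[(kv[0], cur)],
            )
            new[cur] = (score + transitions[(prev, cur)] + em[cur], path + [cur])
        best = new
    return max(best.values(), key=lambda v: v[0])[1]

tags = ["B", "I", "O"]
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I")] = -10.0  # strongly penalize I directly after O
emissions = [
    {"B": 1.0, "I": 0.0, "O": 2.0},
    {"B": 0.0, "I": 3.0, "O": 1.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
print(viterbi(emissions, transitions, tags))
```

The transition penalty effectively forbids an I tag right after O, so the decoder returns the consistent path B-I-O even though the first token's highest emission score is for O; this global consistency is what the CRF layer adds on top of the BiLSTM's per-token scores.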

