Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (5): 1288-1292. DOI: 10.11772/j.issn.1001-9081.2018102155

• Artificial Intelligence •

Recognition model for French named entities based on deep neural network

YAN Hong1, CHEN Xingshu1,2, WANG Wenxian3, WANG Haizhou3, YIN Mingyong3

  1. College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China;
  2. College of Cybersecurity, Sichuan University, Chengdu, Sichuan 610065, China;
  3. Cybersecurity Research Institute, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2018-10-26 Revised: 2018-12-26 Online: 2019-05-10 Published: 2019-05-14
  • Corresponding author: WANG Wenxian
  • About the authors: YAN Hong (born 1994), female, from Guangyuan, Sichuan, M.S. candidate; research interests: named entity recognition, public opinion analysis. CHEN Xingshu (born 1968), female, from Zigong, Sichuan, professor, Ph.D.; research interests: network security, cloud computing, big data security. WANG Wenxian (born 1978), male, from Jinjiang, Fujian, lecturer, Ph.D.; research interests: network security, cloud computing, big data security. WANG Haizhou (born 1986), male, from Nanchong, Sichuan, lecturer, Ph.D.; research interests: network security, P2P. YIN Mingyong (born 1983), male, from Hanzhong, Shaanxi, Ph.D. candidate; research interest: public opinion analysis.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61802270), the Transformative Technology International R&D and Transformation Platform of the National "Double Creation" Demonstration Base (C700011), and the Sichuan Key Research and Development Project (2018G20100).

Abstract: In existing research on French Named Entity Recognition (NER), machine learning models mostly rely on the character-level morphological features of words, while multilingual generic NER models rely on the semantic features carried by character and word embeddings; neither considers semantic, morphological and grammatical features together. To address this shortcoming, a deep neural network based model, CGC-fr, was designed for recognizing French named entities. First, the word embedding, character embeddings and grammatical feature vector of each word were extracted from the text. Then, a Convolutional Neural Network (CNN) extracted a character feature for each word from its character embedding sequence. Finally, a Bidirectional Gated Recurrent Unit network (BiGRU) and a Conditional Random Field (CRF) classifier labeled the named entities in French text based on the word embeddings, character features and grammatical feature vectors. In the experiments, CGC-fr reached an F1 score of 82.16% on the test set, which is 5.67, 1.79 and 1.06 percentage points higher than the machine learning model NERC-fr and the multilingual generic neural network models LSTM-CRF (Long Short-Term Memory network with CRF) and Char attention, respectively. The results show that CGC-fr, which fuses the three kinds of features, has an advantage over the other models.

Key words: Named Entity Recognition (NER), French, deep neural network, Natural Language Processing (NLP), sequence labeling
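
The first two steps described in the abstract (a CNN summarizing a word's character embeddings, then fusion with the word embedding and the grammar feature vector) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the abstract does not give filter widths, embedding sizes or the pooling scheme, so max-over-time pooling, zero padding of short words, and all toy vectors below are assumptions, and the names `conv1d_max_pool` and `fuse_features` are hypothetical. The BiGRU and CRF layers that consume the fused vectors are omitted.

```python
from typing import List

Vector = List[float]
Filter = List[Vector]  # one weight row per character position in the window


def conv1d_max_pool(char_embs: List[Vector], filters: List[Filter]) -> Vector:
    """Slide each filter over the character embeddings and keep its largest response.

    Assumption: max-over-time pooling, one output value per filter.
    """
    dim = len(char_embs[0])
    features: Vector = []
    for filt in filters:
        width = len(filt)
        # Assumption: words shorter than the filter are right-padded with zeros.
        padded = char_embs + [[0.0] * dim] * max(0, width - len(char_embs))
        responses = []
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            responses.append(
                sum(w * x
                    for f_row, c_row in zip(filt, window)
                    for w, x in zip(f_row, c_row))
            )
        features.append(max(responses))
    return features


def fuse_features(word_emb: Vector, char_feat: Vector, grammar_feat: Vector) -> Vector:
    """Concatenate the three per-word feature vectors into one input vector."""
    return word_emb + char_feat + grammar_feat


# Toy data for a four-character word (all values are made up for illustration).
char_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # width-2 filter
    [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]],  # width-3 filter
]
word_emb = [0.5, -0.5, 0.25]   # hypothetical pretrained word embedding
grammar_feat = [1.0, 0.0]      # hypothetical one-hot grammar (e.g. POS) feature

char_feat = conv1d_max_pool(char_embs, filters)
fused = fuse_features(word_emb, char_feat, grammar_feat)
print(char_feat)  # [2.0, 3.0]
print(fused)      # [0.5, -0.5, 0.25, 2.0, 3.0, 1.0, 0.0]
```

In a full model, one such fused vector per word would form the input sequence to the BiGRU, whose outputs the CRF layer would turn into a tag sequence.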

CLC Number: