Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (11): 3146-3151. DOI: 10.11772/j.issn.1001-9081.2016.11.3146

• Artificial Intelligence •

Domain-specific term recognition method based on word embedding and conditional random field

FENG Yanhong1,2, YU Hong1,2, SUN Geng1,2, ZHAO Yujin3

  1. College of Information Engineering, Dalian Ocean University, Dalian Liaoning 116023, China;
  2. Key Laboratory of Marine Information Technology of Liaoning Province (Dalian Ocean University), Dalian Liaoning 116023, China;
  3. College of Economics Management, Dalian Ocean University, Dalian Liaoning 116023, China
  • Received: 2016-04-22 Revised: 2016-06-20 Online: 2016-11-10 Published: 2016-11-12
  • Corresponding author: YU Hong
  • About the authors: FENG Yanhong (born 1980), female, from Suihua, Heilongjiang, lecturer, M.S., CCF member, research interests: natural language processing, information retrieval; YU Hong (born 1968), female, from Dalian, Liaoning, professor, Ph.D., CCF member, research interests: data mining, information retrieval; SUN Geng (born 1979), male, from Qiqihar, Heilongjiang, associate professor, M.S., research interests: embedded systems; ZHAO Yujin (born 1990), female, from Yingkou, Liaoning, master's candidate, research interests: data mining, information retrieval.

Abstract: Domain-specific term recognition methods based on statistical features ignore the semantics and domain characteristics of terms, which degrades the recognition results. To address this problem, a domain-specific term recognition method based on word embedding and Conditional Random Field (CRF) is proposed. The method exploits two properties: word embeddings express semantics well, and the similarity between a word and domain terms expresses domain membership well. On top of the statistical features, the similarity between the word embedding of each word and the word embeddings of domain terms is added as a feature, forming a word-embedding-based feature vector, and a CRF combines these features to recognize domain terms. Experiments on a domain corpus and the SogouCA corpus gave a precision, recall and F-measure of 0.9855, 0.9439 and 0.9643, respectively, indicating that the proposed method achieves good recognition performance.
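
The similarity feature described in the abstract could be computed along the following lines. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the abstract does not say how similarities to several domain terms are combined or how the value is passed to the CRF, so taking the maximum cosine similarity over a hypothetical seed-term list and binning it are illustrative choices. The function names, the `n_bins` parameter and the toy vectors are all invented here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_feature(word, embeddings, seed_terms):
    """Highest cosine similarity between `word` and any seed domain term.

    Assumption: max is used as the aggregation; the abstract does not specify one.
    Words or seed terms without an embedding are skipped.
    """
    if word not in embeddings:
        return 0.0
    sims = [cosine(embeddings[word], embeddings[t])
            for t in seed_terms if t in embeddings]
    return max(sims, default=0.0)


def token_features(word, embeddings, seed_terms, n_bins=5):
    """Feature dict for one token, to sit alongside the statistical features."""
    sim = similarity_feature(word, embeddings, seed_terms)
    # Discretise the similarity so it can serve as a categorical CRF feature
    # (assumed step; negative similarities fall into bin 0).
    sim_bin = min(int(max(sim, 0.0) * n_bins), n_bins - 1)
    return {"word": word, "sim": sim, "sim_bin": sim_bin}


# Toy 2-dimensional vectors, invented purely for illustration.
toy_embeddings = {
    "trawl": [1.0, 0.0],
    "net": [0.8, 0.6],
    "policy": [0.0, 1.0],
}
seed_terms = ["trawl"]

features = [token_features(w, toy_embeddings, seed_terms)
            for w in ["trawl", "net", "policy", "unknown"]]
```

In a full pipeline, dictionaries like these would be merged with the statistical features of each token and handed, one list per sentence, to a CRF toolkit for sequence labelling.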

Key words: word embedding, Conditional Random Field (CRF), term recognition, similarity feature
