Domain-specific term recognition method based on word embedding and conditional random field

doi:10.11772/j.issn.1001-9081.2016.11.3146

Journal of Computer Applications, 2016, Vol. 36, Issue (11): 3146-3151.

FENG Yanhong^1,2, YU Hong^1,2, SUN Geng^1,2, ZHAO Yujin^3

1. College of Information Engineering, Dalian Ocean University, Dalian Liaoning 116023, China;
2. Key Laboratory of Marine Information Technology of Liaoning Province (Dalian Ocean University), Dalian Liaoning 116023, China;
3. College of Economics and Management, Dalian Ocean University, Dalian Liaoning 116023, China

Received: 2016-04-22 Revised: 2016-06-20 Online: 2016-11-12 Published: 2016-11-10
Corresponding author: YU Hong
About the authors: FENG Yanhong (1980-), female, born in Suihua, Heilongjiang, lecturer, M.S., CCF member; research interests: natural language processing, information retrieval. YU Hong (1968-), female, born in Dalian, Liaoning, professor, Ph.D., CCF member; research interests: data mining, information retrieval. SUN Geng (1979-), male, born in Qiqihar, Heilongjiang, associate professor, M.S.; research interests: embedded systems. ZHAO Yujin (1990-), female, born in Yingkou, Liaoning, M.S. candidate; research interests: data mining, information retrieval.

Abstract

Abstract: Domain-specific term recognition methods based on statistical features ignore the semantics and domain characteristics of terms, which degrades the recognition results. To address this problem, a domain-specific term recognition method based on word embedding and Conditional Random Field (CRF) was proposed. The method exploits two properties: word embeddings have strong semantic expressiveness, and the similarity between a word and domain terms has strong domain expressiveness. On top of the statistical features, the similarity between the word embedding of a word and the word embeddings of domain terms was added as a feature to form a word-embedding-based feature vector, and CRF was used to combine these features for domain term recognition. Finally, experiments were carried out on a domain corpus and the SogouCA corpus. The precision, recall and F-measure of the recognition results reached 0.9855, 0.9439 and 0.9643, respectively, indicating that the proposed method achieves good results.

Key words: word embedding, conditional random field (CRF), term recognition, similarity feature
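The similarity feature described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the similarities to several domain terms are aggregated or how the value is passed to the CRF, so the choices below are assumptions made for illustration only: the maximum cosine similarity to a small set of seed terms, discretization into five buckets, and a tab-separated row layout with a part-of-speech column and a B/I/O label. The names `similarity_feature`, `crf_rows` and the `SIM_*` labels are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_feature(word, embeddings, seed_terms, bins=5):
    """Bucketed maximum cosine similarity between `word` and the seed terms.

    Assumption: max aggregation and equal-width buckets; the paper may differ.
    Returns a label such as "SIM_3", or "SIM_OOV" for out-of-vocabulary words.
    """
    if word not in embeddings:
        return "SIM_OOV"
    sims = [cosine(embeddings[word], embeddings[t])
            for t in seed_terms if t in embeddings]
    best = max(sims, default=0.0)
    best = min(max(best, 0.0), 1.0)  # clamp negatives and rounding overshoot
    bucket = min(int(best * bins), bins - 1)
    return f"SIM_{bucket}"


def crf_rows(tokens, embeddings, seed_terms):
    """One tab-separated row per (word, pos, label) token, with the
    similarity bucket inserted as an extra feature column (assumed layout)."""
    return ["\t".join([w, pos, similarity_feature(w, embeddings, seed_terms), label])
            for w, pos, label in tokens]


# Toy 2-D embeddings, purely illustrative.
toy_embeddings = {"trawl": [1.0, 0.0], "net": [1.0, 1.0], "policy": [0.0, 1.0]}
toy_rows = crf_rows([("net", "n", "B"), ("policy", "n", "O")],
                    toy_embeddings, ["trawl"])
```

Discretizing the similarity keeps the feature usable in CRF toolkits that expect categorical columns; a toolkit that accepts real-valued features could take the raw cosine value instead.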

CLC number: TP391.4

FENG Yanhong, YU Hong, SUN Geng, ZHAO Yujin. Domain-specific term recognition method based on word embedding and conditional random field[J]. Journal of Computer Applications, 2016, 36(11): 3146-3151.

