Disambiguation method of multi-feature fusion based on HowNet sememe and Word2vec word embedding representation

doi:10.11772/j.issn.1001-9081.2020101625

Journal of Computer Applications, 2021, Vol. 41, Issue (8): 2193-2198. DOI: 10.11772/j.issn.1001-9081.2020101625

Special Issue: Artificial Intelligence

WANG Wei, ZHAO Erping, CUI Zhiyuan, SUN Hao

College of Information Engineering, Xizang Minzu University, Xianyang Shaanxi 712082, China

Received: 2020-10-20 Revised: 2020-12-29 Online: 2021-08-10 Published: 2021-01-27
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61762082) and the Tibet Autonomous Region Science and Technology Program (XZ202001ZY0055G).

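Corresponding author: ZHAO Erping
About the authors: WANG Wei (born 1996), male, from Yangzhou, Jiangsu, M.S. candidate, CCF member; research interests: natural language processing, knowledge graphs. ZHAO Erping (born 1976), male, from Binxian, Shaanxi, associate professor, M.S., CCF member; research interests: big data, knowledge graphs. CUI Zhiyuan (born 1997), male, from Weifang, Shandong, M.S. candidate, CCF member; research interests: natural language processing, knowledge graphs. SUN Hao (born 1995), male, from Xuzhou, Jiangsu, M.S. candidate, CCF member; research interests: natural language processing, knowledge graphs.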

Abstract

Abstract: Existing word vectors represent low-frequency words poorly and tend to conflate the distinct senses of a word, and existing disambiguation models cannot distinguish polysemous words accurately. To address these problems, a multi-feature fusion disambiguation method based on fused word vector representation was proposed. In this method, word vectors represented by HowNet sememes were fused with word vectors generated by Word2vec (Word to vector) to supplement the polysemy information of words and improve the representation quality of low-frequency words. Firstly, the cosine similarity between the entity to be disambiguated and each candidate entity was calculated as their entity similarity. Secondly, a clustering algorithm and the HowNet knowledge base were used to obtain the entity category feature similarity. Thirdly, an improved Latent Dirichlet Allocation (LDA) topic model was used to extract topic keywords, from which the entity topic feature similarity was calculated. Finally, word sense disambiguation of polysemous words was achieved by weighted fusion of the above three feature similarities. Experimental results on a test set from the Tibet animal husbandry domain show that the accuracy of the proposed method (90.1%) is 7.6 percentage points higher than that of a typical graph model disambiguation method.
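The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state how the two kinds of word vectors are fused or what weights are used, so the convex combination in `fuse_vectors`, the `alpha` parameter, the default `weights`, and all function names are hypothetical. The category and topic similarities are passed in as precomputed numbers, standing in for the clustering/HowNet and LDA steps.

```python
import math
from typing import Dict, List, Sequence, Tuple


def fuse_vectors(sememe_vec: Sequence[float], w2v_vec: Sequence[float],
                 alpha: float = 0.5) -> List[float]:
    """Assumed fusion rule: convex combination of two equal-length vectors."""
    if len(sememe_vec) != len(w2v_vec):
        raise ValueError("vectors must have the same dimension")
    return [alpha * s + (1.0 - alpha) * w for s, w in zip(sememe_vec, w2v_vec)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fused_score(entity_sim: float, category_sim: float, topic_sim: float,
                weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted fusion of the three feature similarities (weights are placeholders)."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    w_entity, w_category, w_topic = weights
    return w_entity * entity_sim + w_category * category_sim + w_topic * topic_sim


def disambiguate(mention_vec: Sequence[float],
                 candidates: Dict[str, Tuple[Sequence[float], float, float]],
                 weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)):
    """Return the best candidate name and all fused scores.

    candidates maps a name to (fused vector, category similarity, topic similarity).
    """
    scores = {
        name: fused_score(cosine_similarity(mention_vec, vec), cat, top, weights)
        for name, (vec, cat, top) in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

In use, the mention vector and each candidate vector would first be built with `fuse_vectors`, then `disambiguate` selects the candidate with the highest fused score. In practice the weights would need to be tuned on a development set.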

Key words: disambiguation, sememe, word vector fusion, feature fusion, polysemy

CLC Number: TP391.1

WANG Wei, ZHAO Erping, CUI Zhiyuan, SUN Hao. Disambiguation method of multi-feature fusion based on HowNet sememe and Word2vec word embedding representation[J]. Journal of Computer Applications, 2021, 41(8): 2193-2198.
