Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (8): 2193-2198. DOI: 10.11772/j.issn.1001-9081.2020101625

Topic: Artificial intelligence

Disambiguation method of multi-feature fusion based on HowNet sememe and Word2vec word embedding representation

WANG Wei, ZHAO Erping, CUI Zhiyuan, SUN Hao

  1. College of Information Engineering, Xizang Minzu University, Xianyang, Shaanxi 712082, China
  • Received: 2020-10-20 Revised: 2020-12-29 Online: 2021-01-27 Published: 2021-08-10
  • Corresponding author: ZHAO Erping
  • About the authors: WANG Wei, born in 1996 in Yangzhou, Jiangsu, M.S. candidate, CCF member. His research interests include natural language processing and knowledge graphs. ZHAO Erping, born in 1976 in Binxian, Shaanxi, M.S., associate professor, CCF member. His research interests include big data and knowledge graphs. CUI Zhiyuan, born in 1997 in Weifang, Shandong, M.S. candidate, CCF member. His research interests include natural language processing and knowledge graphs. SUN Hao, born in 1995 in Xuzhou, Jiangsu, M.S. candidate, CCF member. His research interests include natural language processing and knowledge graphs.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61762082) and the Tibet Autonomous Region Science and Technology Program (XZ202001ZY0055G).

Abstract: Existing word vectors represent low-frequency words poorly and tend to conflate the distinct senses of a word, and existing disambiguation models cannot accurately distinguish polysemous words. To address these problems, a multi-feature fusion disambiguation method based on fused word vector representation was proposed. In this method, word vectors represented by HowNet sememes were fused with word vectors generated by Word2vec (Word to vector) to complete the polysemy information of words and to improve the representation quality of low-frequency words. Firstly, the cosine similarity between the entity to be disambiguated and the candidate entity was calculated to obtain the similarity between them. Secondly, a clustering algorithm and the HowNet knowledge base were used to obtain the entity category feature similarity. Then, an improved Latent Dirichlet Allocation (LDA) topic model was used to extract topic keywords, from which the entity topic feature similarity was calculated. Finally, word sense disambiguation of polysemous words was achieved by weighted fusion of these three types of feature similarity. Experimental results on a test set from the Tibet animal husbandry domain show that the accuracy of the proposed method (90.1%) is 7.6 percentage points higher than that of a typical graph-model disambiguation method.

Key words: disambiguation, sememe, word vector fusion, feature fusion, polysemy
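
As a rough illustration of the final step summarized in the abstract, the sketch below computes a cosine similarity between two vectors and combines it with a category similarity and a topic similarity by a weighted sum. It is not the authors' implementation: the function names, the weights (0.5, 0.3, 0.2), the toy vectors, and the precomputed category and topic scores are all hypothetical, chosen only to make the example run. The abstract does not state how the weights are set or how the HowNet and Word2vec vectors are fused, so neither is modeled here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def fuse_similarities(entity_sim, category_sim, topic_sim, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three feature similarities.

    The default weights are hypothetical placeholders, not values from the paper.
    """
    w_entity, w_category, w_topic = weights
    return w_entity * entity_sim + w_category * category_sim + w_topic * topic_sim


def disambiguate(mention_vec, candidates, weights=(0.5, 0.3, 0.2)):
    """Return (name, score) of the candidate with the highest fused score.

    `candidates` maps a candidate name to a tuple
    (vector, category_sim, topic_sim); the last two values are assumed
    to have been computed elsewhere.
    """
    best_name, best_score = None, float("-inf")
    for name, (vec, category_sim, topic_sim) in candidates.items():
        entity_sim = cosine_similarity(mention_vec, vec)
        score = fuse_similarities(entity_sim, category_sim, topic_sim, weights)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Toy data with invented numbers, for illustration only.
mention = [1.0, 0.0, 1.0]
candidates = {
    "sense_a": ([1.0, 0.0, 1.0], 0.2, 0.1),
    "sense_b": ([0.0, 1.0, 0.0], 0.9, 0.9),
}
best, score = disambiguate(mention, candidates)
print(best, round(score, 2))
```

With these toy inputs the entity similarity dominates, so the first candidate is selected even though the second has higher category and topic scores. Changing the weights changes that outcome, which is why the choice of weights matters in a fusion scheme of this kind.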
