Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (5): 1327-1333. DOI: 10.11772/j.issn.1001-9081.2017112623

• Data Science and Technology •

Similarity search based on semantic features of bibliographic information network

QIU Qingyu1, LI Jing2, QUAN Bing2, TONG Chao2, ZHANG Lijun3, ZHANG Haixian1

  1. School of Computer Science, Sichuan University, Chengdu Sichuan 610065, China;
  2. China Mobile (Suzhou) Software Technology Company Limited, Suzhou Jiangsu 215000, China;
  3. Chengdu Ruibeiyingte Information Technology Company Limited, Chengdu Sichuan 610041, China
  • Received: 2017-11-06 Revised: 2017-12-07 Online: 2018-05-10 Published: 2018-05-24
  • Contact: ZHANG Haixian
  • About the authors: QIU Qingyu (1994-), male, born in Mudanjiang, Heilongjiang, M.S. candidate, CCF member; research interests: machine learning, data mining. LI Jing (1988-), female, born in Wuxi, Jiangsu, engineer, M.S.; research interests: big data, artificial intelligence. QUAN Bing (1988-), male, born in Fuzhou, Jiangxi, engineer, M.S.; research interests: big data, artificial intelligence. TONG Chao (1988-), male, born in Dalian, Liaoning, M.S.; research interests: big data, machine learning. ZHANG Lijun (1978-), female, born in Jianyang, Sichuan, engineer, M.S.; research interests: machine learning, data mining. ZHANG Haixian (1980-), female, born in Nanyang, Henan, associate professor, Ph.D.; research interests: deep neural networks.
  • Supported by:
    This work is partially supported by the Ministry of Education-China Mobile Research Fund (MCM20160307), the Sichuan Province Science and Technology Innovation Talent Project and the International Cooperation Project of Chengdu Municipal Science and Technology Bureau (2016-GH02-00048-HZ, 2015-GH02-00041-HZ).

Abstract: A bibliographic information network is a typical heterogeneous information network, and similarity search over such networks is an active research topic in graph mining. However, existing methods rely mainly on meta paths or meta structures and do not consider the semantic features of the nodes themselves, which biases the search results. To address this gap, a vector-based method for extracting semantic features from bibliographic information networks is proposed, and a vector-based node similarity measure called VSim is designed and implemented. In addition, a semantic-feature-based similarity search algorithm, VPSim (Similarity computation Based on Vector and meta Path), is designed by incorporating meta paths. To improve the execution efficiency of the algorithm, a pruning strategy tailored to the characteristics of bibliographic network data is designed. Experiments on real-world data show that VSim is suitable for finding entities with similar semantic features, and that VPSim is effective, efficient, and highly scalable.

Key words: bibliographic information network, similarity search, graph mining, meta path, semantic features

CLC Number:
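
The abstract names two ingredients, a vector-based node similarity and a meta-path-based similarity, plus a pruning step, but it does not give their formulas. The following is therefore only an illustrative sketch under stated assumptions, not the authors' VSim or VPSim: cosine similarity stands in for the vector-based measure, the well-known PathSim score stands in for the meta-path measure, their product is a hypothetical way of combining them, and the pruning rule (skip candidates whose vector similarity does not exceed a threshold) is likewise an assumption. All function names and the toy data are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def path_count(adjacency, x, y):
    """Number of path instances x -> (middle type) -> y on a symmetric meta path."""
    return sum(a * b for a, b in zip(adjacency[x], adjacency[y]))

def pathsim(adjacency, x, y):
    """PathSim score: 2 * paths(x, y) / (paths(x, x) + paths(y, y))."""
    denom = path_count(adjacency, x, x) + path_count(adjacency, y, y)
    if denom == 0:
        return 0.0
    return 2.0 * path_count(adjacency, x, y) / denom

def search(query, features, adjacency, top_k=3, min_vec_sim=0.0):
    """Rank candidates by vector similarity times path similarity (assumed combination)."""
    scored = []
    for node in range(len(features)):
        if node == query:
            continue
        vec_sim = cosine_similarity(features[query], features[node])
        # Assumed pruning rule: drop semantically unrelated candidates
        # before paying for the path-count computation.
        if vec_sim <= min_vec_sim:
            continue
        scored.append((node, vec_sim * pathsim(adjacency, query, node)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy data: 3 authors, 2 venues. adjacency[i][j] = papers of author i in venue j.
adjacency = [[2, 1], [1, 2], [0, 3]]
# Hypothetical semantic feature vectors, one per author.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

results = search(0, features, adjacency)
print(results)
```

In this toy example, author 2 shares venues with author 0 but has an orthogonal feature vector, so it is pruned before any path counting; only author 1 is returned. A real system would need the feature extraction and pruning strategy described in the full paper.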