计算机应用 ›› 2016, Vol. 36 ›› Issue (7): 1993-1997.DOI: 10.11772/j.issn.1001-9081.2016.07.1993

• 大数据 • 上一篇    下一篇

基于卡方分布的高维数据相似性连接查询算法

马友忠1,2, 贾世杰1, 张永新1   

  1. 1. 洛阳师范学院 信息技术学院, 河南 洛阳 471022;
    2. 中原经济区智慧旅游河南省协同创新中心, 河南 洛阳 471022
  • 收稿日期:2015-10-26 修回日期:2015-12-20 出版日期:2016-07-10 发布日期:2016-07-14
  • 通讯作者: 马友忠
  • 作者简介:马友忠(1981-),男,河南项城人,讲师,博士,CCF会员,主要研究方向:云计算与大数据处理、Web数据管理;贾世杰(1982-),男,河南洛阳人,讲师,博士,主要研究方向:智慧协同网络、多媒体通信;张永新(1980-),男,河南洛阳人,讲师,博士,主要研究方向:图像融合。
  • 基金资助:
    国家自然科学基金资助项目(61501216,61272015);河南省科技攻关计划项目(152102210332,152102210331);中原经济区智慧旅游河南省协同创新中心2015年度开放课题(2015-ZHLV-009)。

Chi-square distribution based similarity join query algorithm on high-dimensional data

MA Youzhong1,2, JIA Shijie1, ZHANG Yongxin1   

  1. 1. School of Information Technology, Luoyang Normal University, Luoyang Henan 471022, China;
    2. Central Plains Economic Zone Wisdom Tourism Collaborative Innovation Center in Henan Province, Luoyang Henan 471022, China
  • Received:2015-10-26 Revised:2015-12-20 Online:2016-07-10 Published:2016-07-14
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61501216, 61272015), the Science and Technology Project of Henan Province (152102210332, 152102210331), the Open Project of Central Plains Economic Zone Wisdom Tourism Collaborative Innovation Center in Henan Province (2015-ZHLV-009).

摘要: 为了解决高维数据相似性连接查询中存在的维度灾难和计算代价高等问题,基于p-稳态分布,将高维数据映射到低维空间。根据卡方分布的性质,证明了如果低维空间的距离大于,则原始空间距离大于ε的概率具有一定的下界,从而可以在低维空间以较低的计算代价进行有效过滤。在此基础上,提出了基于卡方分布的高维数据相似性连接查询算法。为了进一步提高查询效率,提出了基于双重过滤的高维数据相似性连接查询算法。利用真实数据集进行了实验,实验结果表明所提方法具有较好的性能。基于卡方分布的相似性连接查询算法召回率可以达到90%以上。基于双重过滤的相似性连接查询算法可以进一步提高性能,但是会损失一定的召回率。对时间性能要求比较高、对召回率要求不太严格的查询任务可以采用基于双重过滤的相似性连接查询算法;反之,可以采用基于卡方分布的相似性连接查询算法。

关键词: 相似性连接查询, 高维数据, 卡方分布, p-稳态分布, 召回率

Abstract: To deal with the curse of dimensionality and costly computation problems existed in high-dimensional similarity join query, the high-dimensional data were mapped to low-dimensional space based on p-stable distribution. According the definition of chi-square distribution, a theorem was proved:if the distance of two points in low-dimensional space is greater than , the probability that the distance of two points in original space is greater than ε has a lower bound. So the effective filtering can be performed at relative low cost in the mapped space. A novel chi-square distribution-based similarity join query algorithm on high-dimensional data was proposed. In order to further improve the query efficiency, another similarity join query algorithm based on double filtering was also proposed. Comprehensive experiments were performed. The experimental results show that the proposed approaches have good performance. The recall of the chi-square distribution-based similarity join query algorithm is larger than 90%. The double filtering based similarity join query algorithm can further improve the efficiency, but it will lose some recall rate. Chi-square distribution based similarity join query algorithm is suitable for the query tasks which are critical of the query performance but not critical of the recall; otherwise, the similarity join query algorithm based on double filtering is favorable.

Key words: similarity join query, high-dimensional data, chi-square distribution, p-stable distribution, recall

中图分类号: