Chi-square distribution based similarity join query algorithm on high-dimensional data

doi:10.11772/j.issn.1001-9081.2016.07.1993

Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (7): 1993-1997. DOI: 10.11772/j.issn.1001-9081.2016.07.1993


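MA Youzhong¹,², JIA Shijie¹, ZHANG Yongxin¹

1. School of Information Technology, Luoyang Normal University, Luoyang Henan 471022, China;
2. Central Plains Economic Zone Wisdom Tourism Collaborative Innovation Center in Henan Province, Luoyang Henan 471022, China

Received: 2015-10-26 Revised: 2015-12-20 Online: 2016-07-14 Published: 2016-07-10
Corresponding author: MA Youzhong
About the authors: MA Youzhong (born 1981), male, from Xiangcheng, Henan; lecturer, Ph.D., CCF member; research interests: cloud computing and big data processing, Web data management. JIA Shijie (born 1982), male, from Luoyang, Henan; lecturer, Ph.D.; research interests: smart collaborative networks, multimedia communication. ZHANG Yongxin (born 1980), male, from Luoyang, Henan; lecturer, Ph.D.; research interest: image fusion.
Supported by:
The National Natural Science Foundation of China (61501216, 61272015); the Science and Technology Project of Henan Province (152102210332, 152102210331); the 2015 Open Project of the Central Plains Economic Zone Wisdom Tourism Collaborative Innovation Center in Henan Province (2015-ZHLV-009).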


Abstract

To address the curse of dimensionality and the high computational cost of similarity join queries on high-dimensional data, the data are first mapped to a low-dimensional space using the p-stable distribution. Based on the properties of the chi-square distribution, it is proved that if the distance between two points in the low-dimensional space is greater than kε, then the probability that their distance in the original space is greater than ε has a lower bound, so effective filtering can be performed in the low-dimensional space at relatively low cost. On this basis, a chi-square distribution based similarity join query algorithm for high-dimensional data is proposed. To further improve query efficiency, a second similarity join query algorithm based on double filtering is also proposed. Experiments on real datasets show that the proposed methods perform well. The recall of the chi-square distribution based algorithm exceeds 90%. The double-filtering algorithm further improves efficiency, at the cost of some recall. Query tasks that demand high time performance and tolerate lower recall can use the double-filtering algorithm; otherwise, the chi-square distribution based algorithm is preferable.

Keywords: similarity join query, high-dimensional data, chi-square distribution, p-stable distribution, recall
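As an illustration only, the filter-then-verify idea described in the abstract might be sketched as follows. Several things here are assumptions, not details taken from the paper: p = 2 (Gaussian random projections), Euclidean distance, a nested-loop self-join, and the hypothetical helper names `chi2_sf_even`, `make_projection`, `project`, `dist` and `similarity_join`. Under these assumptions, for a pair at original distance r, the squared projected distance divided by r² follows a chi-square distribution with m degrees of freedom, so a pair with r ≤ ε is pruned by the low-dimensional test with probability at most P(χ²_m > k²). The closed form used for that tail probability is valid only for even m. The double-filtering variant is not shown.

```python
import math
import random


def chi2_sf_even(x, m):
    """P(X > x) for X ~ chi-square with m degrees of freedom (even m only)."""
    if m <= 0 or m % 2 != 0:
        raise ValueError("this closed form needs a positive even m")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, m // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return math.exp(-half) * total


def make_projection(d, m, rng):
    """m random directions with i.i.d. N(0, 1) entries (assumed 2-stable case)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def project(point, directions):
    """Map a d-dimensional point to m dimensions via dot products."""
    return [sum(a * x for a, x in zip(row, point)) for row in directions]


def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity_join(points, eps, k, directions):
    """Self-join: pairs (i, j), i < j, that survive the low-dimensional
    filter and are then verified to be within eps in the original space."""
    low = [project(p, directions) for p in points]
    result = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(low[i], low[j]) > k * eps:
                continue  # pruned using only the m-dimensional vectors
            if dist(points[i], points[j]) <= eps:
                result.append((i, j))
    return result


# Small demo on synthetic data (hypothetical parameter values).
rng = random.Random(0)
d, m, eps, k = 32, 8, 1.0, 5.0
points = [[rng.random() * 0.3 for _ in range(d)] for _ in range(60)]
directions = make_projection(d, m, rng)
pairs = similarity_join(points, eps, k, directions)
print(len(pairs), "pairs within eps")
print("per-pair false-dismissal bound:", chi2_sf_even(k * k, m))
```

Because every surviving pair is verified in the original space, the sketch never reports a false positive; it can only miss true pairs, which is why recall is the quality measure. Raising k lowers the chance of a miss but lets more candidate pairs through to the expensive verification step.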

Chinese Library Classification (CLC) number: TP311.13

Cite this article:

MA Youzhong, JIA Shijie, ZHANG Yongxin. Chi-square distribution based similarity join query algorithm on high-dimensional data[J]. Journal of Computer Applications, 2016, 36(7): 1993-1997.

