Research progress in similarity join query of big data

Journal of Computer Applications, 2018, Vol. 38, Issue (4): 978-986. DOI: 10.11772/j.issn.1001-9081.2017092202

MA Youzhong^1,2, ZHANG Zhihui^3, LIN Chunjie^1,2

1. School of Information Technology, Luoyang Normal University, Luoyang Henan 471934, China;
2. Henan Key Laboratory for Big Data Processing and Analytics of Electronic Commerce (Luoyang Normal University), Luoyang Henan 471934, China;
3. Department of Computer, Luoyang Railway Information Engineering School, Luoyang Henan 471900, China

Received: 2017-09-11 Revised: 2017-11-27 Online: 2018-04-09 Published: 2018-04-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61602231), the National Key R&D Plan Project (2016YFE0104600), the Science and Technology Open Cooperation Project of Henan Province (172106000077, 152106000048), and the Key Scientific Research Project of Higher Education of Henan Province (16A520022).

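Corresponding author: MA Youzhong
About the authors: MA Youzhong (born 1981), male, native of Xiangcheng, Henan, associate professor, Ph.D., CCF member; research interests: big data, Web data management. ZHANG Zhihui (born 1979), male, native of Luoyang, Henan, lecturer, M.S.; research interests: data mining. LIN Chunjie (born 1981), male (Korean ethnicity), native of Jilin City, Jilin, lecturer, M.S.; research interests: data mining, rough sets.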

Abstract

Abstract: To support a thorough understanding of research progress in similarity join query technology for big data, and to promote its application in image clustering, entity resolution, similar document detection, and similar trajectory retrieval, a comprehensive survey of this technology was conducted. First, the basic concepts of similarity join queries were introduced. Then, research on big data similarity joins for different data types, including sets, vectors, spatial data, probabilistic data, strings, and graphs, was reviewed, and the advantages and disadvantages of the approaches were analyzed and summarized. Finally, open challenges and future research priorities in big data similarity join queries were identified.

Key words: big data, similarity join query, MapReduce framework, K-Nearest Neighbors (KNN)

CLC Number:

TP311.13

MA Youzhong, ZHANG Zhihui, LIN Chunjie. Research progress in similarity join query of big data[J]. Journal of Computer Applications, 2018, 38(4): 978-986.

