Journal of Computer Applications, 2018, Vol. 38, Issue (4): 978-986. DOI: 10.11772/j.issn.1001-9081.2017092202
MA Youzhong1,2, ZHANG Zhihui3, LIN Chunjie1,2
Received: 2017-09-11
Revised: 2017-11-27
Online: 2018-04-10
Published: 2018-04-09
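Corresponding author: MA Youzhong
About the authors: MA Youzhong (born 1981), male, from Xiangcheng, Henan, associate professor, Ph.D., CCF member; main research interests: big data and Web data management. ZHANG Zhihui (born 1979), male, from Luoyang, Henan, lecturer, M.S.; main research interest: data mining. LIN Chunjie (born 1981), male (Korean ethnicity), from Jilin City, Jilin, lecturer, M.S.; main research interests: data mining and rough sets.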
MA Youzhong, ZHANG Zhihui, LIN Chunjie. Research progress in similarity join query of big data[J]. Journal of Computer Applications, 2018, 38(4): 978-986.
马友忠, 张智辉, 林春杰. 大数据相似性连接查询技术研究进展[J]. 计算机应用, 2018, 38(4): 978-986.
URL: http://www.joca.cn/EN/10.11772/j.issn.1001-9081.2017092202