Journal of Computer Applications, 2018, Vol. 38, Issue (4): 978-986. DOI: 10.11772/j.issn.1001-9081.2017092202

Research progress in similarity join query of big data

MA Youzhong1,2, ZHANG Zhihui3, LIN Chunjie1,2   

  1. School of Information Technology, Luoyang Normal University, Luoyang, Henan 471934, China;
  2. Henan Key Laboratory for Big Data Processing and Analytics of Electronic Commerce (Luoyang Normal University), Luoyang, Henan 471934, China;
  3. Department of Computer, Luoyang Railway Information Engineering School, Luoyang, Henan 471900, China
  • Received: 2017-09-11 Revised: 2017-11-27 Online: 2018-04-10 Published: 2018-04-09
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61602231), the National Key R&D Plan Project (2016YFE0104600), the Science and Technology Open Cooperation Project of Henan Province (172106000077, 152106000048), and the Key Scientific Research Project of Higher Education of Henan Province (16A520022).

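  • Corresponding author: MA Youzhong
  • About the authors: MA Youzhong (马友忠, born 1981), male, from Xiangcheng, Henan, is an associate professor with a Ph.D. and a CCF member; his research interests include big data and Web data management. ZHANG Zhihui (张智辉, born 1979), male, from Luoyang, Henan, is a lecturer with a master's degree; his research interest is data mining. LIN Chunjie (林春杰, born 1981), male (Korean ethnicity), from Jilin City, Jilin, is a lecturer with a master's degree; his research interests include data mining and rough sets.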

Abstract: To provide a thorough understanding of research progress in similarity join query technology for big data, and to promote its wider application in image clustering, entity resolution, similar document detection, and similar trajectory retrieval, a comprehensive survey of this technology was conducted. Firstly, the basic concepts of similarity join queries were introduced. Then, existing research on big data similarity joins was reviewed by data type, covering sets, vectors, spatial data, probabilistic data, strings, and graphs, and the advantages and disadvantages of the approaches were analyzed and summarized. Finally, several challenging research problems and future research priorities in big data similarity join queries were pointed out.

Key words: big data, similarity join query, MapReduce framework, K-Nearest Neighbors (KNN)

