Partition-based incremental processing method for string similarity join

doi:10.11772/j.issn.1001-9081.2016.01.0027

Journal of Computer Applications, 2016, Vol. 36, Issue (1): 27-32.

• The 32nd National Database Conference of China (NDBC 2015) •

YAN Cairong, ZHU Bin, WANG Jian, HUANG Yongfeng

School of Computer Science and Technology, Donghua University, Shanghai 201620, China

Received: 2015-07-12  Revised: 2015-08-08  Online: 2016-01-10  Published: 2016-01-09
Corresponding author: YAN Cairong (born 1978), female, from Xiantao, Hubei; associate professor, Ph.D.; research interests: parallel computing, distributed computing, big data processing.
About the authors: ZHU Bin (born 1990), male, from Ji'an, Jiangxi; M.S. candidate; research interests: parallel computing, distributed computing, big data processing. WANG Jian (born 1989), male, from Xinyang, Henan; M.S. candidate; research interests: parallel computing, distributed computing, big data processing. HUANG Yongfeng (born 1971), male, from Tai'an, Shandong; associate professor, Ph.D.; research interests: data mining, machine learning, image processing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61402100) and the Fundamental Research Funds for the Central Universities (2232013D3-15).

Abstract

Abstract: String similarity join is a basic operation in data quality management and a key step in discovering the value of data. Existing methods cannot meet the demand for incremental processing of big data, so an incremental string similarity join method for streaming data, called Inc-Join, is proposed, and its indexing technique is optimized. The method builds on the Pass-Join string join algorithm. Firstly, each string is divided into several disjoint substrings by a partition technique; secondly, an inverted index of the strings is built and kept as the state; finally, newly arriving data only need to be compared against the state, and the state is updated after each join operation. The experimental results show that, without affecting join accuracy, Inc-Join reduces the number of repeated matches between long and short strings to √n, where n is the number of matches under the batch processing model. On three datasets, the response time of the batch-processing similarity join was 1 to 4.7 times that of Inc-Join and tended to increase sharply. The minimum elapsed time of the optimized Inc-Join was only 3/4 of that of the original Inc-Join, and this share became smaller and smaller as the number of strings increased. In addition, the optimized Inc-Join does not need to save the state, which further reduces the time and space cost of the algorithm.

Key words: string similarity join, incremental processing, partition, string matching, inverted index
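
The three steps named in the abstract (partition, inverted index kept as state, probe-then-update for new data) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names (`IncJoinSketch`, `segment_layout`, `add_batch`) are hypothetical, the similarity measure is assumed to be edit distance with threshold `tau` as in Pass-Join, and substring selection uses a simple position window of ±`tau` rather than any optimized selection scheme. The stateless optimized variant mentioned in the abstract is not covered.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def segment_layout(length: int, tau: int):
    """Split [0, length) into tau + 1 disjoint (start, size) segments.

    Even partition: the last (length mod (tau + 1)) segments are one longer.
    """
    k = tau + 1
    base, rem = divmod(length, k)
    layout, start = [], 0
    for i in range(k):
        size = base + (1 if i >= k - rem else 0)
        layout.append((start, size))
        start += size
    return layout


class IncJoinSketch:
    """Hypothetical incremental self-join under an edit-distance threshold."""

    def __init__(self, tau: int):
        self.tau = tau
        self.strings = []               # every string seen so far
        # State: (string length, segment number, segment text) -> string ids
        self.index = defaultdict(list)

    def _candidates(self, r: str):
        """Ids of indexed strings sharing at least one segment with r."""
        found = set()
        for length in range(max(0, len(r) - self.tau), len(r) + self.tau + 1):
            layout = segment_layout(length, self.tau)
            for seg_no, (start, size) in enumerate(layout):
                # A surviving segment can shift by at most tau positions.
                lo = max(0, start - self.tau)
                hi = min(len(r) - size, start + self.tau)
                for p in range(lo, hi + 1):
                    key = (length, seg_no, r[p:p + size])
                    found.update(self.index.get(key, ()))
        return found

    def add_batch(self, batch):
        """Join new strings against the state, then update the state."""
        results = []
        for r in batch:
            for sid in sorted(self._candidates(r)):
                if edit_distance(self.strings[sid], r) <= self.tau:
                    results.append((self.strings[sid], r))
            rid = len(self.strings)
            self.strings.append(r)
            for seg_no, (start, size) in enumerate(segment_layout(len(r), self.tau)):
                self.index[(len(r), seg_no, r[start:start + size])].append(rid)
        return results
```

With `tau = 1`, calling `add_batch(["vldb", "pvldb", "icde"])` and then `add_batch(["vldbj", "icdt"])` on one instance compares the second batch only against the stored index instead of re-joining all five strings. The filter rests on the pigeonhole argument: at most `tau` edits cannot touch all `tau + 1` segments, so a similar pair must share at least one segment, and exact verification with `edit_distance` removes false candidates.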

CLC number:

TP311

YAN Cairong, ZHU Bin, WANG Jian, HUANG Yongfeng. Partition-based incremental processing method for string similarity join[J]. Journal of Computer Applications, 2016, 36(1): 27-32.

References

[1] LI G, DENG D, WANG J, et al. Pass-Join: a partition-based method for similarity joins [J]. Proceedings of the VLDB endowment, 2011, 5(3): 253-264.
[2] JIANG Y, DENG D, WANG J, et al. Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints [C]// Proceedings of the Joint EDBT/ICDT 2013 Workshops. New York: ACM, 2013: 341-348.
[3] RONG C T, XU T R, DU X Y. Partition-based set similarity join [J]. Journal of computer research and development, 2012, 49(10): 2066-2076. (in Chinese)
[4] CAO H, LUO J Z, CHEN Y C. A data-partition based disk algorithm for string join [J]. Intelligent computer and applications, 2012, 2(5): 31-34. (in Chinese)
[5] LU J, LIN C, WANG W, et al. String similarity measures and joins with synonyms [C]// Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2013: 373-384.
[6] ARASU A, GANTI V, KAUSHIK R. Efficient exact set-similarity joins [C]// Proceedings of the 32nd International Conference on Very Large Data Bases. [S.l.]: VLDB Endowment, 2006: 918-929.
[7] XIAO C, WANG W, LIN X. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints [J]. Proceedings of the VLDB endowment, 2008, 1(1): 933-944.
[8] WANG J, FENG J, LI G. Trie-Join: efficient trie-based string similarity joins with edit-distance constraints [J]. Proceedings of the VLDB endowment, 2010, 3(1/2): 1219-1230.
[9] METWALLY A, FALOUTSOS C. V-SMART-Join: a scalable MapReduce framework for all-pair similarity joins of multisets and vectors [J]. Proceedings of the VLDB endowment, 2012, 5(8): 704-715.
[10] DONG X, SRIVASTAVA D. Big data integration [C]// ICDE 2013: Proceedings of the 2013 IEEE 29th International Conference on Data Engineering. Piscataway, NJ: IEEE, 2013: 1245-1248.
[11] CHRISTEN P. A survey of indexing techniques for scalable record linkage and deduplication [J]. IEEE transactions on knowledge and data engineering, 2012, 24(9): 1537-1555.
[12] CHEN Q, HSU M. Continuous MapReduce for In-DB stream analytics [C]// OTM 2010: Proceedings of the 2010 International Conference on the Move to Meaningful Internet Systems. Berlin: Springer, 2010: 16-34.
[13] YAN C, YANG X, YU Z, et al. IncMR: incremental data processing based on MapReduce [C]// CLOUD 2012: Proceedings of the 2012 IEEE 5th International Conference on Cloud Computing. Piscataway, NJ: IEEE, 2012: 534-541.
[14] LOGOTHETIS D, YOCUM K. Ad-Hoc data processing in the cloud [J]. Proceedings of the VLDB endowment, 2008, 1(2): 1472-1475.
[15] DE FRANCISCI MORALES G, GIONIS A, SOZIO M. Social content matching in MapReduce [J]. Proceedings of the VLDB endowment, 2011, 4(7): 460-469.
[16] THUSOO A, SARMA J S, JAIN N, et al. Hive—a petabyte scale data warehouse using Hadoop [C]// ICDE 2010: Proceedings of the 2010 IEEE 26th International Conference on Data Engineering. Piscataway, NJ: IEEE, 2010: 996-1005.
[17] PENG D, DABEK F. Large-scale incremental processing using distributed transactions and notifications [C]// OSDI'10: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2010: 1-15.
[18] XIAO C, WANG W, LIN X, et al. Efficient similarity joins for near-duplicate detection [J]. ACM transactions on database systems, 2011, 36(3): 15.
[19] HE Q, DU C, WANG Q, et al. A parallel incremental extreme SVM classifier [J]. Neurocomputing, 2011, 74(16): 2532-2540.
[20] LI L, WANG H Z, LI J Z, et al. Ed-Sjoin: an optimal algorithm for similarity joins with edit distance constraints [J]. Journal of computer research and development, 2009, 46(z2): 319-325. (in Chinese)
