Partition-based incremental processing method for string similarity join

doi:10.11772/j.issn.1001-9081.2016.01.0027

Journal of Computer Applications, 2016, Vol. 36, Issue (1): 27-32. DOI: 10.11772/j.issn.1001-9081.2016.01.0027

YAN Cairong, ZHU Bin, WANG Jian, HUANG Yongfeng

School of Computer Science and Technology, Donghua University, Shanghai 201620, China

Received: 2015-07-12  Revised: 2015-08-08  Online: 2016-01-09  Published: 2016-01-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61402100) and the Fundamental Research Funds for the Central Universities (2232013D3-15).

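Corresponding author: YAN Cairong (燕彩蓉), born 1978, female, from Xiantao, Hubei, associate professor, Ph.D. Research interests: parallel computing, distributed computing, big data processing.
About the authors: ZHU Bin (朱斌), born 1990, male, from Ji'an, Jiangxi, master's student. Research interests: parallel computing, distributed computing, big data processing. WANG Jian (王健), born 1989, male, from Xinyang, Henan, master's student. Research interests: parallel computing, distributed computing, big data processing. HUANG Yongfeng (黄永锋), born 1971, male, from Tai'an, Shandong, associate professor, Ph.D. Research interests: data mining, machine learning, image processing.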

Abstract

String similarity join is an essential operation in data quality management and a key step in discovering the value of data. Because existing methods cannot meet the incremental processing demands of the big data era, an incremental string similarity join method for streaming data, called Inc-Join, is proposed, and its string index technique is optimized. The method builds on the Pass-Join string join algorithm. First, each string is divided into several disjoint substrings using a partition technique. Second, an inverted index of the strings is created and kept as the state. Finally, when new data arrives, the similarity calculation is performed against the state only, and the state is updated after each string similarity join operation. The experimental results show that Inc-Join reduces the number of repeated matches between short and long strings to √n (where n is the number of matches under the batch processing model) without affecting join accuracy. When three different datasets were processed, the elapsed time of string similarity join under the batch processing model was 1 to 4.7 times that of Inc-Join, and it tended to increase sharply. At its minimum, the elapsed time of the optimized Inc-Join was only 3/4 of that of the original Inc-Join, and its share of the original elapsed time kept shrinking as the number of strings increased. Because the optimized Inc-Join does not need to save the state, it further reduces the time and space cost of Inc-Join.
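The three steps in the abstract (partition, inverted index as state, join new data against the state) can be illustrated with a minimal Python sketch. It is not the authors' Inc-Join implementation. It assumes an edit-distance threshold `tau` and the even partition into `tau + 1` segments used by Pass-Join; the names `IncrementalJoin`, `segments`, and `add_batch` are hypothetical, and candidate selection uses a simple position window of width `tau` rather than any tighter scheme.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def segments(length: int, tau: int):
    """Even partition: (start, size) of tau + 1 disjoint segments covering [0, length)."""
    k = tau + 1
    base, extra = divmod(length, k)
    out, start = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)  # last `extra` segments are longer
        out.append((start, size))
        start += size
    return out


class IncrementalJoin:
    """Hypothetical illustration: the inverted index is the state kept between batches."""

    def __init__(self, tau: int):
        self.tau = tau
        self.strings = []
        # state: (string length, segment number, segment text) -> ids of indexed strings
        self.index = defaultdict(list)

    def add_batch(self, batch):
        """Join each new string against the state, then fold it into the state."""
        results = []
        for s in batch:
            sid = len(self.strings)
            for rid in sorted(self._candidates(s)):
                # verification step: only candidates sharing a segment are checked
                if edit_distance(self.strings[rid], s) <= self.tau:
                    results.append((self.strings[rid], s))
            self.strings.append(s)
            for i, (start, size) in enumerate(segments(len(s), self.tau)):
                self.index[(len(s), i, s[start:start + size])].append(sid)
        return results

    def _candidates(self, s):
        """Indexed strings with at least one segment occurring in s near its position."""
        found, tau = set(), self.tau
        for length in range(max(0, len(s) - tau), len(s) + tau + 1):
            for i, (start, size) in enumerate(segments(length, tau)):
                lo = max(0, start - tau)
                hi = min(len(s) - size, start + tau)
                for p in range(lo, hi + 1):
                    found.update(self.index.get((length, i, s[p:p + size]), ()))
        return found
```

With `join = IncrementalJoin(tau=1)`, calling `join.add_batch(["kitten", "sitting"])` and then `join.add_batch(["mitten", "sittin"])` compares the second batch only against strings already in the index, so pairs from the first batch are not recomputed. The pigeonhole argument behind the candidate step is that at most `tau` edits can touch at most `tau` of the `tau + 1` segments, so any string within distance `tau` shares at least one unchanged segment.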

Key words: string similarity join, incremental processing, partition, string matching, inverted index


CLC Number: TP311

YAN Cairong, ZHU Bin, WANG Jian, HUANG Yongfeng. Partition-based incremental processing method for string similarity join[J]. Journal of Computer Applications, 2016, 36(1): 27-32.


