计算机应用 ›› 2016, Vol. 36 ›› Issue (1): 27-32.DOI: 10.11772/j.issn.1001-9081.2016.01.0027

• 第32届中国数据库学术会议(NDBC 2015) • 上一篇    下一篇

基于划分的增量式字符串相似性连接方法

燕彩蓉, 朱斌, 王健, 黄永锋   

  1. 东华大学 计算机科学与技术学院, 上海 201620
  • 收稿日期:2015-07-12 修回日期:2015-08-08 出版日期:2016-01-10 发布日期:2016-01-09
  • 通讯作者: 燕彩蓉(1978-),女,湖北仙桃人,副教授,博士,主要研究方向:并行计算、分布式计算、大数据处理
  • 作者简介:朱斌(1990-),男,江西吉安人,硕士研究生,主要研究方向:并行计算、分布式计算、大数据处理;王健(1989-),男,河南信阳人,硕士研究生,主要研究方向:并行计算、分布式计算、大数据处理;黄永锋(1971-),男,山东泰安人,副教授,博士,主要研究方向:数据挖掘、机器学习、图像处理。
  • 基金资助:
    国家自然科学基金资助项目(61402100);中央高校基本科研业务费专项(2232013D3-15)。

Partition-based incremental processing method for string similarity join

YAN Cairong, ZHU Bin, WANG Jian, HUANG Yongfeng   

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received:2015-07-12 Revised:2015-08-08 Online:2016-01-10 Published:2016-01-09
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61402100), the Fundamental Research Funds for the Central Universities (2232013D3-15).

摘要: 字符串相似性连接是数据质量管理的基本操作,也是数据价值发现的关键步骤。针对目前已有的方法不能满足面向大数据的增量式处理需求的问题,提出一种面向流式数据的增量式字符串相似性连接方法——Inc-Join,并对方法的索引技术进行了优化。该方法以Pass-Join字符串连接算法为基础,首先,采用字符串划分技术将字符串划分成多个互不相交的子串;然后,建立字符串的反向索引列表并将其作为状态;最后,新增数据只需根据状态进行相似性计算,每次连接操作结束后都对状态进行更新。实验结果表明,Inc-Join方法在不影响连接准确率的同时,有效将长、 短字符串重复匹配次数减少为√n(n是批处理方式的匹配次数)。 实验对3种数据集进行处理,发现使用批处理方式进行相似性连接的响应时间是Inc-Join的1至4.7倍,并呈现急剧递增的趋势;而且优化后Inc-Join方法的响应时间最小只占优化前的3/4,并随处理数据的增多所占比例越来越小。同时优化后的Inc-Join不需要保存状态,再一次减小了算法执行的时间和空间开销。

关键词: 字符串相似性连接, 增量处理, 划分, 字符串匹配, 反向索引

Abstract: String similarity join is an essential operation of data quality management and a key step to find the value of data. Now in the era of big data, since the existing methods can not meet the demands of incremental processing, an incremental string similarity join method oriented streaming data, called Inc-Join, was proposed. And the string index technique was optimized. Firstly, based on the Pass-Join string join algorithm, strings were divided into some disjoint substrings by utilizing partition technique; secondly, the inverted index of strings was created and acted as a state; finally, the similarity calculation was done according to the state when new data came, and the state would be updated after each operation of string similarity join. The experimental results show that Inc-Join method can reduce the number of reduplicate matching between short or long strings to √n(n is the number of matching with batching processing model) without affecting the join accuracy. The elapsed time of string similarity join with batching processing model was 1 to 4.7 times the time Inc-Join needs when three different datasets were processed, and it tended to increase sharply. And the minimum elapsed time of optimized Inc-Join only accounted for 3/4 of original elapsed time of Inc-Join. With the increasing number of strings, the elapsed time of optimized Inc-Join would account for less and less of proportion in original elapsed time. The state need not to be saved, so the optimized Inc-Join further reduces time and space cost of Inc-Join.

Key words: string similarity join, incremental processing, partition, string matching, inverted index

中图分类号: