Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (7): 1797-1800. DOI: 10.11772/j.issn.1001-9081.2016.07.1797


Fast deduplication for massive images

HAN Fengqing1, SONG Zhijian2, YU Rui3   

  1. College of Mathematics and Statistics, Chongqing Jiaotong University, Chongqing 400000, China;
  2. College of Information Science and Engineering, Chongqing Jiaotong University, Chongqing 400000, China;
  3. College of Traffic and Transportation, Chongqing Jiaotong University, Chongqing 400000, China
  • Received: 2015-12-28 Revised: 2016-03-18 Online: 2016-07-14 Published: 2016-07-10


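  • Corresponding author: SONG Zhijian (宋志坚)
  • About the authors: HAN Fengqing (韩逢庆, born 1968), male, from Chongqing, professor, Ph.D.; research interests: artificial intelligence, wavelet theory and applications, machine learning, data mining. SONG Zhijian (宋志坚, born 1990), male, from Baotou, Inner Mongolia, master's student; research interests: artificial intelligence, machine learning, data mining. YU Rui (余锐, born 1991), female, from Meishan, Sichuan, master's student; research interests: intelligent control, vehicle control.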

Abstract: To address the low efficiency of removing duplicate images from massive image collections, a parallel fast-deduplication technique based on image features is proposed. First, color, texture, shape, and other features are extracted to describe each image comprehensively. Second, a distance metric is used to measure the feature distance between images. Finally, duplicates are located quickly from these distances, using the idea that two points may be the same point if they are at the same distance from an arbitrary reference point. Analysis of the experimental data shows that the technique deduplicates images accurately and that a single computer with a quad-core i5 processor can process about 5 million images in roughly 10 minutes. Compared with ordinary pairwise comparison, the technique substantially shortens the computation time for massive image deduplication.

Key words: massive image, fast deduplication, parallelization, single-machine computing, image feature
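
The distance-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: feature extraction and parallelization are omitted, and the function name `find_duplicates`, the choice of the origin as the default reference point, and the tolerance `tol` are all assumptions made for illustration. Each feature vector is reduced to its distance from one reference point; by the triangle inequality, identical vectors must have equal reference distances, so only vectors that are adjacent in the sorted distance list need a full comparison. That full comparison is an added confirmation step, since equal reference distances only indicate that two points may be the same.

```python
import math


def find_duplicates(features, reference=None, tol=1e-9):
    """Return sorted index pairs (i, j), i < j, whose feature vectors coincide.

    Hypothetical helper: `features` is a list of equal-length numeric vectors
    that are assumed to have been extracted from the images beforehand.
    """
    n = len(features)
    if n == 0:
        return []
    # Assumption: use the origin as the reference point unless one is given.
    ref = reference if reference is not None else [0.0] * len(features[0])

    # One distance per image: cost O(n) instead of O(n^2) pairwise distances.
    dists = [math.dist(f, ref) for f in features]

    # Sorting brings candidates with (nearly) equal reference distance together.
    order = sorted(range(n), key=dists.__getitem__)

    pairs = []
    for a in range(n):
        i = order[a]
        b = a + 1
        # Only scan forward while the reference distances still match.
        while b < n and dists[order[b]] - dists[i] <= tol:
            j = order[b]
            # Equal reference distance is necessary but not sufficient,
            # so confirm with a full vector comparison.
            if math.dist(features[i], features[j]) <= tol:
                pairs.append((min(i, j), max(i, j)))
            b += 1
    return sorted(pairs)


if __name__ == "__main__":
    demo = [[1, 2, 3], [3, 2, 1], [1, 2, 3], [0, 0, 0], [3, 2, 1]]
    print(find_duplicates(demo))
```

In the demo data, `[1, 2, 3]` and `[3, 2, 1]` lie at the same distance from the origin but are different vectors, which is why the confirmation step is kept in this sketch.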

