Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (9): 2789-2794. DOI: 10.11772/j.issn.1001-9081.2019020792

• Frontier and comprehensive applications •

Duplicate detection algorithm for massive images based on pHash block detection

TANG Linchuan, DENG Siyu, WU Yanxue, WEN Liuying

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, China
  • Received:2019-03-22 Revised:2019-05-07 Online:2019-06-03 Published:2019-09-10
  • Corresponding author: WEN Liuying
  • About the authors: TANG Linchuan, born in 1993 in Chengdu, Sichuan, M. S. candidate. His research interests include active learning and recommender systems. DENG Siyu, born in 1993 in Suining, Sichuan, M. S. candidate. Her research interests include active learning. WU Yanxue, born in 1995 in Bazhong, Sichuan, M. S. candidate. His research interests include deep learning and feature learning. WEN Liuying, born in 1983 in Liuzhou, Guangxi, Ph. D., lecturer, CCF member. Her research interests include rough sets, attribute extraction and granular computing.
  • Supported by:

    Open Project of the Key Laboratory of Oceanographic Big Data Mining and Application of Zhejiang Province (OBDMA201601).

Duplicate detection algorithm for massive images based on pHash block detection

TANG Linchuan, DENG Siyu, WU Yanxue, WEN Liuying   

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, China
  • Received:2019-03-22 Revised:2019-05-07 Online:2019-06-03 Published:2019-09-10
  • Supported by:

    This work is partially supported by the Open Project of the Key Laboratory of Oceanographic Big Data Mining and Application of Zhejiang Province (OBDMA201601).

Abstract:

A large number of duplicate images in a database not only degrades the performance of the learner but also wastes considerable storage space. For massive image deduplication, a duplicate detection algorithm for massive images based on pHash block-based local detection was proposed. Firstly, the pHash values of all images were generated. Secondly, each pHash value was divided into several parts of equal length; if two images had identical values in any one pHash part, they might be duplicates. Finally, the transitivity of image duplication was discussed, and the algorithm was implemented separately for the transitive case and the non-transitive case. Experimental results show that the proposed algorithm is highly efficient for massive images: with the similarity threshold set to 13, the transitive algorithm detects duplicates among nearly 300000 images in only about 2 minutes with an accuracy of 53%.

Key words: duplicate image detection, massive data, perception Hashing (pHash), local detection, transitivity
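
The abstract describes a candidate search in which each pHash is split into equal-length parts and two images are flagged as possible duplicates whenever any part matches exactly. The following is a minimal sketch of that idea, assuming each image already has a 64-bit pHash given as a 16-character hex string (for example, from a perceptual-hashing library); the helper names split_blocks and candidate_pairs and the choice of 4 parts are illustrative assumptions, not details taken from the paper.

from collections import defaultdict
from itertools import combinations

NUM_BLOCKS = 4  # assumed split of a 64-bit pHash into 4 equal 16-bit parts

def split_blocks(phash_hex, num_blocks=NUM_BLOCKS):
    # Divide a hex pHash string into equal-length substrings (the "parts").
    step = len(phash_hex) // num_blocks
    return [phash_hex[i * step:(i + 1) * step] for i in range(num_blocks)]

def candidate_pairs(phashes):
    # phashes: dict mapping image id -> hex pHash string.
    # Images sharing at least one identical part fall into the same bucket,
    # so only these pairs need a full similarity comparison later.
    buckets = defaultdict(set)  # (part index, part value) -> set of image ids
    for img_id, h in phashes.items():
        for idx, part in enumerate(split_blocks(h)):
            buckets[(idx, part)].add(img_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

Bucketing on (part index, part value) avoids comparing every pair of images, which is what allows the search to scale to hundreds of thousands of images as reported in the abstract.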

Abstract:

The large number of duplicate images in a database not only affects the performance of the learner, but also consumes a lot of storage space. For massive image deduplication, a duplicate detection algorithm for massive images was proposed based on pHash (perception Hashing). Firstly, the pHash values of all images were generated. Secondly, each pHash value was divided into several parts of equal length. If the values of any one of the pHash parts of two images were equal, the two images might be duplicates. Finally, the transitivity of image duplication was discussed, and corresponding algorithms were proposed for the transitive case and the non-transitive case. Experimental results show that the proposed algorithms are effective in processing massive images. When the similarity threshold is 13, detecting duplicates among nearly 300000 images with the proposed transitive algorithm takes only about two minutes, with an accuracy of around 53%.

Key words: duplicate image detection, massive data, perception Hashing (pHash), block detection, transitivity
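
For the transitive case mentioned in the abstract, candidate pairs whose full Hamming distance falls within the similarity threshold (13 in the experiments) can be merged into the same duplicate group. Below is a hedged sketch that applies a union-find structure to the candidate pairs produced by the bucketing sketch above; the function names and the union-find choice are assumptions for illustration, not details reported in the paper.

THRESHOLD = 13  # similarity (Hamming distance) threshold used in the experiments

def hamming(h1, h2):
    # Bit-level Hamming distance between two equal-length hex pHash strings.
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

def duplicate_groups(phashes, pairs):
    # Transitive grouping: if a~b and b~c are duplicates, then a, b and c
    # end up in one group even when a and c were never compared directly.
    parent = {img_id: img_id for img_id in phashes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        if hamming(phashes[a], phashes[b]) <= THRESHOLD:
            parent[find(a)] = find(b)      # union the two groups

    groups = {}
    for img_id in phashes:
        groups.setdefault(find(img_id), []).append(img_id)
    return [g for g in groups.values() if len(g) > 1]

A non-transitive variant would instead report each qualifying pair on its own rather than merging pairs into larger groups.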
