Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (9): 2789-2794. DOI: 10.11772/j.issn.1001-9081.2019020792

• Frontier and comprehensive applications •

Duplicate detection algorithm for massive images based on pHash block detection

TANG Linchuan, DENG Siyu, WU Yanxue, WEN Liuying

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, China
  • Received:2019-03-22 Revised:2019-05-07 Online:2019-06-03 Published:2019-09-10
  • Corresponding author: WEN Liuying
  • About the authors: TANG Linchuan, born in 1993 in Chengdu, Sichuan, M. S. candidate. His research interests include active learning and recommender systems. DENG Siyu, born in 1993 in Suining, Sichuan, M. S. candidate. Her research interests include active learning. WU Yanxue, born in 1995 in Bazhong, Sichuan, M. S. candidate. His research interests include deep learning and feature learning. WEN Liuying, born in 1983 in Liuzhou, Guangxi, Ph. D., lecturer, CCF member. Her research interests include rough sets, attribute extraction and granular computing.
  • Supported by:

    Open Project of the Key Laboratory of Oceanographic Big Data Mining and Application of Zhejiang Province (OBDMA201601).

Duplicate detection algorithm for massive images based on pHash block detection

TANG Linchuan, DENG Siyu, WU Yanxue, WEN Liuying   

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, China
  • Received:2019-03-22 Revised:2019-05-07 Online:2019-06-03 Published:2019-09-10
  • Supported by:

    This work is partially supported by the Open Project of the Key Laboratory of Oceanographic Big Data Mining and Application of Zhejiang Province (OBDMA201601).

Abstract:

A large number of duplicate images in a database not only degrades the performance of the learner but also wastes considerable storage space. For massive image deduplication, a duplicate detection algorithm for massive images based on pHash block-based local detection was proposed. Firstly, the pHash values of all images were generated. Secondly, each pHash value was divided into several parts of equal length; if two images had identical values in any one pHash part, they might be duplicates. Finally, the transitivity of image duplication was discussed, and the algorithm was implemented separately for the transitive case and the non-transitive case. Experimental results show that the proposed algorithm is highly efficient for massive images: with the similarity threshold set to 13, the transitive algorithm detects duplicates among nearly 300000 images in only about 2 minutes with an accuracy of 53%.

Key words: duplicate image detection, massive data, perception Hashing (pHash), local detection, transitivity
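
The abstract describes a candidate search in which each pHash is split into equal-length parts and two images are flagged as possible duplicates whenever any part matches exactly. The following is a minimal sketch of that idea, assuming each image already has a 64-bit pHash given as a 16-character hex string (for example, from a perceptual-hashing library); the helper names split_blocks and candidate_pairs and the choice of 4 parts are illustrative assumptions, not details taken from the paper.

from collections import defaultdict
from itertools import combinations

NUM_BLOCKS = 4  # assumed split of a 64-bit pHash into 4 equal 16-bit parts

def split_blocks(phash_hex, num_blocks=NUM_BLOCKS):
    # Divide a hex pHash string into equal-length substrings (the "parts").
    step = len(phash_hex) // num_blocks
    return [phash_hex[i * step:(i + 1) * step] for i in range(num_blocks)]

def candidate_pairs(phashes):
    # phashes: dict mapping image id -> hex pHash string.
    # Images sharing at least one identical part fall into the same bucket,
    # so only these pairs need a full similarity comparison later.
    buckets = defaultdict(set)  # (part index, part value) -> set of image ids
    for img_id, h in phashes.items():
        for idx, part in enumerate(split_blocks(h)):
            buckets[(idx, part)].add(img_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

Bucketing on (part index, part value) avoids comparing every pair of images, which is what allows the search to scale to hundreds of thousands of images as reported in the abstract.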

Abstract:

The large number of duplicate images in a database not only affects the performance of the learner, but also consumes a lot of storage space. For massive image deduplication, a duplicate detection algorithm for massive images was proposed based on pHash (perception Hashing). Firstly, the pHash values of all images were generated. Secondly, each pHash value was divided into several parts of equal length. If the values of any one of the pHash parts of two images were equal, the two images might be duplicates. Finally, the transitivity of image duplication was discussed, and corresponding algorithms were proposed for the transitive case and the non-transitive case. Experimental results show that the proposed algorithms are effective in processing massive images. When the similarity threshold is 13, detecting duplicates among nearly 300000 images with the proposed transitive algorithm takes only about two minutes, with an accuracy of around 53%.

Key words: duplicate image detection, massive data, perception Hashing (pHash), block detection, transitivity
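
For the transitive case mentioned in the abstract, candidate pairs whose full Hamming distance falls within the similarity threshold (13 in the experiments) can be merged into the same duplicate group. Below is a hedged sketch that applies a union-find structure to the candidate pairs produced by the bucketing sketch above; the function names and the union-find choice are assumptions for illustration, not details reported in the paper.

THRESHOLD = 13  # similarity (Hamming distance) threshold used in the experiments

def hamming(h1, h2):
    # Bit-level Hamming distance between two equal-length hex pHash strings.
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

def duplicate_groups(phashes, pairs):
    # Transitive grouping: if a~b and b~c are duplicates, then a, b and c
    # end up in one group even when a and c were never compared directly.
    parent = {img_id: img_id for img_id in phashes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        if hamming(phashes[a], phashes[b]) <= THRESHOLD:
            parent[find(a)] = find(b)      # union the two groups

    groups = {}
    for img_id in phashes:
        groups.setdefault(find(img_id), []).append(img_id)
    return [g for g in groups.values() if len(g) > 1]

A non-transitive variant would instead report each qualifying pair on its own rather than merging pairs into larger groups.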
