Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (03): 667-669. DOI: 10.3724/SP.J.1087.2013.00667

• Multimedia processing technology •

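Algorithm of near-duplicate image detection based on Bag-of-words and Hash coding

WANG Yutian1*, YUAN Jiangtao2, QIN Haiquan1, LIU Xin1

  1. The First Research Institute, Ministry of Public Security, Beijing 100048, China;
  2. Beichen Branch, Tianjin Municipal Public Security Bureau, Tianjin 300400, China
  • Received: 2012-09-10  Revised: 2012-11-02  Online: 2013-03-01  Published: 2013-03-01
  • Corresponding author: WANG Yutian
  • About the authors: WANG Yutian (1975-), male, born in Xianyang, Shaanxi, engineer, M.S., research interests: electronic and computer testing; YUAN Jiangtao (1976-), male, born in Tianshui, Gansu, engineer, M.S., research interests: computer networks; QIN Haiquan (1981-), male, born in Yongzhou, Hunan, engineer, M.S., research interests: computer information security, data forensics; LIU Xin (1980-), female, born in Jining, Shandong, engineer, M.S., research interests: computer information security.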

Abstract: To address the insufficient efficiency and accuracy of traditional near-duplicate image detection methods, a near-duplicate image detection algorithm based on Bag-of-words and Hash coding is proposed. First, each image is represented as a 500-dimensional feature vector using Bag-of-words; second, the feature dimension is reduced using Principal Component Analysis (PCA) and Scale-Invariant Feature Transform (SIFT), and the resulting features are encoded by Hash coding; finally, a dynamic distance metric is used to detect near-duplicate images. The experimental results show that the algorithm is feasible for near-duplicate image detection and achieves a good balance between precision and recall: precision reaches 90%-95%, and recall reaches 70%-80%.
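The three stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, because the abstract does not specify the hash function, the PCA configuration, or the dynamic distance metric. The sketch makes several assumptions: local descriptors (e.g. SIFT) and a visual-word codebook are already available, seeded random projections stand in for the PCA directions, each projection is binarized by its sign to form the hash code, and a fixed Hamming-distance threshold `max_distance` stands in for the dynamic metric. All function and parameter names are hypothetical.

```python
import random


def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word and
    return the normalized word-count histogram (one bin per word)."""
    hist = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        hist[nearest] += 1
    total = sum(hist) or 1  # avoid division by zero for empty input
    return [h / total for h in hist]


def make_projections(dim, n_bits, seed=0):
    """Assumption: seeded Gaussian random directions replace the PCA
    directions that the abstract mentions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, projections):
    """Project the mean-centered vector onto each direction and keep
    one sign bit per direction, packed into a Python int."""
    mean = sum(vec) / len(vec)
    centered = [v - mean for v in vec]
    bits = 0
    for row in projections:
        dot = sum(r * c for r, c in zip(row, centered))
        bits = (bits << 1) | int(dot > 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(code_a, code_b, max_distance):
    """Assumption: a fixed threshold replaces the dynamic distance metric."""
    return hamming(code_a, code_b) <= max_distance
```

With a 500-word codebook, as in the abstract, `bow_histogram` would return a 500-dimensional vector and `make_projections(500, n_bits)` would define an `n_bits`-bit code. Comparing two images then costs one XOR and a bit count, which is the main efficiency argument for hash coding.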

Key words: near-duplicate image, Bag-of-words, Principal Component Analysis (PCA), Hash coding, dynamic distance metric
