Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (9): 2523-2531. DOI: 10.11772/j.issn.1001-9081.2020111785

Special Issue: Artificial Intelligence

• Artificial intelligence •

Deep unsupervised discrete cross-modal hashing based on knowledge distillation

ZHANG Cheng, WAN Yuan, QIANG Haopeng   

  1. School of Science, Wuhan University of Technology, Wuhan Hubei 430070, China
  • Received: 2020-11-14  Revised: 2020-12-21  Online: 2021-09-10  Published: 2021-05-08
  • Supported by:
    This work is partially supported by the Fundamental Research Funds for the Central Universities (2019IB010).

  • Corresponding author: WAN Yuan
  • About the authors: ZHANG Cheng, born in 1996 in Xiaogan, Hubei, M. S. candidate. His research interests include machine learning, deep learning, and large-scale multimedia data retrieval. WAN Yuan, born in 1976 in Wuhan, Hubei, Ph. D., professor, CCF member. Her research interests include machine learning, image processing, and pattern recognition. QIANG Haopeng, born in 1995 in Shijiazhuang, Hebei, M. S. candidate. His research interests include machine learning, deep learning, and large-scale multimedia data retrieval.

Abstract: Cross-modal hashing has attracted much attention because of its low storage cost and high retrieval efficiency. Most existing cross-modal hashing methods require additional manual labels to provide inter-instance association information, even though the deep features learned by pre-trained deep unsupervised cross-modal hashing methods can provide similar information. In addition, the discrete constraints are relaxed during hash code learning, which causes a large quantization loss. To address these two issues, a Deep Unsupervised Discrete Cross-modal Hashing (DUDCH) method based on knowledge distillation was proposed. Firstly, following the idea of knowledge transfer in knowledge distillation, the latent association information of a pre-trained unsupervised teacher model was used to reconstruct a symmetric similarity matrix, which replaces manual labels in training the supervised student model. Secondly, Discrete Cyclic Coordinate descent (DCC) was adopted to update the discrete hash codes iteratively, thereby reducing the quantization loss between the real-valued hash codes learned by the neural network and the discrete hash codes. Finally, an end-to-end neural network was adopted as the teacher model and an asymmetric neural network was constructed as the student model, which reduces the time complexity of the combined model. Experimental results on two commonly used benchmark datasets, MIRFLICKR-25K and NUS-WIDE, show that compared with Deep Joint-Semantics Reconstructing Hashing (DJSRH), the proposed method increases the mean Average Precision (mAP) of image-to-text/text-to-image retrieval on average by 2.83 percentage points/0.70 percentage points on MIRFLICKR-25K and by 6.53 percentage points/3.95 percentage points on NUS-WIDE, demonstrating its effectiveness in large-scale cross-modal retrieval.
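The two technical steps described above (reconstructing a symmetric similarity matrix from the teacher model's deep features, and updating the discrete hash codes bit by bit with DCC) can be illustrated with a minimal NumPy sketch. The feature-fusion rule, the scaling by code length, and the objective ||r*S - B U^T||_F^2 below are assumptions made for illustration only, and the function names teacher_similarity and dcc_update are hypothetical; the paper's exact formulation may differ.

import numpy as np

def teacher_similarity(img_feat, txt_feat):
    # Hypothetical reconstruction of the symmetric similarity matrix from the
    # teacher's deep features: L2-normalise each modality, concatenate, and take
    # cosine similarities in [-1, 1]. The paper's actual fusion rule may differ.
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    f = l2norm(np.concatenate([l2norm(img_feat), l2norm(txt_feat)], axis=1))
    s = f @ f.T
    return (s + s.T) / 2            # enforce symmetry

def dcc_update(S, U, B, n_iter=3):
    # Generic bit-wise update in the style of Discrete Cyclic Coordinate descent:
    # minimise ||r*S - B @ U.T||_F^2 over B in {-1, +1}^{n x r}, one bit column
    # at a time, with U the real-valued codes output by the student network.
    n, r = B.shape
    target = r * S                  # scale similarities to the code length
    for _ in range(n_iter):
        for l in range(r):
            # residual with the l-th bit column removed from the approximation
            resid = target - B @ U.T + np.outer(B[:, l], U[:, l])
            B[:, l] = np.where(resid @ U[:, l] >= 0, 1, -1)
    return B

# Toy usage: 6 instances, 4-bit codes.
rng = np.random.default_rng(0)
img_feat, txt_feat = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
S = teacher_similarity(img_feat, txt_feat)
U = rng.normal(size=(6, 4))                        # stand-in for student network outputs
B = np.sign(rng.normal(size=(6, 4))).astype(float)
B = dcc_update(S, U, B)
print(B)

In an asymmetric teacher-student setup of the kind the abstract describes, B would hold the discrete codes being solved for while U comes from the student network, so only U requires gradient-based training and B is always kept strictly binary.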

Key words: cross-modal hashing, knowledge distillation, reconstruction of similarity matrix, Discrete Cyclic Coordinate descent (DCC), asymmetric

