Adaptive hybrid attention hashing for deep cross-modal retrieval

Journal of Computer Applications, 2022, Vol. 42, Issue 12: 3663-3670. DOI: 10.11772/j.issn.1001-9081.2021101806

Section: Artificial Intelligence

Xinghua LIU¹,², Guitao CAO³, Qiubin LIN¹,², Wenming CAO¹,²

1. College of Electronics and Information Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
2. Guangdong Multimedia Information Service Engineering Technology Research Center (Shenzhen University), Shenzhen, Guangdong 518060, China
3. Software Engineering Institute, East China Normal University, Shanghai 200062, China

Received: 2021-10-22 Revised: 2021-12-20 Accepted: 2021-12-23 Online: 2021-12-31 Published: 2022-12-10
Contact: Wenming CAO
About the authors:
LIU Xinghua, born in 1995 in Xinyang, Henan, M.S. candidate. His research interests include multimedia information processing and cross-modal retrieval.
CAO Guitao, born in 1970 in Yantai, Shandong, Ph.D., professor, CCF member. Her research interests include multimedia information processing and artificial intelligence.
LIN Qiubin, born in 1994 in Chaozhou, Guangdong, Ph.D. candidate. His research interests include multimedia information processing and cross-modal retrieval.
Supported by:
National Natural Science Foundation of China (61771322)

Abstract

Existing hashing methods cannot distinguish the importance of the feature information in different regions during feature learning, and cannot fully exploit label information to mine the correlation between modalities. To address these problems, an Adaptive Hybrid Attention Hashing for deep cross-modal retrieval (AHAH) model was proposed. Firstly, channel attention and spatial attention were combined through autonomously learned weights to strengthen attention to the relevant target regions of the feature map and weaken attention to the irrelevant ones. Secondly, the modality labels were statistically analyzed, and the proposed similarity measure was used to quantify similarity as a number between 0 and 1, so that the similarity between modalities is represented at a finer granularity. On four commonly used datasets, MIRFLICKR-25K, NUS-WIDE, MSCOCO, and IAPR TC-12, with a hash code length of 16 bits, the proposed method improved the retrieval mean Average Precision (mAP) by 2.25%, 1.75%, 6.8%, and 2.15%, respectively, compared with the state-of-the-art method Multi-Label Semantics Preserving Hashing (MLSPH). In addition, ablation experiments and efficiency analysis further demonstrated the effectiveness of the proposed method.

Key words: cross-modal retrieval, hashing method, deep neural network, adaptive, hybrid attention
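
The abstract describes replacing a yes/no similarity with a graded value between 0 and 1 derived from the labels. The paper's actual measure is not given on this page, so the sketch below swaps in a Jaccard overlap of multi-hot label vectors purely as an illustrative assumption; the function names `soft_label_similarity` and `binary_similarity` are hypothetical.

```python
from typing import Sequence


def binary_similarity(labels_a: Sequence[int], labels_b: Sequence[int]) -> int:
    """Conventional 0/1 similarity: 1 if the two samples share any label."""
    return int(any(a and b for a, b in zip(labels_a, labels_b)))


def soft_label_similarity(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Graded similarity in [0, 1] between two multi-hot label vectors.

    Assumption: Jaccard overlap stands in for the paper's own measure,
    which is not specified in the abstract.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    shared = sum(1 for a, b in zip(labels_a, labels_b) if a and b)
    union = sum(1 for a, b in zip(labels_a, labels_b) if a or b)
    # Two samples with no labels at all are treated as dissimilar.
    return shared / union if union else 0.0


if __name__ == "__main__":
    image_labels = [1, 1, 0, 1]  # hypothetical multi-hot labels of an image
    text_labels = [1, 0, 0, 1]   # hypothetical multi-hot labels of a text
    print(binary_similarity(image_labels, text_labels))
    print(soft_label_similarity(image_labels, text_labels))
```

With a binary measure, a pair sharing one label and a pair sharing all labels both receive similarity 1; a graded value separates them, which is the finer representation the abstract refers to.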

CLC number: TP391.3

Xinghua LIU, Guitao CAO, Qiubin LIN, Wenming CAO. Adaptive hybrid attention hashing for deep cross-modal retrieval[J]. Journal of Computer Applications, 2022, 42(12): 3663-3670.

Dataset	Training set	Retrieval set	Query set	Text dimension
MIRFLICKR-25K	10 000	18 015	2 000	1 386
NUS-WIDE	10 500	188 321	2 100	1 000
MSCOCO	10 000	80 000	5 000	2 000
IAPR TC-12	10 000	18 000	2 000	2 912

Task	Method	MIRFLICKR-25K			NUS-WIDE			MSCOCO			IAPR TC-12
		16 bit	32 bit	64 bit	16 bit	32 bit	64 bit	16 bit	32 bit	64 bit	16 bit	32 bit	64 bit
Image-to-text	SCM	0.685	0.693	0.697	0.497	0.502	0.499	0.498	0.556	0.565	0.369	0.367	0.380
	SePH	0.709	0.711	0.716	0.479	0.501	0.492	0.489	0.502	0.499	0.444	0.456	0.464
	GSPH	0.607	0.619	0.623	0.402	0.415	0.421	0.443	0.473	0.484	0.372	0.392	0.402
	DCMH	0.677	0.703	0.725	0.590	0.603	0.609	0.497	0.506	0.511	0.453	0.473	0.484
	SSAH	0.797	0.809	0.810	0.636	0.636	0.637	0.550	0.577	0.576	0.544	0.537	0.549
	MLCAH	0.808	0.816	0.828	0.645	0.640	0.653	0.582	0.597	0.595	0.550	0.565	0.563
	MLSPH	0.808	0.824	0.834	0.641	0.660	0.673	0.574	0.592	0.601	0.426	0.426	0.475
	Proposed (AHAH)	0.821	0.832	0.836	0.662	0.682	0.692	0.613	0.655	0.675	0.557	0.587	0.602
Text-to-image	SCM	0.707	0.714	0.719	0.567	0.583	0.597	0.492	0.556	0.568	0.345	0.341	0.347
	SePH	0.722	0.723	0.727	0.487	0.493	0.488	0.485	0.495	0.485	0.442	0.456	0.465
	GSPH	0.628	0.646	0.650	0.500	0.523	0.535	0.544	0.604	0.646	0.418	0.445	0.464
	DCMH	0.705	0.717	0.724	0.620	0.634	0.643	0.507	0.520	0.527	0.519	0.538	0.549
	SSAH	0.782	0.797	0.799	0.653	0.676	0.683	0.552	0.578	0.578	0.531	0.534	0.566
	MLCAH	0.793	0.811	0.807	0.679	0.698	0.704	0.569	0.593	0.583	0.554	0.563	0.566
	MLSPH	0.785	0.804	0.815	0.643	0.663	0.672	0.556	0.586	0.613	0.435	0.452	0.473
	Proposed (AHAH)	0.816	0.825	0.831	0.685	0.706	0.713	0.617	0.659	0.672	0.571	0.603	0.620
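
The abstract names mean Average Precision (mAP) as the reported metric, so the scores in the table above are presumably mAP values per hash code length. As a rough illustration of how such a score is commonly obtained under Hamming ranking, here is a minimal sketch. It assumes that a database item is relevant when it shares at least one label with the query, a frequent protocol in cross-modal hashing that this page does not confirm for the paper; all function names are hypothetical.

```python
from typing import List, Sequence


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Number of positions where two equal-length hash codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def average_precision(query_code: Sequence[int], query_labels: Sequence[int],
                      db_codes: List[Sequence[int]],
                      db_labels: List[Sequence[int]]) -> float:
    """AP of one query after ranking the database by Hamming distance."""
    # sorted() is stable, so ties keep their original database order.
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming_distance(query_code, db_codes[i]))
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        # Assumed relevance rule: at least one shared label.
        if any(q and d for q, d in zip(query_labels, db_labels[idx])):
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels) -> float:
    """Mean of the per-query AP values."""
    aps = [average_precision(c, l, db_codes, db_labels)
           for c, l in zip(query_codes, query_labels)]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    # Hypothetical 4-bit codes: an image query against three text items.
    db_codes = [(1, 1, -1, -1), (-1, -1, 1, 1), (1, 1, -1, 1)]
    db_labels = [(1, 0), (1, 0), (0, 1)]
    print(mean_average_precision([(1, 1, -1, -1)], [(1, 0)], db_codes, db_labels))
```

For the image-to-text task the queries are image codes and the database holds text codes; the text-to-image task swaps the two roles.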

Model	Image-to-text			Text-to-image
	16 bit	32 bit	64 bit	16 bit	32 bit	64 bit
Channel attention	0.808	0.828	0.829	0.811	0.818	0.825
Spatial attention	0.809	0.826	0.830	0.809	0.819	0.824
Hybrid attention	0.821	0.832	0.836	0.817	0.825	0.831
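
The table above compares channel-only, spatial-only, and hybrid attention. The abstract states only that the two attention types are combined through learned weights, so the following is a loose sketch of that idea and not the authors' architecture. It assumes parameter-free gates (global average pooling for channels, a mean over channels for locations) and two scalar logits that a softmax turns into mixing weights; in a real model the gates would contain trainable layers and the logits would be learned by backpropagation.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mixing_weights(logit_channel: float, logit_spatial: float) -> Tuple[float, float]:
    """Softmax over two logits, giving non-negative weights that sum to 1."""
    m = max(logit_channel, logit_spatial)  # subtract the max for stability
    e_c = math.exp(logit_channel - m)
    e_s = math.exp(logit_spatial - m)
    return e_c / (e_c + e_s), e_s / (e_c + e_s)


def channel_gates(fmap: FeatureMap) -> List[float]:
    """One gate per channel from global average pooling (simplified)."""
    gates = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        gates.append(sigmoid(sum(values) / len(values)))
    return gates


def spatial_gates(fmap: FeatureMap) -> List[List[float]]:
    """One gate per location from the mean over channels (simplified)."""
    n_channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[sigmoid(sum(fmap[c][i][j] for c in range(n_channels)) / n_channels)
             for j in range(width)]
            for i in range(height)]


def hybrid_attention(fmap: FeatureMap, logit_channel: float,
                     logit_spatial: float) -> FeatureMap:
    """Reweight each feature by a weighted mix of its channel and spatial gate."""
    w_c, w_s = mixing_weights(logit_channel, logit_spatial)
    cg = channel_gates(fmap)
    sg = spatial_gates(fmap)
    return [[[v * (w_c * cg[c] + w_s * sg[i][j]) for j, v in enumerate(row)]
             for i, row in enumerate(channel)]
            for c, channel in enumerate(fmap)]


if __name__ == "__main__":
    toy = [[[2.0, 0.0], [0.0, 0.0]],   # channel 0: one active location
           [[0.5, 0.5], [0.5, 0.5]]]   # channel 1: uniform activation
    print(hybrid_attention(toy, logit_channel=0.0, logit_spatial=0.0))
```

Pushing one logit far below the other recovers the channel-only or spatial-only variant, which mirrors the three rows compared in the table.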

Task	Method	MIRFLICKR-25K			NUS-WIDE			MSCOCO			IAPR TC-12
		16 bit	32 bit	64 bit	16 bit	32 bit	64 bit	16 bit	32 bit	64 bit	16 bit	32 bit	64 bit
Image-to-text	AHAH	0.821	0.832	0.836	0.662	0.682	0.692	0.613	0.655	0.685	0.557	0.587	0.602
	AHAH-1	0.815	0.823	0.829	0.651	0.674	0.685	0.607	0.44	0.676	0.548	0.579	0.595
	AHAH-2	0.817	0.824	0.830	0.655	0.677	0.684	0.609	0.650	0.680	0.550	0.582	0.596
Text-to-image	AHAH	0.816	0.825	0.831	0.685	0.706	0.713	0.617	0.659	0.682	0.571	0.603	0.620
	AHAH-1	0.811	0.820	0.823	0.678	0.695	0.702	0.610	0.650	0.673	0.562	0.595	0.609
	AHAH-2	0.810	0.813	0.825	0.681	0.699	0.706	0.607	0.654	0.677	0.566	0.593	0.614
