Journal of Computer Applications

Text-based person retrieval method based on multi-granularity shared semantic center association

  

  • Received: 2024-10-11  Revised: 2024-11-26  Accepted: 2024-12-02  Online: 2025-01-06  Published: 2025-01-06
  • Supported by:
    Research on industrial unknown task reasoning and learning integrating online small sample learning and offline domain adaptation methods

KANG Bin1,2, CHEN Bin2,3*, WANG Junjie3,4, LI Yulin3,4, ZHAO Junzhi5, XIAN Weizhi6

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610213, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  3. International Research Institute for Artificial Intelligence, Harbin Institute of Technology (Shenzhen), Shenzhen Guangdong 518067, China
  4. School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen Guangdong 518067, China
  5. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
  6. Chongqing Research Institute, Harbin Institute of Technology, Chongqing 401151, China

  • Corresponding author: CHEN Bin

Abstract: Text-based person retrieval aims to identify specific individuals using textual descriptions as queries. Existing state-of-the-art methods typically design multiple alignment mechanisms to establish correspondence between cross-modal data at both the global and the local level, but they often overlook the mutual influence among these mechanisms. To address this, a multi-granularity shared semantic center association mechanism was proposed to explore the promoting and inhibiting effects between global and local alignment. First, a multi-granularity cross-alignment module was introduced to strengthen the interactions between images and sentences as well as between local regions and words, achieving multi-level alignment of cross-modal data in a joint embedding space. Then, a shared semantic center was established as a learnable semantic hub that associates global and local features, enhancing semantic consistency among the different alignment mechanisms and promoting the collaboration of global and local features. Within the shared semantic center, local and global cross-modal similarity relationships between image and text features were computed, providing a complementary measure from both perspectives and maximizing the positive effects among the multiple alignment mechanisms. Finally, experiments on the CUHK-PEDES dataset show that the proposed method improves Rank-1 by 8.69% and mAP by 6.85% compared to the baseline; it also achieves excellent performance on the ICFG-PEDES and RSTPReid benchmarks, clearly outperforming all competing methods.
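The abstract describes the shared semantic center only at a high level. The following is a minimal PyTorch-style sketch of one plausible reading, in which a bank of learnable center embeddings acts as the hub and the global and local cross-modal similarities are combined into a single complementary score. All names (SharedSemanticCenter, project), dimensions, the temperature, and the equal weighting are illustrative assumptions, not the paper's actual formulation.

    # Minimal sketch (illustrative only): learnable "shared semantic centers"
    # that both modalities are soft-assigned to, plus a combined
    # global + local cross-modal similarity measured in that shared space.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedSemanticCenter(nn.Module):
        def __init__(self, dim=512, num_centers=32):
            super().__init__()
            # Learnable semantic hub shared by the image and text branches.
            self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)

        def project(self, tokens):
            # tokens: (B, N, dim) local features (image regions or sentence words).
            # Soft-assign every token to the centers, then average into a
            # per-sample distribution over centers: a "semantic code" of shape (B, K).
            attn = torch.einsum('bnd,kd->bnk',
                                F.normalize(tokens, dim=-1),
                                F.normalize(self.centers, dim=-1))
            return F.softmax(attn / 0.07, dim=-1).mean(dim=1)

        def forward(self, img_global, txt_global, img_local, txt_local):
            # Global view: cosine similarity between pooled image and sentence features.
            sim_g = F.normalize(img_global, dim=-1) @ F.normalize(txt_global, dim=-1).t()
            # Local view: similarity of the two modalities' center codes, so that
            # region-word and image-sentence alignment refer to the same hub.
            sim_l = (F.normalize(self.project(img_local), dim=-1)
                     @ F.normalize(self.project(txt_local), dim=-1).t())
            # Complementary measure from both perspectives (equal weights are a guess).
            return 0.5 * sim_g + 0.5 * sim_l

In a typical retrieval pipeline, the resulting B x B similarity matrix would feed a cross-modal contrastive or matching loss; the paper's actual losses, fusion weights, and number of centers are not specified in the abstract.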

Key words: vision-language model, person retrieval, global alignment, local alignment, shared semantic center

