Journal of Computer Applications, 2025, Vol. 45, Issue (3): 808-814. DOI: 10.11772/j.issn.1001-9081.2024101434

• Frontier research and typical applications of large models •

Text-based person retrieval method based on multi-granularity shared semantic center association

Bin KANG1,2, Bin CHEN2,3, Junjie WANG3,4, Yulin LI3,4, Junzhi ZHAO5, Weizhi XIAN6

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610213, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  3. International Research Institute of Artificial Intelligence, Harbin Institute of Technology, Shenzhen, Shenzhen Guangdong 518055, China
  4. School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen Guangdong 518055, China
  5. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China
  6. Chongqing Research Institute, Harbin Institute of Technology, Chongqing 401151, China
  • Received: 2024-10-11 Revised: 2024-11-26 Accepted: 2024-12-04 Online: 2025-01-06 Published: 2025-03-10
  • Contact: Bin CHEN
  • About author: KANG Bin, born in 1998, Ph.D. candidate. His research interests include multi-modal retrieval and object detection.
    WANG Junjie, born in 1997, Ph.D. candidate. His research interests include multi-modal large models and object detection.
    LI Yulin, born in 1999, Ph.D. candidate. His research interests include multi-modal large models and video understanding.
    ZHAO Junzhi, born in 1999. His research interests include image processing, information security, and computer vision.
    XIAN Weizhi, born in 1994, Ph.D. His research interests include computer vision and pattern recognition.
  • Supported by:
    Surface Project of Science and Technology of Shenzhen(GXWD-20220811170603002)

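Text-to-person retrieval method based on multi-granularity shared semantic center association

KANG Bin (康斌)1,2, CHEN Bin (陈斌)2,3, WANG Junjie (王俊杰)3,4, LI Yulin (李昱林)3,4, ZHAO Junzhi (赵军智)5, XIAN Weizhi (咸伟志)6

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610213
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049
  3. International Research Institute of Artificial Intelligence, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055
  4. School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055
  5. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756
  6. Chongqing Research Institute, Harbin Institute of Technology, Chongqing 401151
  • Corresponding author: CHEN Bin (陈斌)
  • About the authors: KANG Bin (康斌), born in 1998, male, from Lintao, Gansu, Ph.D. candidate. Main research interests: multi-modal retrieval, object detection.
    WANG Junjie (王俊杰), born in 1997, male, from Neijiang, Sichuan, Ph.D. candidate. Main research interests: multi-modal large models, object detection.
    LI Yulin (李昱林), born in 1999, male, from Rizhao, Shandong, Ph.D. candidate. Main research interests: multi-modal large models, video understanding.
    ZHAO Junzhi (赵军智), born in 1999, male, from Nanyang, Henan. Main research interests: image processing, information security, computer vision.
    XIAN Weizhi (咸伟志), born in 1994, male, from Nanjing, Jiangsu, Ph.D. Main research interests: computer vision, pattern recognition.
  • Funding:
    Shenzhen Stable Support General Project (GXWD-20220811170603002)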

Abstract:

Text-based person retrieval aims to identify a specific person using a textual description as the query. Existing state-of-the-art methods typically design multiple alignment mechanisms to establish correspondence among cross-modal data at both global and local levels, but they neglect the mutual influence among these mechanisms. To address this, a multi-granularity shared semantic center association mechanism was proposed to explore the promoting and inhibiting effects between global and local alignments. Firstly, a multi-granularity cross-alignment module was introduced to enhance image-sentence and local region-word interactions, achieving multi-level alignment of the cross-modal data in a joint embedding space. Then, a shared semantic center was established as a learnable semantic hub, and associations among global and local features were used to enhance semantic consistency among the different alignment mechanisms and to promote the collaborative effect of global and local features. In the shared semantic center, the local and global cross-modal similarity relationships between image and text features were calculated, providing a complementary measure from both global and local perspectives and maximizing positive effects among the multiple alignment mechanisms. Finally, experiments were carried out on the CUHK-PEDES dataset. Results show that, compared to the baseline method, the proposed method significantly improves Rank-1 by 8.69 percentage points and mean Average Precision (mAP) by 6.85 percentage points. The proposed method also achieves excellent performance on the ICFG-PEDES and RSTPReid datasets, significantly surpassing all the compared methods.

Key words: Visual-Language Model (VLM), person retrieval, global alignment, local alignment, shared semantic center
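
As an illustration of the two metrics reported in the abstract, the following minimal sketch computes Rank-k accuracy and mAP from a text-to-image similarity matrix. It shows only the standard definitions of these metrics, not the authors' evaluation code. The function names, the toy similarity scores, and the identity labels are all hypothetical and chosen for the example.

```python
from typing import List, Sequence


def ranked_gallery(scores: Sequence[float]) -> List[int]:
    """Gallery indices sorted by descending similarity to one text query."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def rank_k_accuracy(sim: Sequence[Sequence[float]],
                    query_ids: Sequence[int],
                    gallery_ids: Sequence[int],
                    k: int = 1) -> float:
    """Fraction of queries whose top-k images contain the queried identity."""
    hits = 0
    for q, scores in enumerate(sim):
        top_k = ranked_gallery(scores)[:k]
        if any(gallery_ids[j] == query_ids[q] for j in top_k):
            hits += 1
    return hits / len(sim)


def mean_average_precision(sim: Sequence[Sequence[float]],
                           query_ids: Sequence[int],
                           gallery_ids: Sequence[int]) -> float:
    """Mean over queries of the average precision at each correct match."""
    ap_values = []
    for q, scores in enumerate(sim):
        found = 0
        precisions = []
        for rank, j in enumerate(ranked_gallery(scores), start=1):
            if gallery_ids[j] == query_ids[q]:
                found += 1
                precisions.append(found / rank)
        if precisions:  # skip queries with no matching gallery image
            ap_values.append(sum(precisions) / len(precisions))
    return sum(ap_values) / len(ap_values)


# Toy data (hypothetical): 2 text queries, 4 gallery images.
toy_sim = [
    [0.9, 0.1, 0.3, 0.2],  # query 0 vs. each gallery image
    [0.8, 0.6, 0.1, 0.7],  # query 1 vs. each gallery image
]
toy_query_ids = [0, 1]
toy_gallery_ids = [0, 1, 0, 1]

rank1 = rank_k_accuracy(toy_sim, toy_query_ids, toy_gallery_ids, k=1)
m_ap = mean_average_precision(toy_sim, toy_query_ids, toy_gallery_ids)
print(rank1)           # 0.5
print(round(m_ap, 4))  # 0.7917
```

In this toy case, query 0 retrieves a correct image first, while query 1 finds its first correct image at rank 2. This gives Rank-1 = 0.5, while mAP also credits the correct matches at lower ranks. An improvement "in percentage points" is an absolute difference between two such scores expressed as percentages.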

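Abstract:

Text-based person retrieval aims to identify a specific person by using a textual description as the query. Existing state-of-the-art methods usually design multiple alignment mechanisms to establish correspondence between cross-modal data at the global and local levels, but they overlook the mutual influence among the different alignment mechanisms. Therefore, a multi-granularity shared semantic center association mechanism is proposed to explore in depth the promoting and inhibiting effects between global alignment and local alignment. First, a multi-granularity cross-alignment module is introduced, which strengthens the interaction between image and sentence and between local region and word token, achieving multi-level alignment of cross-modal data in a joint embedding space. Second, a shared semantic center is established as a learnable semantic hub; by associating global features and local features, it enhances semantic consistency among the different alignment mechanisms and promotes the synergy of global and local features. Within the shared semantic center, the local and global cross-modal similarity relationships between image features and text features are computed, providing a complementary measure from the global and local perspectives and maximizing the positive effects among the multiple alignment mechanisms. Finally, experiments are conducted on the CUHK-PEDES dataset. The results show that, compared with the baseline method, the proposed method significantly improves Rank-1 by 8.69 percentage points and mean Average Precision (mAP) by 6.85 percentage points. The proposed method also achieves excellent performance on the ICFG-PEDES and RSTPReid datasets, clearly surpassing all compared methods.

Key words: visual-language model, person retrieval, global alignment, local alignment, shared semantic center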

CLC Number: