Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (3): 901-908.DOI: 10.11772/j.issn.1001-9081.2023040412

• Multimedia computing and computer simulation •

Cross-view matching model based on attention mechanism and multi-granularity feature fusion

Meiyu CAI, Runzhe ZHU, Fei WU, Kaiyu ZHANG, Jiale LI

  1. School of Electronic and Electrical Engineering,Shanghai University of Engineering Science,Shanghai 201620,China
  • Received:2023-04-12 Revised:2023-07-08 Accepted:2023-07-13 Online:2024-03-12 Published:2024-03-10
  • Contact: Fei WU
  • About author:CAI Meiyu, born in 1998, M. S. candidate. Her research interests include visual positioning, scene matching and positioning.
    ZHU Runzhe, born in 1998, M. S. candidate. His research interests include visual geo-localization, cross-view matching.
    ZHANG Kaiyu, born in 1999, M. S. candidate. His research interests include target detection, target tracking, semantic segmentation, image generation.
    LI Jiale, born in 1999, M. S. candidate. His research interests include target detection, document layout analysis.
  • Supported by:
    China University Industry-University-Research Innovation Fund of Ministry of Education(2021ZYA08008);Project of Shanghai Municipal Science and Technology Commission(N22DZ1100803)



Cross-view scene matching aims to retrieve images of the same geographic target captured by different platforms (such as drones and satellites). However, the large appearance gap between platforms lowers the accuracy of UAV (Unmanned Aerial Vehicle) positioning and navigation tasks, and existing methods usually focus on only a single dimension of the image, ignoring its multi-dimensional features. To solve the above problems, a deep neural network named GAMF (Global Attention and Multi-granularity feature Fusion) was proposed to improve feature representation and feature discriminability. Firstly, images from the UAV perspective and the satellite perspective were combined, and three branches were extended under a unified network architecture to extract the spatial-location, channel and local features of the images from three dimensions. Secondly, the SGAM (Spatial Global relationship Attention Module) and CGAM (Channel Global Attention Module) were established, introducing a spatial global relationship mechanism and a channel attention mechanism to capture global information for better attention learning. Thirdly, to fuse local perception features, a local division strategy was introduced to improve the model's ability to extract fine-grained features. Finally, the features of the three dimensions were combined as the final features to train the model. Experimental results on the public dataset University-1652 show that the AP (Average Precision) of the GAMF model reaches 87.41% on the UAV visual positioning task, and the Recall at rank 1 (R@1) reaches 90.30% on the UAV visual navigation task, verifying that the GAMF model can effectively aggregate multi-dimensional image features and improve the accuracy of UAV positioning and navigation tasks.
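The three-branch fusion described above can be illustrated with a minimal NumPy sketch. This is not the paper's actual SGAM/CGAM implementation: the pooling choices, softmax-based attention weights, stripe count, and function names below are all assumptions made for illustration, standing in for the learned attention modules and the local division strategy.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a flat array.
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_branch(feat):
    # Stand-in for a channel attention module: global-average-pool each
    # channel, turn the pooled values into weights, reweight, and pool.
    w = softmax(feat.mean(axis=(1, 2)))                  # (C,) channel weights
    return (feat * w[:, None, None]).mean(axis=(1, 2))   # (C,) descriptor

def spatial_branch(feat):
    # Stand-in for a spatial attention module: average over channels,
    # softmax over locations, then attention-weighted pooling.
    a = softmax(feat.mean(axis=0).ravel()).reshape(feat.shape[1:])  # (H, W)
    return (feat * a[None]).sum(axis=(1, 2))             # (C,) descriptor

def local_branch(feat, parts=4):
    # Stand-in for the local division strategy: split the feature map
    # into horizontal stripes and pool each stripe separately.
    stripes = np.array_split(feat, parts, axis=1)
    return np.concatenate([s.mean(axis=(1, 2)) for s in stripes])  # (C*parts,)

def fuse(feat, parts=4):
    # Concatenate the three per-dimension descriptors into one feature.
    return np.concatenate([channel_branch(feat),
                           spatial_branch(feat),
                           local_branch(feat, parts)])

feat = np.random.rand(8, 16, 16).astype(np.float32)      # toy (C, H, W) map
desc = fuse(feat)                                         # length 8 + 8 + 8*4 = 48
```

In the paper the equivalent of `fuse` would operate on CNN feature maps shared between the UAV and satellite branches, and the concatenated descriptor would feed the metric-learning loss.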

Key words: Unmanned Aerial Vehicle (UAV), scene matching and positioning, visual positioning, metric learning, global relationship attention, deep learning



