Journal of Computer Applications, 2024, Vol. 44, Issue (1): 47-57. DOI: 10.11772/j.issn.1001-9081.2023060861

• Cross-media representation learning and cognitive reasoning •

Video dynamic scene graph generation model based on multi-scale spatial-temporal Transformer

Jia WANG-ZHU, Zhou YU, Jun YU, Jianping FAN

  1. School of Computer Science, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China
  • Received: 2023-07-01  Revised: 2023-09-05  Accepted: 2023-09-11  Online: 2023-10-09  Published: 2024-01-10
  • Contact: Zhou YU
  • About author: WANG-ZHU Jia, born in 1998, M. S. candidate. Her research interests include multimedia understanding.
    YU Jun, born in 1980, Ph. D., professor. His research interests include multimedia analysis and retrieval.
    FAN Jianping, born in 1968, Ph. D., professor. His research interests include multimedia analysis.
    YU Zhou (corresponding author), born in 1988, Ph. D., professor. His research interests include multimedia data analysis and reasoning.
  • Supported by:
    National Natural Science Foundation of China(62072147);Natural Science Foundation of Zhejiang Province(LR22F020001)

Abstract:

To address the challenge that object relationships in videos change dynamically over time, a video dynamic scene graph generation model based on a multi-scale spatial-temporal Transformer was proposed. The multi-scale modeling idea was introduced into the classic Transformer architecture to precisely model the dynamic fine-grained semantics in videos. First, in the spatial dimension, attention was paid not only to the global spatial correlations of objects, as in traditional models, but also to the local spatial correlations of objects' relative positions, which facilitated a better understanding of the interaction dynamics between people and objects and led to more accurate semantic analysis results. Then, in the temporal dimension, not only the traditional short-term temporal correlations of objects in videos were modeled, but also the long-term temporal correlations of the same object pairs across the entire video were emphasized. Modeling the long-term relationships between objects more comprehensively helped to generate more accurate and coherent scene graphs and alleviated the problems caused by occlusion and overlap during scene graph generation. Finally, through the joint action of the spatial encoder and the temporal encoder, the dynamic fine-grained semantics in videos were captured more accurately by the model, overcoming the limitations of traditional single-scale approaches. The experimental results show that, compared with the baseline model STTran on the Action Genome benchmark dataset, the proposed model improves Recall@10 by 5.0, 2.8, and 2.9 percentage points on the predicate classification, scene graph classification, and scene graph detection tasks, respectively. These results demonstrate that the multi-scale modeling idea enables more precise modeling and effectively improves performance on the video dynamic scene graph generation task.
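To make the multi-scale idea concrete, the following is a minimal sketch (not the authors' released code) of how a spatial encoder could combine global attention over all object tokens in a frame with local attention restricted to spatially close objects, and how a temporal encoder could combine short-term sliding-window attention with long-term attention over the full trajectory of one subject-object pair. All module names, dimensions, and thresholds (local_radius, window) are illustrative assumptions rather than values from the paper.

import torch
import torch.nn as nn


def masked_attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask is boolean (True = attend, False = block).
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


class MultiScaleSpatialEncoder(nn.Module):
    # Fuses global spatial attention (all objects in a frame) with local
    # attention limited to objects whose box centers lie within local_radius.
    def __init__(self, dim=256, local_radius=0.25):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim * 2, dim)
        self.local_radius = local_radius  # assumed normalized-distance threshold

    def forward(self, feats, centers):
        # feats: (N, dim) object features of one frame; centers: (N, 2) box centers in [0, 1]
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        global_out = masked_attention(q, k, v)                  # global spatial scale
        local_mask = torch.cdist(centers, centers) <= self.local_radius
        local_out = masked_attention(q, k, v, local_mask)       # local spatial scale
        return self.proj(torch.cat([global_out, local_out], dim=-1))


class MultiScaleTemporalEncoder(nn.Module):
    # Fuses short-term attention over a sliding window of frames with
    # long-term attention over the full trajectory of one subject-object pair.
    def __init__(self, dim=256, window=3):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim * 2, dim)
        self.window = window  # assumed short-term window size (in frames)

    def forward(self, pair_feats):
        # pair_feats: (T, dim) features of the same subject-object pair over T frames
        T = pair_feats.size(0)
        q, k, v = self.qkv(pair_feats).chunk(3, dim=-1)
        idx = torch.arange(T)
        short_mask = (idx[:, None] - idx[None, :]).abs() <= self.window
        short_out = masked_attention(q, k, v, short_mask)       # short-term scale
        long_out = masked_attention(q, k, v)                    # long-term scale
        return self.proj(torch.cat([short_out, long_out], dim=-1))

A full model along these lines would stack such encoders and feed the temporal output to a predicate classifier to produce per-frame scene graphs, analogous to the STTran-style pipeline that the paper takes as its baseline.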

Key words: dynamic scene graph generation, attention mechanism, multi-scale modeling, video understanding, semantic analysis

