Journal of Computer Applications

Scene recognition method based on structured co-occurrence representation learning

  

  • Received: 2026-01-22 Revised: 2026-04-01 Online: 2026-05-29 Published: 2026-05-29

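Fu Qinglong (付庆龙), Zhao Qilu (赵其鲁), Fan Xiaoman (樊晓曼)

  1. College of Computer Science and Technology, Qingdao University
  • Corresponding author: Zhao Qilu (赵其鲁)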

Abstract: To address the difficulty that existing scene recognition methods have in capturing structured associations among visual elements while keeping inference efficient, a scene recognition method based on Structured Co-occurrence Representation Learning (S-CRL) was proposed. First, the image space was discretized into semantic visual words using a lightweight segmentation model and feature clustering, and statistical co-occurrence priors were constructed to guide sequence generation. Second, a bidirectional Transformer architecture was introduced to perform masked visual modeling: by aggregating context information to predict the masked visual words, the model internalized the compositional rules and co-occurrence dependencies among visual elements. Finally, an attention-based multiple instance aggregation strategy was designed to adaptively transform the generated context features into a global scene representation. Experimental results show that the proposed method reaches a recognition accuracy of 80.61% on the SUN397 dataset, which is 4.55, 4.46, and 1.31 percentage points higher than NEM (Nested Ensemble Model), SpaCoNet (Spatial relation and Co-occurrence Network), and DGN-Net (Discriminative Graph Network), respectively. In addition, the inference cost of the method is only 4.2 GFLOPs, approximately 23.9% of that of ViT-B/16. This balance between accuracy and inference efficiency supports the effectiveness and practicality of the proposed method in resource-constrained scenarios, and provides a new, interpretable perspective for understanding the latent structure of images.
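The abstract does not state how the statistical co-occurrence priors are computed. As a minimal sketch of one common way to build such a prior, the example below assumes each image has already been reduced to a list of visual-word ids, counts how often two words appear in the same image, and normalizes the counts into conditional probabilities P(word_j | word_i). The function name `cooccurrence_prior` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_prior(images):
    """Estimate P(word_j | word_i) from per-image lists of visual-word ids.

    Assumption: co-occurrence is measured as image-level presence,
    so repeated words inside one image are counted once.
    """
    word_counts = Counter()  # number of images containing each word
    pair_counts = Counter()  # number of images containing both words
    for words in images:
        present = sorted(set(words))
        word_counts.update(present)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    # Normalize each ordered pair by how often the first word occurs.
    return {(a, b): n / word_counts[a] for (a, b), n in pair_counts.items()}


# Toy data: each inner list holds the visual-word ids found in one image.
toy_images = [[0, 1, 2], [0, 1], [1, 3], [0, 2, 2]]
prior = cooccurrence_prior(toy_images)
```

In this toy set, word 2 only ever appears together with word 0, so `prior[(2, 0)]` is 1.0, while words 0 and 3 never share an image and have no entry. A prior of this kind could then be used to order or weight visual words during sequence generation, although the paper's actual procedure may differ.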

Key words: Multiple Instance Learning
