Scene recognition method based on structured co-occurrence representation learning

doi:10.11772/j.issn.1001-9081.2026010051

Abstract

To address the difficulty that existing scene recognition methods have in balancing the modeling of structured associations among visual elements with efficient inference, a scene recognition method based on Structured Co-occurrence Representation Learning (S-CRL) was proposed. First, the image space was discretized into semantic visual words using a lightweight segmentation model and feature clustering, and statistical co-occurrence priors were constructed to guide sequence generation. Second, a bidirectional Transformer architecture was introduced to perform masked visual modeling: by aggregating context information to predict the masked visual words, the model internalized the compositional rules and co-occurrence dependencies among visual elements. Finally, a multiple instance aggregation strategy based on an attention mechanism was designed to adaptively transform the generated context features into a global scene representation. Experimental results show that the proposed method reaches a recognition accuracy of 80.61% on the SUN397 dataset, which is 4.55, 4.46, and 1.31 percentage points higher than NEM (Nested Ensemble Model), SpaCoNet (Spatial relation and Co-occurrence Network), and DGN-Net (Discriminative Graph Network), respectively. In addition, the inference cost of the method is only 4.2 GFLOPs, approximately 23.9% of that of ViT-B/16. This balance between high accuracy and high inference efficiency validates the effectiveness and practicality of the proposed method in resource-constrained scenarios and provides a new, interpretable perspective for understanding the latent structure of images.
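The abstract names two building blocks without giving formulas: a statistical co-occurrence prior over visual words, and attention-based multiple instance aggregation. The following is a minimal sketch of what such components could look like, not the authors' implementation. It assumes that co-occurrence is counted per image over unordered pairs of distinct visual-word IDs, and that aggregation uses the common attention-pooling form a_k = softmax_k(w · tanh(V h_k)). The function names, the toy data, and the parameters `V` and `w` (which would be learned in practice) are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(images):
    """Count, over all images, how often two distinct visual words share an image.

    `images` is a list of visual-word ID lists (one list per image).
    Keys of the result are sorted pairs (a, b) with a < b.
    """
    counts = Counter()
    for words in images:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mil_pool(instances, V, w):
    """Attention-based MIL pooling (assumed form, not taken from the paper).

    Each instance h_k gets a score w . tanh(V h_k); the scores are normalized
    with a softmax and used to form a weighted sum of the instances.
    Returns (pooled_vector, attention_weights).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(instances[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return pooled, weights


# Toy data: three images described by hypothetical visual-word IDs.
images = [[3, 7, 7, 12], [3, 12], [7, 9]]
counts = cooccurrence_counts(images)

# Toy context features for three instances, with hand-picked (not learned) parameters.
instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, -1.0]
pooled, weights = attention_mil_pool(instances, V, w)
```

In this sketch the attention weights sum to one, so the pooled vector is a convex combination of the instance features; setting `w` to all zeros reduces the pooling to a plain average, which is a convenient sanity check.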

Key words: multiple instance learning

CLC Number: TP391.41

付庆龙, 赵其鲁, 樊晓曼. Scene recognition method based on structured co-occurrence representation learning [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2026010051.
