Fine-grained image recognition based on mid-level subtle feature extraction and multi-scale feature fusion

doi:10.11772/j.issn.1001-9081.2022071090

Abstract

In fine-grained visual recognition, the differences between highly similar categories are subtle, so accurate extraction of subtle image features is crucial to recognition accuracy. Using attention mechanisms to extract category features has become a trend among current algorithms; however, these algorithms ignore subtle yet distinguishable features and treat the features of an object's different discriminative regions in isolation. To address these problems, a fine-grained image recognition algorithm based on mid-level subtle feature extraction and multi-scale feature fusion was proposed. First, the salient features of the image were extracted by applying a weight variance measure to mid-level features fused with channel and position information. Then, a mask matrix was obtained through channel average pooling to suppress the salient features and enhance the extraction of subtle features in other discriminative regions. Finally, channel weight information and pixel complementary information were used to obtain multi-scale fusion features of channels and pixels, enhancing the diversity and richness of the features of different discriminative regions. Experimental results show that the proposed algorithm achieves 89.52% Top-1 accuracy and 98.46% Top-5 accuracy on the CUB-200-2011 dataset, 94.64% Top-1 accuracy and 98.62% Top-5 accuracy on the Stanford Cars dataset, and 93.20% Top-1 accuracy and 97.98% Top-5 accuracy on the Fine-Grained Visual Classification of Aircraft (FGVC-Aircraft) dataset. Compared with the Progressive Co-Attention Network (PCA-Net) algorithm, a recurrent collaborative attention feature learning network, the proposed algorithm improves Top-1 accuracy by 1.22, 0.34 and 0.80 percentage points on the three datasets respectively, and Top-5 accuracy by 1.03, 0.88 and 1.12 percentage points respectively.
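The salient-feature suppression step can be illustrated with a minimal NumPy sketch. The abstract only states that a mask matrix is obtained through channel average pooling; the min-max normalization, the binarization rule, and the `threshold` parameter below are assumptions made for illustration, and `suppress_salient` is a hypothetical helper rather than the authors' implementation.

```python
import numpy as np

def suppress_salient(features, threshold=0.8):
    """Zero out the most salient positions of a (C, H, W) feature map.

    Assumption: a position counts as salient when its normalized
    channel-averaged response is at or above `threshold`.
    """
    # Channel average pooling: mean over the channel axis gives an (H, W) map.
    saliency = features.mean(axis=0)
    # Min-max normalize to [0, 1] so the threshold does not depend on scale.
    span = saliency.max() - saliency.min()
    norm = (saliency - saliency.min()) / (span + 1e-12)
    # Mask is 0 at salient positions and 1 everywhere else.
    mask = (norm < threshold).astype(features.dtype)
    # Broadcasting applies the same (H, W) mask to every channel.
    return features * mask, mask

# Toy input: one strongly salient position and one weaker ("subtle") position.
feats = np.zeros((4, 3, 3), dtype=np.float32)
feats[:, 1, 1] = 10.0
feats[:, 0, 2] = 2.0
masked, mask = suppress_salient(feats, threshold=0.8)
```

In this toy input the strong response at position (1, 1) is masked out while the weaker response at (0, 2) survives, which is the effect the abstract describes: later layers are pushed to attend to the remaining, less prominent regions.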

Key words: fine-grained image recognition, attention mechanism, weight variance, mask matrix, multi-scale fusion, mid-level feature

CLC Number:

TP391.4

Ailing QI, Xuanlin WANG. Fine-grained image recognition based on mid-level subtle feature extraction and multi-scale feature fusion[J]. Journal of Computer Applications, 2023, 43(8): 2556-2563.

Figures/Tables 12

Fig. 1 Overall network structure

Fig. 2 Structure of CPFDEN

Fig. 3 Structure of CSMFN

Fig. 4 Residual block structure in ResNet

Tab. 1 Statistics of three fine-grained datasets

Dataset	Name	Classes	Training samples	Test samples
CUB-200-2011	Bird	200	5 994	5 794
Stanford Cars	Car	196	8 144	8 041
FGVC-Aircraft	Aircraft	100	6 667	3 333

Fig. 5 Examples from datasets

Tab. 2 Top-1 accuracy (%) for different ϕ values on the three datasets

ϕ	CUB-200-2011	Stanford Cars	FGVC-Aircraft
0.5	87.64	91.10	90.94
0.6	88.10	92.54	91.46
0.7	88.94	93.36	93.20
0.8	89.52	94.64	92.79
0.9	89.18	93.87	92.56
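As a quick check of Tab. 2, the following Python sketch picks the ϕ value with the highest Top-1 accuracy for each dataset. The numbers are copied from the table; the variable names are illustrative only.

```python
# Top-1 accuracy (%) from Tab. 2, one list entry per phi value.
phi_values = [0.5, 0.6, 0.7, 0.8, 0.9]
top1 = {
    "CUB-200-2011": [87.64, 88.10, 88.94, 89.52, 89.18],
    "Stanford Cars": [91.10, 92.54, 93.36, 94.64, 93.87],
    "FGVC-Aircraft": [90.94, 91.46, 93.20, 92.79, 92.56],
}

# For each dataset, find the index of the maximum accuracy and map it to phi.
best_phi = {
    name: phi_values[max(range(len(accs)), key=accs.__getitem__)]
    for name, accs in top1.items()
}
# Best accuracy per dataset, for comparison with the abstract.
best_acc = {name: max(accs) for name, accs in top1.items()}
```

The best setting is ϕ = 0.8 for CUB-200-2011 and Stanford Cars and ϕ = 0.7 for FGVC-Aircraft, and the corresponding accuracies (89.52, 94.64 and 93.20) are the Top-1 results quoted in the abstract.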

Fig. 6 Training process of the proposed algorithm on each dataset

Tab. 3 Results of ablation experiments on three datasets

Algorithm	CUB-200-2011 Top-1	CUB-200-2011 Top-5	Stanford Cars Top-1	Stanford Cars Top-5	FGVC-Aircraft Top-1	FGVC-Aircraft Top-5
ResNet	85.50	92.54	89.80	94.63	90.30	94.41
ResNet-CPFDEN	88.94	96.44	93.40	97.82	92.60	96.83
ResNet-CPFDEN-CSMFN	89.52	98.46	94.64	98.62	93.20	97.98

Fig. 7 Effect comparison of heatmaps by two algorithms

Tab. 4 Comparison of Top-1 classification accuracy of different algorithms on three datasets

Algorithm	CUB-200-2011	Stanford Cars	FGVC-Aircraft
DCL-Net	87.40	93.10	91.70
TPA-CNN	88.00	94.00	91.70
ACB-Net	88.10	94.60	92.40
Proposed algorithm	89.52	94.64	93.20

Tab. 5 Comparison of accuracy of the proposed algorithm with PPL-Net and PCA-Net algorithms on three datasets

Algorithm	CUB-200-2011 Top-1	CUB-200-2011 Top-5	Stanford Cars Top-1	Stanford Cars Top-5	FGVC-Aircraft Top-1	FGVC-Aircraft Top-5
PPL-Net	88.30	—	94.00	—	92.60	—
PCA-Net	88.30	97.43	94.30	97.74	92.40	96.86
Proposed algorithm	89.52	98.46	94.64	98.62	93.20	97.98
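The percentage-point gains over PCA-Net quoted in the abstract can be recomputed from the PCA-Net row and the proposed algorithm's row of Tab. 5. The short Python sketch below does this; the dictionary layout is an illustrative choice, and each tuple holds (Top-1, Top-5) accuracy in percent.

```python
# (Top-1, Top-5) accuracy in percent, copied from Tab. 5.
pca_net = {
    "CUB-200-2011": (88.30, 97.43),
    "Stanford Cars": (94.30, 97.74),
    "FGVC-Aircraft": (92.40, 96.86),
}
proposed = {
    "CUB-200-2011": (89.52, 98.46),
    "Stanford Cars": (94.64, 98.62),
    "FGVC-Aircraft": (93.20, 97.98),
}

# Gain in percentage points per dataset, rounded to two decimals
# to hide floating-point noise from the subtraction.
gains = {
    name: tuple(round(p - b, 2) for p, b in zip(proposed[name], pca_net[name]))
    for name in proposed
}

for name, (top1_gain, top5_gain) in gains.items():
    print(f"{name}: Top-1 +{top1_gain:.2f}, Top-5 +{top5_gain:.2f}")
```

The recomputed gains are 1.22, 0.34 and 0.80 percentage points for Top-1 and 1.03, 0.88 and 1.12 for Top-5, matching the figures given in the abstract.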
