Fish image classification based on positional overlapping patch embedding and multi-scale channel interactive attention

Journal of Computer Applications, 2024, Vol. 44, Issue 10: 3209-3216. DOI: 10.11772/j.issn.1001-9081.2023101466

• Multimedia computing and computer simulation •

Wen ZHOU¹, Yuzhang CHEN¹, Zhiyuan WEN², Shiqi WANG¹

1. School of Artificial Intelligence, Hubei University, Wuhan, Hubei 430062, China
2. School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China

Received: 2023-10-30 Revised: 2024-01-28 Accepted: 2024-01-29 Online: 2024-10-15 Published: 2024-10-10
Contact: Yuzhang CHEN
About the authors: ZHOU Wen, born in 1999, M.S. candidate. Her research interests include deep learning and image classification.
WEN Zhiyuan, born in 1997, M.S. candidate. His research interests include multimodal deep learning.
WANG Shiqi, born in 1999, M.S. candidate. Her research interests include deep learning and neural networks.
Supported by: Industry-University Cooperative Education Program of the Ministry of Education (202101142041)

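Author biographies: ZHOU Wen (周雯), born in 1999, female, from Wuhan, Hubei, M.S. candidate, CCF member. Research interests: deep learning, image classification.
CHEN Yuzhang (谌雨章), born in 1984, male, from Wuhan, Hubei, associate professor, Ph.D. Research interests: photoelectric detection, image processing. hubucyz@foxmail.com
WEN Zhiyuan (温志远), born in 1997, male, from Xiaogan, Hubei, M.S. candidate. Research interests: multimodal deep learning.
WANG Shiqi (王诗琦), born in 1999, female, from Wuhan, Hubei, M.S. candidate. Research interests: deep learning, neural networks.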

Abstract

Underwater fish image classification is a highly challenging task. The backbone of the traditional Vision Transformer (ViT) is limited in its ability to process local continuous features, and it performs poorly on fish classification when image quality is low. To solve this problem, a Transformer-based image classification network built on Overlapping Patch Embedding (OPE) with positional encoding and Multi-scale Channel Interactive Attention (MCIA), called PIFormer (Positional overlapping and Interactive attention transFormer), was proposed. PIFormer was built in a multi-level form, with each level stacked a different number of times to facilitate the extraction of features at different depths. Firstly, the deep Positional Overlapping Patch Embedding (POPE) module was introduced to slice the feature map and edge information into overlapping patches, so as to retain the local continuous features of the fish body. At the same time, position information was added to order the patches, helping PIFormer integrate detailed features and build a global mapping. Then, the MCIA module was proposed to process local and global features in parallel and to establish long-distance dependencies between different parts of the fish body. Finally, the high-level features were processed in groups by a Group Multi-Layer Perceptron (GMLP) to improve network efficiency and produce the final fish classification. To verify the effectiveness of PIFormer, a self-built dataset of freshwater fishes in East Lake was constructed, and the public datasets Fish4Knowledge and NCFM (Nature Conservancy Fisheries Monitoring) were used to ensure experimental fairness. Experimental results demonstrate that the Top-1 classification accuracy of the proposed network reaches 97.99%, 99.71% and 90.45% on the East Lake, Fish4Knowledge and NCFM datasets, respectively. Compared with ViT, Swin Transformer and PVT (Pyramid Vision Transformer) of comparable depth, the proposed network reduces the number of parameters by 72.62×10⁶, 14.34×10⁶ and 11.30×10⁶, respectively, and the number of FLoating-point OPerations (FLOPs) by 14.52×10⁹, 2.02×10⁹ and 1.48×10⁹, respectively. PIFormer thus delivers strong fish image classification capability with a reduced computational burden.

Key words: fish image classification, position encoding, Overlapping Patch Embedding (OPE), channel interaction attention, Vision Transformer (ViT)
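
The overlapping slicing that the abstract attributes to OPE/POPE can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the patch size of 7, the stride of 4 and the zero padding are all assumptions chosen for illustration, and the grid index returned alongside each patch only stands in for the position information the abstract mentions. The one property it demonstrates is that a stride smaller than the patch size makes neighbouring patches share pixels, which is how local continuity across patch borders is kept.

```python
import numpy as np

def overlapping_patches(image, patch=7, stride=4):
    """Slice an H x W x C array into overlapping patch x patch windows.

    patch and stride are illustrative values, not taken from the paper.
    stride < patch is what makes neighbouring patches overlap.
    """
    pad = patch // 2
    # Zero-pad the borders so edge pixels are also covered by full windows.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = padded.shape
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    # Flatten every window into one token vector.
    tokens = np.stack([
        padded[r:r + patch, c:c + patch].reshape(-1)
        for r in rows for c in cols
    ])
    # (row, col) grid index of each token, a stand-in for position information.
    positions = np.array([(i, j) for i in range(len(rows)) for j in range(len(cols))])
    return tokens, positions, (len(rows), len(cols))

image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
tokens, positions, grid = overlapping_patches(image)
print(tokens.shape, grid)
```

With these assumed settings, horizontally adjacent patches share patch - stride = 3 pixel columns, whereas the non-overlapping slicing of a plain ViT shares none.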

CLC Number: TP391.4

Wen ZHOU, Yuzhang CHEN, Zhiyuan WEN, Shiqi WANG. Fish image classification based on positional overlapping patch embedding and multi-scale channel interactive attention[J]. Journal of Computer Applications, 2024, 44(10): 3209-3216.

References

1	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
2	HE K, ZHANG X, REN S, et al. Identity mappings in deep residual networks[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9908. Cham: Springer, 2016: 630-645.
3	HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
4	TAN M, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks[C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 6105-6114.
5	TAN M, LE Q V. EfficientNetV2: smaller models and faster training[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 10096-10106.
6	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03) [2024-01-05].
7	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
8	LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002.
9	WANG W, XIE E, LI X, et al. Pyramid Vision Transformer: a versatile backbone for dense prediction without convolutions[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 548-558.
10	WANG W, XIE E, LI X, et al. PVT v2: improved baselines with pyramid vision Transformer[J]. Computational Visual Media, 2022, 8(3): 415-424.
11	GUO M H, LU C Z, LIU Z N, et al. Visual attention network[J]. Computational Visual Media, 2023, 9(3): 733-752.
12	DING M, XIAO B, CODELLA N, et al. DaViT: dual attention vision Transformers[C]// Proceedings of the 2022 European Conference on Computer Vision, LNCS 13684. Cham: Springer, 2022: 74-92.
13	LI K, WANG Y, ZHANG J, et al. UniFormer: unifying convolution and self-attention for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(10): 12581-12600.
14	YANG J, LI C, DAI X, et al. Focal modulation networks[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 4203-4217.
15	YU W, LUO M, ZHOU P, et al. MetaFormer is actually what you need for vision[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 10809-10819.
16	LI X, LI F, YU J, et al. A high-precision underwater object detection based on joint self-supervised deblurring and improved spatial Transformer network[EB/OL]. (2022-03-09) [2024-01-05].
17	XU X, QIN Y, XI D, et al. MulTNet: a multi-scale Transformer network for marine image segmentation toward fishing[J]. Sensors, 2022, 22(19): No.7224.
18	GONG B, DAI K, SHAO J, et al. Fish-TViT: a novel fish species classification method in multi water areas based on transfer learning and vision Transformer[J]. Heliyon, 2023, 9(6): No.e16761.
19	CUI Y, HAN J C, GAO S, et al. An object detection method of underwater image based on improved Deformable-DETR[J]. Applied Science and Technology, 2024, 51(1): 30-36, 91. (in Chinese)
20	HOWARD A G, ZHU M, CHEN B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications[EB/OL]. (2017-04-17) [2024-01-05].
21	DAI Y, GIESEKE F, OEHMCKE S, et al. Attentional feature fusion[C]// Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2021: 3559-3568.
22	CHU X, TIAN Z, ZHANG B, et al. Conditional positional encodings for vision Transformers[EB/OL]. (2023-02-13) [2024-01-05].
23	WU H, XIAO B, CODELLA N, et al. CvT: introducing convolutions to vision Transformers[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 22-31.
24	CHU X, TIAN Z, WANG Y, et al. Twins: revisiting the design of spatial attention in vision Transformers[C]// Proceedings of the 35th Conference on Neural Information Processing Systems. New York: ACM, 2024: 9355-9366.
25	LOSHCHILOV I, HUTTER F. Fixing weight decay regularization in Adam[EB/OL]. [2024-01-05].
26	FISHER R B, CHEN-BURGER Y H, GIORDANO D, et al. Fish4Knowledge: Collecting and Analyzing Massive Coral Reef Fish Video Data[M]. Cham: Springer, 2016: 1-319.
27	SCHUETZENMEISTER F, MATT M, RISDAL M, et al. The nature conservancy fisheries monitoring[DS/OL]. [2024-01-05].

No.	Module	Top-1 classification accuracy/%	Parameters/10⁶	FLOPs/10⁹
1	+OPE	84.98	13.05	2.23
2	+POPE	92.31	13.05	2.23
3	++MCIA	93.65	13.06	2.32
4	+++GMLP	97.99	13.19	2.34

Model	Depth	Top-1 accuracy/% (East Lake freshwater fish)	Top-1 accuracy/% (Fish4Knowledge)	Top-1 accuracy/% (NCFM)	Parameters/10⁶	FLOPs/10⁹
ResNet34 [2]	[3, 4, 6, 3]	94.07	97.43	86.73	21.29	3.68
EfficientNetV2-Small [5]	[2, 4, 4, 6, 9, 15]	97.62	99.03	88.59	20.19	2.85
ViT-Base [6]	12	74.82	94.76	82.27	85.81	16.86
Swin-Transformer-Tiny [8]	[2, 2, 6, 2]	94.52	97.89	86.21	27.53	4.36
VAN-B1 [11]	[2, 2, 4, 2]	96.70	99.67	89.17	13.36	2.52
PVT-Small [9]	[3, 4, 6, 3]	89.56	98.60	82.60	24.49	3.82
PVT V2-B1 [10]	[2, 2, 2, 2]	83.88	99.08	84.78	13.51	2.12
PIFormer	[2, 2, 4, 2]	97.99	99.71	90.45	13.19	2.34
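
The parameter and FLOPs savings quoted in the abstract can be cross-checked against the comparison table above with a few subtractions. The short script below copies the ViT-Base, Swin-Transformer-Tiny, PVT-Small and PIFormer rows from the table. The variable names are illustrative, and treating these three rows as the baselines meant by the abstract is an assumption drawn from the table.

```python
# (parameters in 10^6, FLOPs in 10^9), copied from the comparison table.
baselines = {
    "ViT-Base": (85.81, 16.86),
    "Swin-Transformer-Tiny": (27.53, 4.36),
    "PVT-Small": (24.49, 3.82),
}
piformer = (13.19, 2.34)

# Savings = baseline value minus PIFormer value, rounded to the table's precision.
savings = {
    name: (round(params - piformer[0], 2), round(flops - piformer[1], 2))
    for name, (params, flops) in baselines.items()
}

for name, (d_params, d_flops) in savings.items():
    print(f"{name}: {d_params:.2f} x 10^6 fewer parameters, {d_flops:.2f} x 10^9 fewer FLOPs")
```

The differences come out to 72.62, 14.34 and 11.30 (×10⁶ parameters) and 14.52, 2.02 and 1.48 (×10⁹ FLOPs), which match the figures stated in the abstract.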
