Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (10): 3209-3216. DOI: 10.11772/j.issn.1001-9081.2023101466

• Multimedia computing and computer simulation •

Fish image classification based on positional overlapping patch embedding and multi-scale channel interactive attention

Wen ZHOU1, Yuzhang CHEN1, Zhiyuan WEN2, Shiqi WANG1

  1. School of Artificial Intelligence, Hubei University, Wuhan Hubei 430062, China
  2. School of Computer Science and Information Engineering, Hubei University, Wuhan Hubei 430062, China
  • Received: 2023-10-30 Revised: 2024-01-28 Accepted: 2024-01-29 Online: 2024-10-15 Published: 2024-10-10
  • Contact: Yuzhang CHEN
  • About the authors: ZHOU Wen, born in 1999, M. S. candidate. Her research interests include deep learning and image classification.
    WEN Zhiyuan, born in 1997, M. S. candidate. His research interests include multimodal deep learning.
    WANG Shiqi, born in 1999, M. S. candidate. Her research interests include deep learning and neural networks.
  • Supported by:
    Industry-University Cooperative Education Program of the Ministry of Education (202101142041)

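Fish image classification based on position-encoded overlapping patch embedding and multi-scale channel interactive attention

ZHOU Wen1, CHEN Yuzhang1, WEN Zhiyuan2, WANG Shiqi1

  1. School of Artificial Intelligence, Hubei University, Wuhan 430062
  2. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062
  • Corresponding author: CHEN Yuzhang
  • About the authors: ZHOU Wen (born in 1999), female, from Wuhan, Hubei, M. S. candidate, CCF member. Her research interests include deep learning and image classification.
    CHEN Yuzhang (born in 1984), male, from Wuhan, Hubei, associate professor, Ph. D. His research interests include photoelectric detection and image processing. hubucyz@foxmail.com
    WEN Zhiyuan (born in 1997), male, from Xiaogan, Hubei, M. S. candidate. His research interests include multimodal deep learning.
    WANG Shiqi (born in 1999), female, from Wuhan, Hubei, M. S. candidate. Her research interests include deep learning and neural networks.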

Abstract:

Underwater fish image classification is a highly challenging task. The backbone of the traditional Vision Transformer (ViT) has limited ability to process local continuous features, so it performs poorly on fish classification when image quality is low. To solve this problem, a Transformer-based image classification network built on Overlapping Patch Embedding (OPE) with position encoding and Multi-scale Channel Interactive Attention (MCIA), called PIFormer (Positional overlapping and Interactive attention transFormer), was proposed. PIFormer was built in a multi-level format, with each level stacked a different number of times, to facilitate the extraction of features at different depths. Firstly, the deep Positional Overlapping Patch Embedding (POPE) module was introduced to slice the feature map and edge information into overlapping patches, so as to retain the local continuous features of the fish body. At the same time, position information was added to order the patches, thereby helping PIFormer integrate detailed features and build the global map. Then, the MCIA module was proposed to process local and global features in parallel and to establish long-distance dependencies between different parts of the fish body. Finally, the high-level features were processed in groups by Group Multi-Layer Perceptron (GMLP) to improve the efficiency of the network and produce the final fish classification. To verify the effectiveness of PIFormer, a self-built dataset of freshwater fishes in East Lake was constructed, and the public datasets Fish4Knowledge and NCFM (Nature Conservancy Fisheries Monitoring) were also used to ensure experimental fairness. Experimental results demonstrate that the Top-1 classification accuracy of the proposed network on the three datasets reaches 97.99%, 99.71% and 90.45% respectively. Compared with ViT, Swin Transformer and PVT (Pyramid Vision Transformer) of the same depth, the proposed network reduces the number of parameters by 72.62×10⁶, 14.34×10⁶ and 11.30×10⁶ respectively, and the FLoating point OPerations (FLOPs) by 14.52×10⁹, 2.02×10⁹ and 1.48×10⁹ respectively. It can be seen that PIFormer achieves strong fish image classification capability with a reduced computational burden.
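
The idea of slicing an image into overlapping patches, as opposed to the disjoint tiles of a standard ViT, can be illustrated with a small sketch. This is not the POPE module itself: the function names, the 8×8 toy image, and the patch and stride sizes are all assumptions chosen for illustration, and the sketch only shows that a stride smaller than the patch size makes neighbouring patches share pixels while each patch keeps its position.

```python
def patch_starts(size: int, patch: int, stride: int) -> list:
    """Offsets of patches along one axis (no padding; hypothetical helper)."""
    if patch > size or stride <= 0:
        raise ValueError("need patch <= size and stride > 0")
    return list(range(0, size - patch + 1, stride))


def slice_patches(image, patch: int, stride: int) -> list:
    """Cut a 2-D list of lists into patch x patch tiles.

    stride == patch gives disjoint, ViT-style tiles;
    stride < patch makes neighbouring tiles overlap.
    Returns (row, col, tile) triples so each tile keeps its position.
    """
    rows = patch_starts(len(image), patch, stride)
    cols = patch_starts(len(image[0]), patch, stride)
    return [
        (r, c, [line[c:c + patch] for line in image[r:r + patch]])
        for r in rows
        for c in cols
    ]


# Toy 8x8 "image" whose pixel value encodes its own coordinates (assumed data).
image = [[10 * r + c for c in range(8)] for r in range(8)]

disjoint = slice_patches(image, patch=4, stride=4)     # no shared pixels
overlapping = slice_patches(image, patch=4, stride=2)  # neighbours share 2 columns

print(len(disjoint), len(overlapping))
```

With these assumed sizes, halving the stride yields more patches, and the right half of each patch reappears as the left half of its neighbour, which is how local continuity across patch borders is retained. The stored `(row, col)` offsets stand in for the position information that lets the patches be ordered.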

Key words: fish image classification, position encoding, Overlapping Patch Embedding (OPE), channel interaction attention, Vision Transformer (ViT)

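Abstract (Chinese version):

Underwater fish image classification is a highly challenging task. The backbone of the traditional Vision Transformer (ViT) has considerable limitations: it struggles to process local continuous features and performs poorly on fish classification when image quality is low. To solve this problem, PIFormer (Positional overlapping and Interactive attention transFormer), a Transformer image classification network based on position-encoded Overlapping Patch Embedding (OPE) and Multi-scale Channel Interactive Attention (MCIA), is proposed. PIFormer is built in a multi-level form, with each level stacked a different number of times, which facilitates the extraction of features at different depths. Firstly, a deep Positional Overlapping Patch Embedding (POPE) module is introduced to slice the feature map and edge information into overlapping patches, so as to retain the local continuous features of the fish body, and position information is added for ordering, helping PIFormer integrate detailed features and build the global map. Secondly, the MCIA module is proposed to process local and global features in parallel and to establish long-distance dependencies between different parts of the fish body. Finally, the Group Multi-Layer Perceptron (GMLP) processes the high-level features in groups to improve network efficiency and produce the final fish classification. To verify the effectiveness of PIFormer, a self-built dataset of freshwater fishes in East Lake is presented, and the public datasets Fish4Knowledge and NCFM (Nature Conservancy Fisheries Monitoring) are used to ensure experimental fairness. Experimental results show that the Top-1 classification accuracy of the proposed network on the three datasets reaches 97.99%, 99.71% and 90.45% respectively. Compared with ViT, Swin Transformer and PVT (Pyramid Vision Transformer) of the same depth, the number of parameters is reduced by 72.62×10⁶, 14.34×10⁶ and 11.30×10⁶ respectively, and the floating point operations (FLOPs) are reduced by 14.52×10⁹, 2.02×10⁹ and 1.48×10⁹ respectively. It can be seen that PIFormer has strong fish image classification capability under a lower computational load.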
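
Why processing features in groups can improve efficiency is easy to see by counting weights. The sketch below is an assumption-laden illustration rather than the GMLP described in the paper: it assumes each linear layer is split into independent groups (as in a grouped convolution), and the layer widths 256 and 1024 and the group count 4 are hypothetical values.

```python
def linear_params(in_features: int, out_features: int, groups: int = 1) -> int:
    """Weights plus biases of a linear layer split into `groups` independent blocks."""
    if in_features % groups or out_features % groups:
        raise ValueError("groups must divide both feature sizes")
    # Each group maps in/groups inputs to out/groups outputs.
    weights = groups * (in_features // groups) * (out_features // groups)
    return weights + out_features  # one bias per output feature


def mlp_params(dim: int, hidden: int, groups: int = 1) -> int:
    """Parameter count of a two-layer MLP: dim -> hidden -> dim."""
    return linear_params(dim, hidden, groups) + linear_params(hidden, dim, groups)


# Hypothetical sizes, chosen only for illustration.
dense = mlp_params(dim=256, hidden=1024)              # groups=1: ordinary MLP
grouped = mlp_params(dim=256, hidden=1024, groups=4)  # four independent groups

print(dense, grouped)
```

Under these assumptions the weight count shrinks by a factor equal to the number of groups, while the bias count is unchanged. The trade-off is that features in different groups no longer interact within the layer.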

Key words (Chinese version): fish image classification, position encoding, overlapping patch embedding, channel interactive attention, Vision Transformer

CLC Number: