U-shaped Network (U-Net) based on Fully Convolutional Network (FCN) is widely used as the backbone of medical image segmentation models, but Convolutional Neural Network (CNN) is not good at capturing long-range dependencies, which limits further performance improvement of segmentation models. To solve this problem, researchers have applied Transformer to medical image segmentation models to make up for the deficiency of CNN, and U-shaped segmentation networks combined with Transformer have become a hot research topic. After a detailed introduction of U-Net and Transformer, the related medical image segmentation models were categorized by the position of the Transformer module: only in the encoder or decoder, in both the encoder and decoder, as a skip-connection, and others. The basic contents, design concepts and possible improvement aspects of these models were discussed, and the advantages and disadvantages of placing Transformer in different positions were analyzed. The analysis shows that the biggest factor in deciding the position of Transformer is the characteristics of the target segmentation task, and that segmentation models combining Transformer with U-Net can make better use of the advantages of both CNN and Transformer to improve segmentation performance, which gives them great development prospects and research value.
Aiming at the problems of low detection precision, poor robustness, and imperfect related systems in the current small object detection of electric vehicle helmets, an electric vehicle helmet detection model based on an improved YOLOv5s algorithm was proposed. In the proposed model, Convolutional Block Attention Module (CBAM) and Coordinate Attention (CA) module were introduced, and an improved Non-Maximum Suppression (NMS) method, Distance Intersection over Union-Non-Maximum Suppression (DIoU-NMS), was used. At the same time, multi-scale feature fusion detection was added and a densely connected network was combined to improve the feature extraction effect. Finally, a helmet detection system for electric vehicle drivers was established. On the self-built electric vehicle helmet wearing dataset, compared with the original YOLOv5s, the improved YOLOv5s algorithm had the mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5 increased by 7.1 percentage points and the Recall increased by 1.6 percentage points. Experimental results show that the improved YOLOv5s algorithm can better meet the requirements for detection precision of electric vehicles and the helmets of their drivers in actual situations, and reduce the incidence rate of electric vehicle traffic accidents to a certain extent.
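As an illustration of the DIoU-NMS step named in this abstract, the following is a minimal sketch in plain Python. It assumes boxes are given as (x1, y1, x2, y2) tuples and uses the commonly cited DIoU definition (IoU minus the squared center distance divided by the squared diagonal of the enclosing box). The function names and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """DIoU = IoU - (squared center distance) / (squared enclosing diagonal)."""
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    center_dist2 = (cax - cbx) ** 2 + (cay - cby) ** 2
    # Smallest box that encloses both a and b.
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    penalty = center_dist2 / diag2 if diag2 > 0 else 0.0
    return iou(a, b) - penalty


def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that suppresses a box when its DIoU with a kept box exceeds threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if diou(boxes[best], boxes[i]) <= threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = diou_nms(boxes, scores)
```

Because the center-distance penalty lowers the overlap score of boxes whose centers are far apart, two overlapping boxes that belong to different nearby objects are less likely to suppress each other than under plain IoU-based NMS.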
6 Degree of Freedom (DoF) pose estimation is a key technology in computer vision and robotics, and has become a crucial task in the fields such as robot operation, automatic driving, augmented reality by estimating 6 DoF pose of an object from a given input image, that is, 3 DoF translation and 3 DoF rotation. Firstly, the concept of 6 DoF pose and the problems of traditional methods based on feature point correspondence, template matching, and three-dimensional feature descriptors were introduced. Then, the current mainstream 6 DoF pose estimation algorithms based on deep learning were introduced in detail from different angles of feature correspondence-based, pixel voting-based, regression-based and multi-object instances-oriented, synthesis data-oriented, and category level-oriented. At the same time, the datasets and evaluation indicators commonly used in pose estimation were summarized and sorted out, and some algorithms were evaluated experimentally to show their performance. Finally, the challenges and the key research directions in the future of pose estimation were given.
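To make the notion of a 6 DoF pose concrete, the sketch below applies a pose to a 3D point in plain Python. It assumes the 3 DoF rotation is given as an axis-angle vector (converted with Rodrigues' formula) and the 3 DoF translation as a 3-vector; this parameterization and the function names are illustrative choices, not something prescribed by the survey.

```python
import math


def rotation_from_axis_angle(rvec):
    """Rodrigues' formula: 3-parameter rotation vector -> 3x3 rotation matrix."""
    theta = math.sqrt(sum(c * c for c in rvec))
    if theta < 1e-12:
        # No rotation: return the identity matrix.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (c / theta for c in rvec)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def apply_pose(rvec, tvec, point):
    """Map a point from object coordinates to camera coordinates: R @ p + t."""
    R = rotation_from_axis_angle(rvec)
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + tvec[i] for i in range(3)
    )


# Rotate 90 degrees about the z axis, then shift 5 units along z.
moved = apply_pose((0.0, 0.0, math.pi / 2), (0.0, 0.0, 5.0), (1.0, 0.0, 0.0))
```

Here the six numbers in `rvec` and `tvec` together form the pose; the point (1, 0, 0) is carried to approximately (0, 1, 5).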
Aiming at the problem of missed detection of small objects in the object detection process, an improved YOLOv5 (You Only Look Once) object detection algorithm based on attention mechanism and multi-scale context information was proposed. Firstly, Multiscale Dilated Separable Convolutional Module (MDSCM) was added to the feature extraction structure to extract multi-scale feature information, increasing the receptive field while avoiding the loss of small object information. Secondly, the attention mechanism was added to the backbone network, and the location awareness information was embedded in the channel information, so as to further enhance the feature expression ability of the algorithm. Finally, Soft-NMS (Soft-Non-Maximum Suppression) was used instead of the NMS (Non-Maximum Suppression) used by YOLOv5 to reduce the missed detection rate of the algorithm. Experimental results show that the improved algorithm achieves detection precisions of 82.80%, 71.74% and 77.11% respectively on PASCAL VOC dataset, DOTA aerial image dataset and DIOR optical remote sensing dataset, which are 3.70, 1.49 and 2.48 percentage points higher than those of YOLOv5, and it has better detection effect on small objects. Therefore, the improved YOLOv5 can be better applied to small object detection scenarios in practice.
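The Soft-NMS step mentioned in this abstract can be sketched as follows in plain Python. This is a minimal illustration of the Gaussian-decay variant, in which overlapping boxes have their scores reduced rather than being discarded outright. The box format (x1, y1, x2, y2), the function names, and the values of `sigma` and `score_threshold` are assumptions and are not taken from the paper.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay scores of overlapping boxes instead of removing them."""
    scores = list(scores)  # work on a copy so the caller's list is untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
        # Drop only boxes whose decayed score has become negligible.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep, scores


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
keep, new_scores = soft_nms(boxes, [0.9, 0.8, 0.7])
```

In this example the second box overlaps the first heavily, so its score is lowered but it is still returned, which is the behavior that helps when small objects sit close together.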
Aiming at the problem of low accuracy of ship target detection at sea, a lightweight ship target detection algorithm YOLOShip was proposed on the basis of the improved YOLOv5. Firstly, dilated convolution and channel attention were introduced into Spatial Pyramid Pooling-Fast (SPPF) module, which integrated spatial feature details of different scales, strengthened semantic information, and improved the model’s ability to distinguish foreground and background. Secondly, coordinate attention and lightweight mixed depthwise convolution were introduced into Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) structures to strengthen important features in the network, obtain features with more detailed information, and improve model detection ability and positioning precision. Thirdly, considering the uneven distribution and relatively small scale changes of targets in the dataset, the model performance was further improved while the model was simplified by modifying the anchors and decreasing the number of detection heads. Finally, a more flexible Polynomial Loss (PolyLoss) was introduced to optimize Binary Cross Entropy Loss (BCE Loss) to improve the model convergence speed and model precision. Experimental results show that on dataset SeaShips, in comparison with YOLOv5s, YOLOShip has the Precision, Recall, mAP@0.5 and mAP@0.5:0.95 increased by 4.2, 5.7, 4.6 and 8.5 percentage points. Thus, by using the proposed algorithm, better detection precision can be obtained while meeting the requirements of detection speed, effectively achieving high-speed and high-precision ship detection.
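As a rough illustration of how a polynomial term can be added to Binary Cross Entropy, the sketch below implements the widely used Poly-1 form, BCE plus epsilon times (1 - p_t), for a single prediction in plain Python. Whether YOLOShip uses exactly this form is not stated in the abstract, so the formula choice, the function name and the default `epsilon` are assumptions.

```python
import math


def poly1_bce(p, y, epsilon=1.0):
    """Poly-1 loss for one binary prediction.

    p: predicted probability of the positive class, y: label (0 or 1).
    With epsilon = 0 this reduces to ordinary binary cross entropy.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    pt = p if y == 1 else 1.0 - p      # probability assigned to the true class
    return -math.log(pt) + epsilon * (1.0 - pt)


plain = poly1_bce(0.9, 1, epsilon=0.0)
poly = poly1_bce(0.9, 1, epsilon=1.0)
```

The extra term grows linearly as the true-class probability falls, so `epsilon` acts as one additional knob for reweighting the loss on top of the logarithmic term.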
Aiming at the problem of low accuracy of the existing cross-view image matching algorithms, an Unmanned Aerial Vehicle (UAV) image localization method based on Multi-view and Multi-supervision Network (MMNet) was proposed. Firstly, in the proposed method, satellite perspective and UAV perspective were integrated, global and local features were learnt under a unified network architecture, then classification network was trained and metric tasks were performed in multi-supervision way. Specifically, the Reweighted Regularization Triplet loss (RRT) was mainly used by MMNet to learn global features. In this loss, the reweighting and distance regularization strategies were to solve the problems of imbalance of multi-view samples and structure disorder of the feature space. Simultaneously, in order to pay attention to the context information of the central building in target location, the local features were obtained by MMNet via square ring cutting. After that, the cross entropy loss and RRT were used to perform classification and metric tasks respectively. Finally, the global and local features were aggregated by using a weighted strategy to present target location images. MMNet achieved Recall@1 (R@1) of 83.97% and Average Precision (AP) of 86.96% in UAV localization tasks on the currently popular UAV dataset University-1652. Experimental results show that MMNet significantly improves the accuracy of cross-view image matching, and then enhances the practicability of UAV image localization compared with LCM (cross-view Matching based on Location Classification), SFPN (Salient Feature Partition Network) and other methods.
Aiming at the problems such as small object size, arbitrary object direction and complex background of remote sensing images, on the basis of YOLOv5 (You Only Look Once version 5) algorithm, an algorithm involved with geometric adaptation and global perception was proposed. Firstly, deformable convolutions and adaptive spatial attention modules were stacked alternately in series through dense connections. As a result, a Dense Context-Aware Module (DenseCAM) which can model local geometric features was constructed on the basis of taking full advantage of different levels of semantic and location information. Secondly, by introducing Transformer in the end of the backbone network, the global perception ability of the model was enhanced at a low cost and the relationships between objects and scenario content were modeled. On UCAS-AOD and RSOD datasets, compared with YOLOv5s6 algorithm, the proposed algorithm has the mean Average Precision (mAP) increased by 1.8 percentage points and 1.5 percentage points, respectively. Experimental results show that the proposed algorithm can effectively improve the precision of object detection in remote sensing images.
2D/3D medical image registration is a key technology in 3D real-time navigation of orthopedic surgery. However, the traditional 2D/3D registration methods based on optimization iteration require multiple iterative calculations, which cannot meet the requirements of doctors for real-time registration during surgery. To solve this problem, a pose regression network based on autoencoder was proposed. In this network, the geometric pose information was captured through hidden space decoding, thereby quickly regressing the 3D pose of preoperative spine pose corresponding to the intraoperative X-ray image, and the final registration image was generated through reprojection. By introducing new loss functions, the model was constrained by “Rough to Fine” combined registration method to ensure the accuracy of pose regression. In CTSpine1K spine dataset, 100 CT scan image sets were extracted for 10-fold cross-validation. Experimental results show that the registration result image generated by the proposed model has the Mean Absolute Error (MAE) with the X-ray image of 0.04, the mean Target Registration Error (mTRE) with the X-ray image of 1.16 mm, and the single frame consumption time of 1.7 s. Compared to the traditional optimization based method, the proposed model has registration time greatly shortened. Compared with the learning-based method, this model ensures a high registration accuracy with quick registration. Therefore, the proposed model can meet the requirements of intraoperative real-time high-precision registration.
Aiming at the problems of low accuracy and poor robustness of traditional point cloud registration algorithms and the inability to achieve accurate radiotherapy for cancer patients before and after radiotherapy, an Attention Dynamic Graph Convolutional Neural Network Lucas-Kanade (ADGCNNLK) was proposed. Firstly, residual attention mechanism was added to Dynamic Graph Convolutional Neural Network (DGCNN) to effectively utilize spatial information of point cloud and reduce information loss. Then, the DGCNN with residual attention mechanism was used to extract point cloud features. This process was able not only to capture the local geometric features of the point cloud while maintaining the permutation invariance of the point cloud, but also to aggregate the information semantically, thereby improving the registration efficiency. Finally, the extracted feature points were mapped to a high-dimensional space, and the classic image iterative registration algorithm LK (Lucas-Kanade) was used for registration of the nodes. Experimental results show that compared with Iterative Closest Point (ICP), Globally optimal ICP (Go-ICP) and PointNetLK, the proposed algorithm has the best registration effect with or without noise. In the case without noise, compared with PointNetLK, the proposed algorithm has the rotation mean squared error reduced by 74.61%, and the translation mean squared error reduced by 47.50%; in the case with noise, compared with PointNetLK, the proposed algorithm has the rotation mean squared error reduced by 73.13%, and the translation mean squared error reduced by 44.18%, indicating that the proposed algorithm is more robust than PointNetLK. The proposed algorithm was also applied to the registration of human point cloud models of cancer patients before and after radiotherapy, assisting doctors in treatment and realizing precise radiotherapy.
In order to generate more accurate and smooth virtual human animation, the Kinect device was used to capture the 3D human body pose data, and at the same time a monocular 3D human body pose estimation algorithm was used to infer the skeleton points from the color information of the Kinect, thereby optimizing the human pose estimation effect in real time and driving the virtual character model to generate animation. Firstly, a spatio-temporal optimization method of skeleton point data processing was proposed to improve the stability of monocular estimation of the 3D human body pose. Secondly, a human pose estimation method based on the fusion of Kinect and Occlusion-Robust Pose-Maps (ORPM) algorithm was proposed to solve the occlusion problem of Kinect. Finally, a virtual human animation system based on quaternion vector interpolation and inverse kinematics constraints was developed, which was able to perform motion simulation and real-time animation generation. Compared with the animation generation method that only uses Kinect to capture human motion, the proposed method produces more robust human body estimation data and has a certain anti-occlusion ability. The frame rate of the animation generated by this method is two times higher than that of the ORPM-based animation generation method, so the animation generated by the proposed method is more realistic and smooth.
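The quaternion interpolation mentioned in this abstract can be illustrated with spherical linear interpolation (slerp) between two joint rotations. The sketch below is plain Python and assumes unit quaternions stored as (w, x, y, z) tuples; the abstract does not say which interpolation scheme the system uses, so slerp is offered here only as one common choice, and the function name is hypothetical.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip one to take the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: fall back to linear interpolation.
        out = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        out = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
halfway = slerp(identity, quarter_turn_z, 0.5)
```

Interpolating halfway between no rotation and a 90 degree turn about z gives a 45 degree turn about z, so in-between animation frames rotate at a constant angular speed rather than jumping between captured poses.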
Aiming at the problem that the existing Handwritten Mathematical Expression Recognition (HMER) methods reduce image resolution and lose feature information after multiple pooling operations in Convolutional Neural Network (CNN), which leads to parsing errors, an encoder-decoder model for HMER based on attention mechanism was proposed. Firstly, Densely connected convolutional Network (DenseNet) was used as the encoder, so that the dense connections were used to enhance feature extraction, promote gradient propagation, and alleviate vanishing gradient. Secondly, Gated Recurrent Unit (GRU) was used as the decoder, and attention mechanism was introduced, so that attention was allocated to different regions of the image to realize symbol recognition and structural analysis accurately. Finally, the handwritten mathematical expression images were encoded, and the encoding results were decoded into LaTeX sequences. Experimental results on Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) dataset show that the proposed model has the recognition rate improved to 40.39%, and within the allowable error range of three levels, the model has the recognition rate improved to 52.74%, 58.82% and 62.98%, respectively. Compared with the Bidirectional Long Short-Term Memory (BLSTM) network model, the proposed model increases the recognition rate by 3.17 percentage points, and within the allowable error range of three levels, the proposed model has the recognition rate increased by 8.52 percentage points, 11.56 percentage points, and 12.78 percentage points, respectively. It can be seen that the proposed model can accurately parse the handwritten mathematical expression images, generate LaTeX sequences, and improve the recognition rate.
Infrared small targets occupy few pixels and lack features such as color, texture and shape, so it is difficult to track them effectively. To solve this problem, an infrared small target tracking method based on state information was proposed. Firstly, the target, background and distractors in the local area of the small target to be detected were encoded to obtain dense local state information between consecutive frames. Secondly, feature information of the current and the previous frames was input into the classifier to obtain the classification score. Thirdly, the state information and the classification score were fused to obtain the final degree of confidence and determine the center position of the small target to be detected. Finally, the state information was updated and propagated between the consecutive frames. After that, the propagated state information was used to track the infrared small target throughout the entire sequence. The proposed method was validated on an open dataset DIRST (Dataset for Infrared detection and tRacking of dim-Small aircrafT). Experimental results show that for infrared small target tracking, the recall of the proposed method reaches 96.2%, and the precision of the method reaches 97.3%, which are both 3.7% higher than those of the current best tracking method KeepTrack. This shows that the proposed method can effectively complete the tracking of small infrared targets under complex background and interference.
Camouflaged Object Detection (COD) aims to detect objects hidden in complex environments. The existing COD algorithms ignore the influence of feature expression and fusion methods on detection performance when combining multi-level features. Therefore, a COD algorithm based on progressive feature enhancement aggregation was proposed. Firstly, multi-level features were extracted through the backbone network. Then, in order to improve the expression ability of features, an enhancement network composed of Feature Enhancement Module (FEM) was used to enhance the multi-level features. Finally, Adjacency Aggregation Module (AAM) was designed in the aggregation network to achieve information fusion between adjacent features to highlight the features of the camouflaged object area, and a new Progressive Aggregation Strategy (PAS) was proposed to aggregate adjacent features in a progressive way to achieve effective multi-level feature fusion while suppressing noise. Experimental results on 3 public datasets show that the proposed algorithm achieves the best performance on 4 objective evaluation indexes compared with 12 state-of-the-art algorithms. In particular, on COD10K dataset, the weighted F-measure and the Mean Absolute Error (MAE) of the proposed algorithm reach 0.809 and 0.037 respectively. It can be seen that the proposed algorithm achieves better performance on COD tasks.
Aiming at the problems in Multi-scale Generative Adversarial Networks Image Inpainting algorithm (MGANII), such as unstable training in the process of image inpainting, poor structural consistency, and insufficient details and textures of the inpainted image, an image inpainting algorithm of multi-scale generative adversarial network was proposed based on multi-feature fusion. Firstly, aiming at the problems of poor structural consistency and insufficient details and textures, a Multi-Feature Fusion Module (MFFM) was introduced in the traditional generator, and a perception-based feature reconstruction loss function was introduced to improve the ability of feature extraction in the dilated convolutional network, thereby supplying more details and texture features for the inpainted image. Then, a perception-based feature matching loss function was introduced into the local discriminator to enhance the discrimination ability of the discriminator, thereby improving the structural consistency of the inpainted image. Finally, a risk penalty term was introduced into the adversarial loss function to meet the Lipschitz continuity condition, so that the network was able to converge rapidly and stably in the training process. On the dataset CelebA, compared with MGANII, the proposed multi-feature fusion image inpainting algorithm converges faster. Meanwhile, the Peak Signal-to-Noise Ratio (PSNR) and Structural SIMilarity (SSIM) of the images inpainted by the proposed algorithm are improved by 0.45% to 8.67% and 0.88% to 8.06% respectively compared with those of the images inpainted by the baseline algorithms, and the Frechet Inception Distance score (FID) of the images inpainted by the proposed algorithm is 36.01% to 46.97% lower than that of the images inpainted by the baseline algorithms. Experimental results show that the inpainting performance of the proposed algorithm is better than that of the baseline algorithms.
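For readers unfamiliar with the PSNR metric reported in this abstract, the sketch below computes it from its standard definition, 10 * log10(MAX^2 / MSE), in plain Python. Images are assumed to be flattened into equal-length lists of pixel values with a peak value of 255; the function name and the sample pixel values are illustrative only and unrelated to the reported experiments.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical two-pixel example: each pixel is off by 10 gray levels.
value = psnr([100, 100], [110, 90])
```

A higher PSNR means the inpainted image is closer to the ground truth pixel by pixel, which is why it is usually reported alongside SSIM and FID, which measure structural and distributional similarity instead.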
For the problems of unbalanced detection speed and recognition accuracy of traffic sign recognition models, and the difficulty of detecting occluded targets and small targets, YOLOv5 (You Only Look Once version 5) model was improved, and a lightweight traffic sign recognition model based on Coordinate Attention (CA) was proposed. Firstly, CA mechanism was integrated into the backbone network to effectively capture the relationships between location information and channels, so as to obtain the regions of interest more accurately and avoid too much computational overhead. Then, cross layer connections were added to the feature fusion network to fuse more feature information without increasing the cost, improving the feature extraction ability of the network and the detection effect of occluded targets. Finally, the improved CIoU (Complete Intersection over Union) function was introduced to calculate the localization loss, thereby alleviating the uneven distribution of sample size in the detection process, and further improving the recognition accuracy of small targets. On TT100K (Tsinghua-Tencent 100K) dataset, the proposed model achieves a recognition accuracy of 91.5% and a recall of 86.64%, which are improved by 20.96% and 11.62% respectively compared with those of the traditional YOLOv5n model, and the frame processing rate is 140.84 FPS (Frames Per Second). These experimental results verify the accuracy and real-time performance of the proposed model for traffic sign detection and recognition in real scenes.