Deploying the YOLOv8L model on edge devices for road crack detection can achieve high accuracy, but real-time detection is difficult to guarantee. To solve this problem, an object detection algorithm based on an improved YOLOv8 model, deployable on the edge computing device Jetson AGX Xavier, was proposed. First, a Faster Block structure was designed using partial convolution to replace the Bottleneck structure in the YOLOv8 C2f module, and the improved C2f module was denoted C2f-Faster; second, an SE (Squeeze-and-Excitation) channel attention layer was connected after each C2f-Faster module in the YOLOv8 backbone network to further improve the detection accuracy. Experimental results on the open-source road damage dataset RDD20 (Road Damage Detection 20) show that the average F1 score of the proposed method is 0.573, the detection speed is 47 Frames Per Second (FPS), and the model size is 55.5 MB. Compared with the SOTA (State-Of-The-Art) model of GRDDC2020 (Global Road Damage Detection Challenge 2020), the proposed method has the F1 score increased by 0.8 percentage points, the FPS increased by 291.7%, and the model size reduced by 41.8%, realizing real-time and accurate detection of road cracks on edge devices.
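The two modifications can be pictured concretely. Below is a minimal PyTorch sketch of a Faster Block built on partial convolution, with an SE channel-attention layer appended after it; the channel split ratio, expansion factor, and class names are illustrative assumptions, not the authors' exact implementation. The point of partial convolution is that only a fraction of the channels are convolved, which cuts FLOPs and memory access and makes the block edge-friendly.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Convolve only the first 1/ratio of the channels; pass the rest through."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.conv_ch = channels // ratio              # channels actually convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1)

    def forward(self, x):
        a, b = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(a), b], dim=1)

class FasterBlock(nn.Module):
    """PConv followed by a 1x1 expand/compress pair, replacing the Bottleneck."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PartialConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        return x + self.pw(self.pconv(x))             # residual connection

class SELayer(nn.Module):
    """Squeeze-and-Excitation: reweight channels using global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))               # squeeze: global average pool
        return x * w.view(x.size(0), -1, 1, 1)        # excite: per-channel scaling

x = torch.randn(1, 64, 80, 80)
y = SELayer(64)(FasterBlock(64)(x))                   # one C2f-Faster unit + SE
print(y.shape)                                        # torch.Size([1, 64, 80, 80])
```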
U-shaped Network (U-Net), based on Fully Convolutional Network (FCN), is widely used as the backbone of medical image segmentation models, but Convolutional Neural Network (CNN) is not good at capturing long-range dependencies, which limits further performance improvement of segmentation models. To solve the above problem, researchers have applied Transformer to medical image segmentation models to make up for the deficiency of CNN, and U-shaped segmentation networks combined with Transformer have become a hot research topic. After a detailed introduction of U-Net and Transformer, the related medical image segmentation models were categorized by the position of the Transformer module: only in the encoder or decoder, in both the encoder and decoder, in the skip connections, and others. The basic contents, design concepts, and possible improvement aspects of these models were discussed, and the advantages and disadvantages of placing Transformer in different positions were analyzed. According to the analysis results, the biggest factor deciding the position of Transformer is the characteristics of the target segmentation task, and segmentation models combining Transformer with U-Net can make better use of the advantages of both CNN and Transformer to improve segmentation performance, which has great development prospects and research value.
Aiming at the problems of low detection precision, poor robustness, and imperfect related systems in current small-object detection of electric vehicle helmets, a helmet detection model based on an improved YOLOv5s algorithm was proposed. In the proposed model, the Convolutional Block Attention Module (CBAM) and Coordinate Attention (CA) module were introduced, and Distance Intersection over Union-Non-Maximum Suppression (DIoU-NMS) was used in place of the original Non-Maximum Suppression (NMS). At the same time, multi-scale feature fusion detection was added and a densely connected network was combined to improve the feature extraction effect. Finally, a helmet detection system for electric vehicle drivers was established. Compared with the original YOLOv5s on the self-built electric vehicle helmet wearing dataset, the improved YOLOv5s algorithm had the mean Average Precision (mAP) at an Intersection over Union (IoU) of 0.5 increased by 7.1 percentage points, and the Recall increased by 1.6 percentage points. Experimental results show that the improved YOLOv5s algorithm can better meet the precision requirements for detecting electric vehicles and their drivers' helmets in actual situations, and reduce the incidence of electric vehicle traffic accidents to a certain extent.
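To make the NMS replacement concrete, here is a hedged NumPy sketch of DIoU-NMS: a candidate box is suppressed only when its IoU with the top-scoring box, minus a normalized center-distance penalty, exceeds the threshold, which helps keep nearby but distinct riders that plain NMS would merge. The [x1, y1, x2, y2] box layout and the threshold value are assumptions.

```python
import numpy as np

def diou_nms(boxes, scores, thresh=0.5):
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Intersection-over-union with the current top-scoring box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Squared center distance over squared diagonal of the enclosing box.
        d2 = (cx[i] - cx[rest]) ** 2 + (cy[i] - cy[rest]) ** 2
        c2 = ((np.maximum(boxes[i, 2], boxes[rest, 2]) -
               np.minimum(boxes[i, 0], boxes[rest, 0])) ** 2 +
              (np.maximum(boxes[i, 3], boxes[rest, 3]) -
               np.minimum(boxes[i, 1], boxes[rest, 1])) ** 2 + 1e-9)
        order = rest[iou - d2 / c2 <= thresh]   # suppress only when DIoU > thresh
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(diou_nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]
```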
6 Degree of Freedom (DoF) pose estimation is a key technology in computer vision and robotics, and has become a crucial task in fields such as robot manipulation, autonomous driving, and augmented reality: estimating the 6 DoF pose of an object from a given input image, that is, 3 DoF translation and 3 DoF rotation. Firstly, the concept of 6 DoF pose was introduced, along with the problems of traditional methods based on feature point correspondence, template matching, and three-dimensional feature descriptors. Then, the current mainstream deep learning-based 6 DoF pose estimation algorithms were introduced in detail from several angles: feature correspondence-based, pixel voting-based, and regression-based methods, as well as methods oriented to multi-object instances, synthetic data, and category-level estimation. At the same time, the datasets and evaluation metrics commonly used in pose estimation were summarized, and some algorithms were evaluated experimentally to show their performance. Finally, the challenges and key future research directions of pose estimation were given.
Aiming at the problem of low accuracy of ship target detection at sea, a lightweight ship target detection algorithm, YOLOShip, was proposed on the basis of an improved YOLOv5. Firstly, dilated convolution and channel attention were introduced into the Spatial Pyramid Pooling-Fast (SPPF) module, which integrated spatial feature details of different scales, strengthened semantic information, and improved the model's ability to distinguish foreground from background. Secondly, coordinate attention and lightweight mixed depthwise convolution were introduced into the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) structures to strengthen important features in the network, obtain features with more detailed information, and improve model detection ability and positioning precision. Thirdly, considering the uneven distribution and relatively small scale changes of targets in the dataset, the model was simplified and its performance further improved by modifying the anchors and decreasing the number of detection heads. Finally, the more flexible Polynomial Loss (PolyLoss) was introduced to optimize the Binary Cross Entropy Loss (BCE Loss), improving model convergence speed and precision. Experimental results show that on the SeaShips dataset, in comparison with YOLOv5s, YOLOShip has the Precision, Recall, mAP@0.5 and mAP@0.5:0.95 increased by 4.2, 5.7, 4.6 and 8.5 percentage points respectively. Thus, the proposed algorithm obtains better detection precision while meeting detection speed requirements, effectively achieving high-speed and high-precision ship detection.
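The PolyLoss substitution admits a compact illustration. The PyTorch sketch below shows the Poly-1 form applied to BCE, where the standard loss is augmented with a tunable first-order term eps * (1 - pt); the eps value is an illustrative assumption, not the paper's tuned setting.

```python
import torch
import torch.nn.functional as F

def poly1_bce(logits, targets, eps=1.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    pt = targets * p + (1 - targets) * (1 - p)   # probability of the true label
    return (bce + eps * (1 - pt)).mean()         # Poly-1: CE plus first-order term

logits = torch.randn(4, 3)                       # e.g. objectness/class logits
targets = torch.randint(0, 2, (4, 3)).float()
print(poly1_bce(logits, targets))
```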
Aiming at the problem of missed detection of small objects in the object detection process, an improved YOLOv5 (You Only Look Once) object detection algorithm based on attention mechanism and multi-scale context information was proposed. Firstly, a Multiscale Dilated Separable Convolutional Module (MDSCM) was added to the feature extraction structure to extract multi-scale feature information, increasing the receptive field while avoiding the loss of small-object information. Secondly, the attention mechanism was added to the backbone network, and location awareness information was embedded into the channel information to further enhance the feature expression ability of the algorithm. Finally, Soft-NMS (Soft-Non-Maximum Suppression) was used instead of the NMS (Non-Maximum Suppression) used by YOLOv5 to reduce the missed detection rate of the algorithm. Experimental results show that the improved algorithm achieves detection precisions of 82.80%, 71.74% and 77.11% respectively on the PASCAL VOC dataset, the DOTA aerial image dataset and the DIOR optical remote sensing dataset, which are 3.70, 1.49 and 2.48 percentage points higher than those of YOLOv5, and it has better detection effect on small objects. Therefore, the improved YOLOv5 can be better applied to small object detection scenarios in practice.
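As a rough illustration of the multi-scale dilated separable convolution idea, the PyTorch sketch below runs parallel depthwise-separable 3x3 branches with growing dilation rates and fuses them, widening the receptive field without downsampling; the branch count, dilation rates, and fusion by 1x1 convolution are assumptions, not the paper's exact MDSCM.

```python
import torch
import torch.nn as nn

class DilatedSeparableBranch(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=dilation,
                            dilation=dilation, groups=channels)  # depthwise
        self.pw = nn.Conv2d(channels, channels, 1)               # pointwise

    def forward(self, x):
        return self.pw(self.dw(x))

class MDSCM(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            DilatedSeparableBranch(channels, d) for d in dilations)
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(multi) + x       # fuse scales, keep a residual path

x = torch.randn(1, 32, 40, 40)
print(MDSCM(32)(x).shape)                 # spatial size is preserved
```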
Aiming at the high computational complexity and large memory consumption of existing super-resolution reconstruction networks, a lightweight image super-resolution reconstruction network based on Transformer-CNN was proposed, making super-resolution reconstruction more suitable for embedded terminals such as mobile platforms. Firstly, a hybrid block based on Transformer-CNN was proposed, which enhanced the network's ability to capture local-global depth features. Then, a modified inverted residual block, with special attention to the characteristics of high-frequency regions, was designed, so that feature extraction ability was improved and inference time reduced. Finally, after exploring the best options for the activation function, the GELU (Gaussian Error Linear Unit) activation function was adopted to further improve network performance. Experimental results show that the proposed network achieves a good balance between image super-resolution performance and network complexity, and reaches an inference speed of 91 frames per second on the benchmark dataset Urban100 with a scale factor of 4, which is 11 times faster than the excellent network SwinIR (Image Restoration using Swin Transformer), indicating that the proposed network can efficiently reconstruct the textures and details of an image while significantly reducing inference time.
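Below is a minimal PyTorch sketch of an inverted residual block using GELU, in the spirit of the modified block described above: expand with a 1x1 convolution, filter with a depthwise 3x3, project back, and keep the skip connection so low-frequency content passes through unchanged. The expansion factor and activation placement are assumptions.

```python
import torch
import torch.nn as nn

class InvertedResidualGELU(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.GELU(),                # expand
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),   # depthwise
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1),                           # project
        )

    def forward(self, x):
        return x + self.block(x)          # skip keeps low-frequency content

x = torch.randn(1, 48, 64, 64)
print(InvertedResidualGELU(48)(x).shape)  # torch.Size([1, 48, 64, 64])
```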
2D/3D medical image registration is a key technology in 3D real-time navigation for orthopedic surgery. However, traditional 2D/3D registration methods based on iterative optimization require multiple iterative calculations and cannot meet doctors' requirements for real-time registration during surgery. To solve this problem, a pose regression network based on an autoencoder was proposed. In this network, geometric pose information was captured through hidden-space decoding, so that the 3D pose of the preoperative spine corresponding to the intraoperative X-ray image was quickly regressed, and the final registration image was generated through reprojection. By introducing new loss functions, the model was constrained with a “rough to fine” combined registration strategy to ensure the accuracy of pose regression. From the CTSpine1K spine dataset, 100 CT scan image sets were extracted for 10-fold cross-validation. Experimental results show that the registration image generated by the proposed model achieves a Mean Absolute Error (MAE) of 0.04 and a mean Target Registration Error (mTRE) of 1.16 mm with respect to the X-ray image, with a single-frame time of 1.7 s. Compared with traditional optimization-based methods, the proposed model greatly shortens registration time; compared with learning-based methods, it ensures high registration accuracy with fast registration. Therefore, the proposed model can meet the requirements of intraoperative real-time high-precision registration.
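The regression idea can be sketched as follows: a convolutional encoder compresses the intraoperative X-ray into a latent code, from which a small head regresses six pose parameters (3 rotations, 3 translations) that a reprojection module would turn into the registration image. Layer sizes and the 6-vector pose parameterization are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        self.pose_head = nn.Linear(latent_dim, 6)   # decode pose from the latent

    def forward(self, xray):
        z = self.encoder(xray)                      # hidden-space code
        return self.pose_head(z)                    # (rx, ry, rz, tx, ty, tz)

pose = PoseRegressor()(torch.randn(2, 1, 256, 256))
print(pose.shape)  # torch.Size([2, 6]); a DRR reprojector would consume this
```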
Aiming at the problem of low accuracy of existing cross-view image matching algorithms, an Unmanned Aerial Vehicle (UAV) image localization method based on Multi-view and Multi-supervision Network (MMNet) was proposed. Firstly, the satellite and UAV perspectives were integrated in the proposed method: global and local features were learned under a unified network architecture, then a classification network was trained and metric tasks were performed in a multi-supervision way. Specifically, MMNet mainly used the Reweighted Regularization Triplet loss (RRT) to learn global features; in this loss, reweighting and distance regularization strategies were used to solve the problems of imbalanced multi-view samples and structural disorder of the feature space. Simultaneously, in order to attend to the context information of the central building in the target location, local features were obtained by MMNet via square-ring cutting. After that, the cross-entropy loss and RRT were used to perform the classification and metric tasks respectively. Finally, the global and local features were aggregated using a weighted strategy to represent target location images. MMNet achieved Recall@1 (R@1) of 83.97% and Average Precision (AP) of 86.96% in UAV localization tasks on the currently popular UAV dataset University-1652. Experimental results show that MMNet significantly improves the accuracy of cross-view image matching, thereby enhancing the practicability of UAV image localization, compared with LCM (cross-view Matching based on Location Classification), SFPN (Salient Feature Partition Network) and other methods.
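The global-feature objective can be approximated in code. Below is a heavily hedged PyTorch sketch of a reweighted triplet loss in the general spirit of RRT, where harder positives and negatives receive larger softmax weights and a margin acts as the distance regularizer; the exact weighting and regularization of MMNet are not reproduced here.

```python
import torch
import torch.nn.functional as F

def reweighted_triplet(anchor, pos, neg, margin=0.3):
    d_pos = F.pairwise_distance(anchor, pos)   # (B,) anchor-positive distances
    d_neg = F.pairwise_distance(anchor, neg)   # (B,) anchor-negative distances
    w_pos = torch.softmax(d_pos, dim=0)        # emphasize far (hard) positives
    w_neg = torch.softmax(-d_neg, dim=0)       # emphasize near (hard) negatives
    return F.relu((w_pos * d_pos).sum() - (w_neg * d_neg).sum() + margin)

a, p, n = (torch.randn(8, 256) for _ in range(3))  # toy embedding batch
print(reweighted_triplet(a, p, n))
```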
Aiming at problems of remote sensing images such as small object size, arbitrary object orientation, and complex backgrounds, an algorithm incorporating geometric adaptation and global perception was proposed on the basis of the YOLOv5 (You Only Look Once version 5) algorithm. Firstly, deformable convolutions and adaptive spatial attention modules were stacked alternately in series through dense connections, constructing a Dense Context-Aware Module (DenseCAM) that models local geometric features while taking full advantage of different levels of semantic and location information. Secondly, by introducing Transformer at the end of the backbone network, the global perception ability of the model was enhanced at low cost and the relationships between objects and scene content were modeled. On the UCAS-AOD and RSOD datasets, compared with the YOLOv5s6 algorithm, the proposed algorithm has the mean Average Precision (mAP) increased by 1.8 and 1.5 percentage points respectively. Experimental results show that the proposed algorithm can effectively improve the precision of object detection in remote sensing images.
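One DenseCAM-style stage might look like the following PyTorch sketch, where a deformable convolution (with offsets predicted from the input) is followed by a simple spatial-attention gate and the result is densely concatenated with the input; the attention design and layer sizes are assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))   # gate by spatial saliency

class DenseCAMStage(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)  # xy offsets
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)
        self.attn = SpatialAttention()

    def forward(self, x):
        geo = self.dcn(x, self.offset(x))   # geometry-adaptive sampling
        return torch.cat([x, self.attn(geo)], dim=1)   # dense connection

x = torch.randn(1, 32, 52, 52)
print(DenseCAMStage(32)(x).shape)           # torch.Size([1, 64, 52, 52])
```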
Aiming at the low accuracy and poor robustness of traditional point cloud registration algorithms, and the resulting inability to deliver accurate radiotherapy to cancer patients before and after treatment, an Attention Dynamic Graph Convolutional Neural Network Lucas-Kanade (ADGCNNLK) algorithm was proposed. Firstly, a residual attention mechanism was added to the Dynamic Graph Convolutional Neural Network (DGCNN) to effectively utilize the spatial information of the point cloud and reduce information loss. Then, the DGCNN with residual attention was used to extract point cloud features; this process not only captured local geometric features of the point cloud while maintaining permutation invariance, but also aggregated information semantically, thereby improving registration efficiency. Finally, the extracted feature points were mapped to a high-dimensional space, and the classic image iterative registration algorithm LK (Lucas-Kanade) was used to register the nodes. Experimental results show that compared with Iterative Closest Point (ICP), Globally optimal ICP (Go-ICP) and PointNetLK, the proposed algorithm has the best registration effect with or without noise. In the noise-free case, compared with PointNetLK, the proposed algorithm has the rotation mean squared error reduced by 74.61% and the translation mean squared error reduced by 47.50%; in the noisy case, compared with PointNetLK, the rotation mean squared error is reduced by 73.13% and the translation mean squared error by 44.18%, indicating that the proposed algorithm is more robust than PointNetLK. The proposed algorithm was also applied to the registration of human point cloud models of cancer patients before and after radiotherapy, assisting doctors in treatment and enabling precise radiotherapy.
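At the core of the DGCNN feature extractor is the EdgeConv operation, sketched below in PyTorch: for every point, a k-nearest-neighbor graph is rebuilt in feature space and edge features (x_j - x_i, x_i) are aggregated by a max over neighbors, which is what gives permutation invariance. The residual attention wrapper of ADGCNNLK is omitted, and k and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

def knn(x, k):                          # x: (B, N, C)
    dist = torch.cdist(x, x)            # pairwise distances
    return dist.topk(k + 1, largest=False).indices[:, :, 1:]   # drop self

class EdgeConv(nn.Module):
    def __init__(self, in_c, out_c, k=8):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_c, out_c), nn.ReLU())

    def forward(self, x):               # (B, N, C)
        B, N, C = x.shape
        idx = knn(x, self.k)            # (B, N, k) neighbor indices
        nbrs = torch.gather(
            x.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))
        center = x.unsqueeze(2).expand_as(nbrs)
        edge = torch.cat([nbrs - center, center], dim=-1)   # edge features
        return self.mlp(edge).amax(dim=2)                   # max over neighbors

pts = torch.randn(2, 128, 3)
print(EdgeConv(3, 64)(pts).shape)       # torch.Size([2, 128, 64])
```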
Visual object tracking is one of the important tasks in computer vision. In order to achieve high-performance object tracking, a large number of object tracking methods have been proposed in recent years. Among them, Transformer-based object tracking methods have become a hot topic in the field of visual object tracking due to their ability to perform global modeling and capture contextual information. Firstly, existing Transformer-based visual object tracking methods were classified by their network structures, an overview of the underlying principles and key techniques for model improvement was given, and the advantages and disadvantages of different network structures were summarized. Then, the experimental results of Transformer-based visual object tracking methods on public datasets were compared to analyze the impact of network structure on performance; among them, MixViT-L (ConvMAE) achieved tracking success rates of 73.3% and 86.1% on LaSOT and TrackingNet respectively, showing that object tracking methods based on a pure-Transformer two-stage architecture have better performance and broader development prospects. Finally, the limitations of these methods, such as complex network structures, large numbers of parameters, high training requirements, and difficulty of deployment on edge devices, were summarized, and future research directions were discussed: by combining model compression, self-supervised learning, and Transformer interpretability analysis, more feasible solutions for Transformer-based visual object tracking can be developed.
Aiming at the problem that existing Handwritten Mathematical Expression Recognition (HMER) methods reduce image resolution and lose feature information after multiple pooling operations in the Convolutional Neural Network (CNN), leading to parsing errors, an encoder-decoder model for HMER based on attention mechanism was proposed. Firstly, a Densely connected convolutional Network (DenseNet) was used as the encoder, so that dense connections were used to enhance feature extraction, promote gradient propagation, and alleviate gradient vanishing. Secondly, a Gated Recurrent Unit (GRU) was used as the decoder, with an attention mechanism introduced, so that attention was allocated to different regions of the image to realize accurate symbol recognition and structural analysis. Finally, handwritten mathematical expression images were encoded, and the encoding results were decoded into LaTeX sequences. Experimental results on the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) dataset show that the proposed model achieves a recognition rate of 40.39%, and, within allowable error ranges of three levels, recognition rates of 52.74%, 58.82% and 62.98% respectively. Compared with the Bidirectional Long Short-Term Memory (BLSTM) network model, the proposed model increases the recognition rate by 3.17 percentage points, and within the allowable error ranges of three levels, by 8.52, 11.56 and 12.78 percentage points respectively. It can be seen that the proposed model can accurately parse handwritten mathematical expression images, generate LaTeX sequences, and improve the recognition rate.
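One decoding step of the attention-equipped GRU can be sketched as follows: the previous hidden state is scored against every position of the flattened DenseNet feature map, the softmax weights pool the map into a context vector, and the GRU cell then predicts the next LaTeX symbol. All dimensions and the additive scoring form are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnGRUDecoderStep(nn.Module):
    def __init__(self, feat_c=256, hid=128, vocab=120):
        super().__init__()
        self.score = nn.Linear(feat_c + hid, 1)
        self.gru = nn.GRUCell(feat_c, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, feats, h):        # feats: (B, L, C) flattened feature map
        B, L, _ = feats.shape
        e = self.score(torch.cat([feats, h.unsqueeze(1).expand(B, L, -1)], -1))
        alpha = torch.softmax(e, dim=1)             # where to look in the image
        context = (alpha * feats).sum(dim=1)        # (B, C) attended context
        h = self.gru(context, h)
        return self.out(h), h                       # logits over LaTeX symbols

feats = torch.randn(2, 14 * 14, 256)                # encoder output, flattened
h = torch.zeros(2, 128)
logits, h = AttnGRUDecoderStep()(feats, h)
print(logits.shape)                                 # torch.Size([2, 120])
```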
In order to generate more accurate and smooth virtual human animation, a Kinect device was used to capture 3D human body pose data, while a monocular 3D human pose estimation algorithm simultaneously inferred skeleton points from the Kinect's color stream, thereby optimizing the human pose estimation in real time and driving the virtual character model to generate animation. Firstly, a spatio-temporal optimization method for skeleton point data processing was proposed to improve the stability of monocular 3D human pose estimation. Secondly, a human pose estimation method based on the fusion of Kinect and the Occlusion-Robust Pose-Maps (ORPM) algorithm was proposed to solve the occlusion problem of Kinect. Finally, a virtual human animation system based on quaternion vector interpolation and inverse kinematics constraints was developed, capable of motion simulation and real-time animation generation. Compared with the animation generation method that only uses Kinect to capture human motion, the proposed method produces more robust human body estimation data and has a certain anti-occlusion ability. The animation frame rate of this method is twice that of the ORPM-based animation generation method, so the animation generated by the proposed method is more realistic and smooth.
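Quaternion interpolation between captured poses is typically done with spherical linear interpolation (slerp), sketched below in NumPy: it moves along the unit sphere at constant angular speed, avoiding the distortion of naively lerping joint rotations. The (w, x, y, z) component order is an assumption.

```python
import numpy as np

def slerp(q0, q1, t):
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)        # angle between the two unit quaternions
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

q_a = np.array([1.0, 0.0, 0.0, 0.0])                              # identity
q_b = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4), 0.0, 0.0])  # 90 deg about x
print(slerp(q_a, q_b, 0.5))                                       # halfway rotation
```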
Infrared small targets occupy few pixels and lack features such as color, texture and shape, making them difficult to track effectively. To solve this problem, an infrared small target tracking method based on state information was proposed. Firstly, the target, background and distractors in the local area of the small target to be detected were encoded to obtain dense local state information between consecutive frames. Secondly, the feature information of the current and previous frames was input into the classifier to obtain the classification score. Thirdly, the state information and the classification score were fused to obtain the final confidence and determine the center position of the small target to be detected. Finally, the state information was updated and propagated between consecutive frames, and the propagated state information was used to track the infrared small target throughout the entire sequence. The proposed method was validated on the open dataset DIRST (Dataset for Infrared detection and tRacking of dim-Small aircrafT). Experimental results show that for infrared small target tracking, the recall of the proposed method reaches 96.2% and its precision reaches 97.3%, each 3.7% higher than those of the current best tracking method, KeepTrack. This proves that the proposed method can effectively track infrared small targets under complex backgrounds and interference.
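The fusion step can be illustrated with a hedged NumPy sketch: a per-pixel state map propagated from previous frames is combined with the classifier's response map, and the peak of the fused map gives the target center. The linear fusion rule and its weight are illustrative assumptions; the paper's actual fusion may differ.

```python
import numpy as np

def fuse_and_locate(class_score, state_map, w=0.5):
    """class_score, state_map: (H, W) maps in [0, 1]."""
    confidence = (1 - w) * class_score + w * state_map   # fused confidence map
    cy, cx = np.unravel_index(np.argmax(confidence), confidence.shape)
    return (cx, cy), confidence                          # peak = target center

score = np.random.rand(64, 64)       # classifier response for the current frame
state = np.random.rand(64, 64)       # encodes target/background/distractor state
center, conf = fuse_and_locate(score, state)
print(center)
```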