Cross-view scene matching refers to discovering images of the same geographical target taken from different platforms (such as drones and satellites). However, differences between imaging platforms lead to low accuracy in UAV (Unmanned Aerial Vehicle) positioning and navigation tasks, and the existing methods usually focus on only a single dimension of the image while ignoring its multi-dimensional features. To solve the above problems, a GAMF (Global Attention and Multi-granularity feature Fusion) deep neural network was proposed to improve feature representation and feature distinguishability. Firstly, the images from the UAV perspective and the satellite perspective were combined, three branches were extended under a unified network architecture, and the spatial location, channel and local features of the images were extracted from three dimensions. Then, by establishing the SGAM (Spatial Global relationship Attention Module) and CGAM (Channel Global Attention Module), the spatial global relationship mechanism and the channel attention mechanism were introduced to capture global information for better attention learning. Secondly, in order to fuse local perception features, a local division strategy was introduced to improve the model's ability to extract fine-grained features. Finally, the features of the three dimensions were combined as the final features to train the model. Test results on the public dataset University-1652 show that the AP (Average Precision) of the GAMF model on the UAV visual positioning task reaches 87.41%, and the Recall (R@1) on the UAV visual navigation task reaches 90.30%, which verifies that the GAMF model can effectively aggregate multi-dimensional image features and improve the accuracy of UAV positioning and navigation tasks.
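The local division strategy above is described only at a high level. As a rough illustration, the following PyTorch sketch partitions a backbone feature map into horizontal stripes and pools each stripe into a part-level descriptor; the stripe-based split, module name and part count are illustrative assumptions, not the paper's actual partition scheme.

```python
import torch
import torch.nn as nn

class LocalPartition(nn.Module):
    """Illustrative local division strategy: split the feature map into
    horizontal stripes and pool each stripe into its own part descriptor."""
    def __init__(self, num_parts=4):
        super().__init__()
        self.num_parts = num_parts
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, feat):
        # feat: (B, C, H, W) -> list of num_parts descriptors of shape (B, C)
        stripes = torch.chunk(feat, self.num_parts, dim=2)  # split along height
        return [self.pool(s).flatten(1) for s in stripes]
```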
A method based on Siamese network and Transformer was proposed to address the low accuracy problem of infrared dim small target tracking. First, a multi-feature extraction cascading module was constructed to separately extract the deep features of the infrared dim small target template frame and the search frame, and concatenate them with their corresponding HOG features along the channel dimension. Second, a multi-head attention Transformer was introduced to perform cross-correlation operations between the template feature map and the search feature map, generating a response map. Finally, the target's center position in the image and the regression bounding box were obtained through the response map upsampling network and the bounding box prediction network to complete the tracking of infrared dim small targets. Test results on a dataset of 13 655 infrared images show that, compared with the KeepTrack tracking method, the success rate is improved by 5.9 percentage points and the precision is improved by 1.8 percentage points; compared with the TransT (Transformer Tracking) method, the success rate is improved by 14.2 percentage points and the precision is improved by 14.6 percentage points. The proposed method is thus proved to be more accurate in tracking infrared dim small targets against complex backgrounds.
At present, most accelerated Magnetic Resonance Imaging (MRI) reconstruction algorithms reconstruct undersampled amplitude images and use real-valued convolution for feature extraction, without considering that MRI data itself is complex, which limits the feature extraction ability for MRI complex data. In order to improve the feature extraction ability for single-slice MRI complex data, and thus reconstruct single-slice MRI images with clearer details, a Complex Convolution Dual-Domain Cascade Network (ComConDuDoCNet) was proposed. The original undersampled MRI data was used as input, and Residual Feature Aggregation (RFA) blocks were used to alternately extract dual-domain features of the MRI data, ultimately reconstructing Magnetic Resonance (MR) images with clear texture details. Complex convolution was used as the feature extractor of each RFA block. Different domains were cascaded through the Fourier transform or its inverse, and a data consistency layer was added to achieve data fidelity. A large number of experiments were conducted on a publicly available knee joint dataset. The comparison results with the Dual-task Dual-domain Network (DDNet) under three different sampling masks with a sampling rate of 20% show that: under the two-dimensional Gaussian sampling mask, the proposed algorithm decreases Normalized Root Mean Square Error (NRMSE) by 13.6%, increases Peak Signal-to-Noise Ratio (PSNR) by 4.3%, and increases Structural SIMilarity (SSIM) by 0.8%; under the Poisson sampling mask, it decreases NRMSE by 11.0%, increases PSNR by 3.5%, and increases SSIM by 0.1%; under the radial sampling mask, it decreases NRMSE by 12.3%, increases PSNR by 3.8%, and increases SSIM by 0.2%. The experimental results show that ComConDuDoCNet, combining complex convolution and dual-domain learning, can reconstruct MR images with clearer details and more realistic visual effects.
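For readers unfamiliar with complex-valued convolution, the following PyTorch sketch shows one common way to build it from two real-valued convolutions, treating the real and imaginary parts as separate tensors. The class name and hyperparameters are illustrative assumptions, not the ComConDuDoCNet implementation.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Illustrative complex-valued convolution: (Wr + i*Wi) * (xr + i*xi)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x_real, x_imag):
        # Real part: Wr*xr - Wi*xi ; imaginary part: Wr*xi + Wi*xr
        y_real = self.conv_r(x_real) - self.conv_i(x_imag)
        y_imag = self.conv_r(x_imag) + self.conv_i(x_real)
        return y_real, y_imag

# Toy usage on complex image data stored as two real tensors (B, C, H, W)
conv = ComplexConv2d(1, 8)
xr, xi = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
yr, yi = conv(xr, xi)
```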
For radial angulation deformity, it is difficult to accurately locate the osteotomy position by experience alone, so a three-Dimensional (3D) automatic planning algorithm for radial angulation wedge osteotomy was proposed to accurately determine the specific osteotomy position and calculate the optimal reset angle. Firstly, the contralateral radius mirror model with compensation difference was used as the reference template to calculate the bone deformity area. Secondly, the distal radius joint was registered based on the weights of the joint anatomical area to create the rotation axis direction vector, and the deformity contour curve in the XOZ plane was solved by cubic spline interpolation to determine the orientation of the rotation axis. Finally, a single-objective optimization algorithm was used for iterative optimization to calculate the optimal osteotomy position and reset angle, and the preoperative plan of the wedge osteotomy was generated automatically. Six cases of radial angulation deformity were selected, and the registration accuracy of the joint anatomical area was compared in 3D space with the surgeon's manual osteotomy planning method as the experimental control group. Experimental results show that, compared with manual osteotomy and reset by surgeons as proposed by Miyake et al., the Root Mean Square Error (RMSE) of the registration of the joint anatomical area obtained by the proposed algorithm is decreased by 0.09 to 0.42 mm; compared with the automatic planning method proposed by Fürnstahl et al., the proposed algorithm can clarify the type of wedge and has higher clinical feasibility.
Ring artifact is one of the most common artifacts in various types of CT (Computed Tomography) images, and is usually caused by the inconsistent response of detector pixels to X-rays. Effective removal of ring artifacts, a necessary step in CT image reconstruction, greatly improves the quality of CT images and enhances the accuracy of later diagnosis and analysis. Therefore, methods of ring artifact removal (also known as ring artifact correction) were systematically reviewed. Firstly, the manifestations and causes of ring artifacts were introduced, and commonly used datasets and algorithm libraries were given. Secondly, ring artifact removal methods were introduced in three categories. The first category was based on detector calibration. The second category was based on analytical and iterative solutions, including projection data preprocessing, CT image reconstruction and CT image post-processing. The last category was based on deep learning methods such as convolutional neural networks and generative adversarial networks. The principle, development process, advantages and limitations of each method were analyzed. Finally, the technical bottlenecks of existing ring artifact removal methods in terms of robustness, dataset diversity and model construction were summarized, and possible solutions were prospected.
The monocular vision-based vehicle 3D detection methods commonly used at present combine object detection with geometric constraints. However, the position of the vanishing point in the geometric constraints has a significant impact on the results. To obtain more accurate constraint conditions, a 3D vehicle detection algorithm based on horizon line detection was proposed. First, the relative position of the vanishing point was obtained using the vehicle image, and the vehicle image was preprocessed to an appropriate size. Then, the preprocessed vehicle image was fed into a vanishing point detection network to obtain a set of heatmaps indicating the vanishing point information; the vanishing point information was regressed, and the horizon line information was calculated. Finally, geometric constraints were constructed based on the horizon line information, and the initial dimensions of the vehicle were iteratively optimized within the constrained space to calculate the precise 3D information of the vehicle. The experimental results demonstrate that the proposed horizon line solving algorithm obtains more accurate horizon lines, with an AUC (Area Under Curve) improvement of 1.730 percentage points over the random forest method. At the same time, the introduced horizon line constraint effectively restricts the 3D vehicle information, resulting in an average precision improvement of 2.201 percentage points compared to the algorithm using diagonal and vanishing point constraints. It can be seen that the horizon line serves as an effective geometric constraint for solving vehicle 3D information from roadside monocular camera perspectives.
Aiming at the problems of detail information loss and low segmentation accuracy in the segmentation of day and night ground-based cloud images, a segmentation network called CloudResNet-UNetwork (CloudRes-UNet) for day and night ground-based cloud images based on improved Res-UNet (Residual network-UNetwork) was proposed, in which the overall encoder-decoder network structure was adopted. Firstly, ResNet50 was used by the encoder to extract features, enhancing the feature extraction ability. Then, a Multi-Stage feature extraction (Multi-Stage) module was designed, which combined group convolution, dilated convolution and channel shuffle to obtain high-intensity semantic information. Secondly, an Efficient Channel Attention Network (ECA-Net) module was added to focus on the important information in the channel dimension, strengthen the attention to the cloud regions in the ground-based cloud image, and improve the segmentation accuracy. Finally, bilinear interpolation was used by the decoder to upsample the features, which improved the clarity of the segmented image and reduced the loss of object and position information. The experimental results show that, compared with the state-of-the-art deep-learning-based ground-based cloud image segmentation network Cloud-UNetwork (Cloud-UNet), the segmentation accuracy of CloudRes-UNet on the day and night ground-based cloud image segmentation dataset is increased by 1.5 percentage points, and the Mean Intersection over Union (MIoU) is increased by 1.4 percentage points, which indicates that CloudRes-UNet obtains cloud information more accurately. It has positive significance for weather forecasting, climate research, photovoltaic power generation and so on.
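The ECA-Net module mentioned above is a published lightweight channel attention design; a minimal PyTorch sketch of the standard ECA formulation (global average pooling, a 1D convolution over the channel descriptor, and a sigmoid gate) is given below for reference. The kernel size is an assumed default, not necessarily the value used in CloudRes-UNet.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP + 1D conv across channels + sigmoid gate."""
    def __init__(self, k_size=3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W) -> channel descriptor (B, C, 1, 1)
        y = self.avg_pool(x)
        # treat the channels as a 1D sequence: (B, 1, C)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
        return x * y  # re-weight channels
```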
The use of contextual information plays an important role in speech enhancement tasks. To address the under-utilization of global speech information, a Gated Dilated Convolutional Recurrent Network (GDCRN) for complex spectral mapping was proposed. GDCRN was composed of an encoder, a Gated Temporal Convolution Module (GTCM) and a decoder, where the encoder and decoder had an asymmetric network structure. Firstly, features were processed by the encoder using a Gated Dilated Convolution Module (GDCM), which expanded the receptive field. Secondly, longer contextual information was captured and selectively passed through the GTCM. Finally, deconvolution combined with a Gated Linear Unit (GLU) was used by the decoder, which was connected to the corresponding convolution layers in the encoder using skip connections. Additionally, a Channel Time-Frequency Attention (CTFA) mechanism was introduced. Experimental results show that the proposed network has fewer parameters and shorter training time than other networks such as Temporal Convolutional Neural Network (TCNN) and Gated Convolutional Recurrent Network (GCRN). The proposed GDCRN improves PESQ (Perceptual Evaluation of Speech Quality) by 0.258 9 and STOI (Short-Time Objective Intelligibility) by 4.67 percentage points, demonstrating that the proposed network has a better enhancement effect and stronger generalization ability.
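As a rough illustration of the gated dilated convolution idea used in the GDCM, the sketch below combines a dilated convolution with a sigmoid gate in the style of a gated linear unit; the channel count, dilation rate and 1D layout are assumptions for illustration only, not the GDCRN configuration.

```python
import torch
import torch.nn as nn

class GatedDilatedConv(nn.Module):
    """Gated dilated 1D conv: one branch produces features, the other a sigmoid gate."""
    def __init__(self, channels, kernel_size=3, dilation=2):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.feat = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)

    def forward(self, x):
        # x: (B, C, T); dilation enlarges the receptive field along time
        return torch.tanh(self.feat(x)) * torch.sigmoid(self.gate(x))
```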
The existing image aesthetic quality evaluation methods widely use Convolutional Neural Networks (CNNs) to extract image features. Limited by the local receptive field mechanism, it is difficult for CNNs to extract global features from a given image, resulting in the absence of aesthetic attributes such as global composition relations and global color matching. In order to solve this problem, an image aesthetic quality evaluation method based on an SSViT (Self-Supervised Vision Transformer) model was proposed. The self-attention mechanism was utilized to establish long-distance dependencies among local patches of the image, to adaptively learn their correlations, and to extract global features characterizing the aesthetic attributes. Meanwhile, three aesthetic-quality-perceiving tasks, namely classifying image degradation, ranking image aesthetic quality, and reconstructing image semantics, were designed to pre-train the vision Transformer in a self-supervised manner using unlabeled image data, so as to enhance the representation of global features. The experimental results on the AVA (Aesthetic Visual Assessment) dataset show that the SSViT model achieves 83.28%, 0.763 4 and 0.746 2 on evaluation accuracy, Pearson Linear Correlation Coefficient (PLCC) and SRCC (Spearman Rank-order Correlation Coefficient), respectively. These results demonstrate that the SSViT model achieves higher accuracy in image aesthetic quality evaluation.
Combining event cameras with traditional cameras for vehicle target detection can not only solve the problems of over-exposure, under-exposure, and motion blur that traditional cameras suffer in high dynamic range scenes, but also address the low detection accuracy caused by the missing texture information of event cameras. However, existing fusion algorithms often suffer from high computational complexity, loss of feature information, and poor fusion results. To solve the above problems, a vehicle target detection algorithm that effectively fuses event cameras and conventional cameras was proposed. Firstly, a spatio-temporal event representation based on Event Frequency (EF) and Time Surface (TS) was proposed to encode event data into event frames. Then, a Feature fusion module based on Channel and Spatial Attention mechanism (FCSA) was proposed to perform feature-level fusion of image frames and event frames. Finally, the prior boxes were optimized by using a differential evolution search algorithm to further improve vehicle detection performance. In addition, due to the lack of public datasets containing both image frames and event data, a vehicle detection dataset MVSEC-CAR was established. The experimental results show that, on the public PKU-DDD17-CAR dataset, the mean Average Precision (mAP) of the proposed algorithm is 2.6 percentage points higher than that of the second best ADF (Attention fusion Detection Framework), and a higher frame rate is achieved, effectively improving the accuracy of vehicle target detection and the robustness to lighting, which validates the effectiveness of the proposed event representation, feature fusion, and prior box optimization.
The existing image tamper detection networks based on deep learning often suffer from low detection accuracy and weak transferability. To address these issues, a two-channel progressive feature filtering network was proposed. Two channels were used to extract the two-domain features of the image in parallel: one channel was used to extract the shallow and deep features of the image spatial domain, and the other was used to extract the feature distribution of the image noise domain. At the same time, a progressive subtle feature screening mechanism was used to filter redundant features and gradually locate the tampered regions. In order to extract the tamper mask more accurately, a two-channel subtle feature extraction module was proposed, which combined the subtle features of the spatial domain and the noise domain to generate a more accurate tamper mask. During the decoding process, the localization ability of the network for tampered regions was improved by fusing filtered features of different scales and the contextual information of the network. The experimental results show that, in terms of detection and localization, compared with the existing advanced tamper detection networks ObjectFormer, Multi-View multi-Scale Supervision Network (MVSS-Net) and Progressive Spatio-Channel Correlation Network (PSCC-Net), the F1 score of the proposed network is increased by 10.4, 5.9 and 12.9 percentage points respectively on the CASIA V2.0 dataset; when faced with Gaussian low-pass filtering, Gaussian noise, and JPEG compression attacks, compared with Manipulation Tracing Network (ManTra-Net) and Spatial Pyramid Attention Network (SPAN), the Area Under Curve (AUC) of the proposed network is increased by at least 10.0 and 5.4 percentage points respectively. It is verified that the proposed network can effectively solve the problems of low detection accuracy and poor transferability in tamper detection.
Current Video Super-Resolution (VSR) algorithms cannot fully utilize inter-frame information at different distances when processing complex scenes with large motion amplitudes, making it difficult to accurately recover occlusions, boundaries, and multi-detail regions. A VSR model based on frame-straddling optical flow was proposed to solve these problems. Firstly, shallow features of Low-Resolution (LR) frames were extracted through Residual Dense Blocks (RDBs). Then, motion estimation and compensation were performed on video frames using a Spatial Pyramid Network (SPyNet) with straddling optical flows of different time lengths, and deep feature extraction and correction were performed on the inter-frame information through multiple connected RDBs. Finally, the shallow and deep features were fused, and High-Resolution (HR) frames were obtained through up-sampling. The experimental results on the REDS4 public dataset show that, compared with the deep Video Super-Resolution network using Dynamic Upsampling Filters without explicit motion compensation (DUF-VSR), the proposed model improves Peak Signal-to-Noise Ratio (PSNR) and Structure Similarity Index Measure (SSIM) by 1.07 dB and 0.06, respectively. The experimental results show that the proposed model can effectively improve the quality of video image reconstruction.
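Motion compensation with estimated optical flow is typically implemented by warping a neighboring frame toward the reference frame. The following PyTorch sketch shows such a warp using grid_sample; it is a generic illustration under assumed tensor layouts, not the SPyNet-based pipeline of the proposed model.

```python
import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    """Warp a neighboring frame to the reference frame using optical flow.
    frame: (B, C, H, W); flow: (B, 2, H, W) giving (dx, dy) per pixel."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
    new_pos = grid.unsqueeze(0) + flow                             # shifted coordinates
    # normalize coordinates to [-1, 1] for grid_sample
    new_x = 2.0 * new_pos[:, 0] / max(w - 1, 1) - 1.0
    new_y = 2.0 * new_pos[:, 1] / max(h - 1, 1) - 1.0
    grid_n = torch.stack((new_x, new_y), dim=-1)                   # (B, H, W, 2)
    return F.grid_sample(frame, grid_n, mode='bilinear', align_corners=True)
```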
In lung nodule detection, existing single-stage object detection algorithms are insensitive to nodules, the multiple up-sampling operations performed during feature extraction by Convolutional Neural Networks (CNNs) make features difficult to extract and degrade the detection effect, and existing pulmonary nodule detection models are complex and not conducive to practical deployment and implementation. To address the above problems, a real-time pulmonary nodule detection algorithm combining an attention mechanism and multipath fusion was proposed, in which the up-sampling algorithm was improved to effectively increase the detection accuracy of lung nodules and the speed of model inference, while keeping the model small and easy to deploy. Firstly, a hybrid channel and spatial attention mechanism was fused into the feature extraction backbone network. Secondly, the up-sampling algorithm was improved to enhance the quality of the generated feature maps. Finally, channels were established between different paths in the enhanced feature extraction network to fuse deep and shallow features, so that semantic and location information at different scales was fused. Experimental results on the LUNA16 dataset show that, compared with the original YOLOv5s algorithm, the proposed algorithm achieves improvements of 9.5, 6.9, and 8.7 percentage points in precision, recall, and average precision, respectively, with a frame rate of 131.6 frames/s and a model weight file of only 14.2 MB, demonstrating that the proposed algorithm can detect lung nodules in real time with much higher accuracy than existing single-stage detection algorithms such as YOLOv3 and YOLOv8.
Visual object tracking is one of the important tasks in computer vision. In order to achieve high-performance object tracking, a large number of object tracking methods have been proposed in recent years. Among them, Transformer-based object tracking methods have become a hot topic in the field due to their ability to perform global modeling and capture contextual information. Firstly, existing Transformer-based visual object tracking methods were classified according to their network structures, the underlying principles and key techniques for model improvement were expounded, and the advantages and disadvantages of different network structures were summarized. Then, the experimental results of Transformer-based visual object tracking methods on public datasets were compared to analyze the impact of network structure on performance; among them, MixViT-L (ConvMAE) achieved tracking success rates of 73.3% and 86.1% on LaSOT and TrackingNet respectively, showing that object tracking methods based on a pure Transformer two-stage architecture have better performance and broader development prospects. Finally, the limitations of these methods, such as complex network structures, large numbers of parameters, high training requirements, and difficulty of deployment on edge devices, were summarized, and future research directions were discussed: by combining model compression, self-supervised learning, and Transformer interpretability analysis, more feasible solutions for Transformer-based visual object tracking could be presented.
A Multi-YOLOv5 method based on YOLOv5 was proposed for vehicle multi-attribute classification to address the insufficient ability of convolutional networks to extract fine-grained image features and their inability to recognize dependencies between multiple attributes in image classification tasks. A collaborative working mechanism of Multi-head Non-Maximum Suppression (Multi-NMS) and a separable label loss (Separate-Loss) function was designed to complete the multi-attribute classification task for vehicles. Additionally, the YOLOv5 detection model was reconstructed by using the Convolutional Block Attention Module (CBAM), Shuffle Attention (SA), and CoordConv methods to enhance the ability to extract multi-attribute features, strengthen the correlation between different attributes, and enhance the network's perception of positional information, thereby improving the accuracy of the model in multi-attribute classification of objects. Finally, training and testing were conducted on datasets such as VeRi. Experimental results demonstrate that the Multi-YOLOv5 approach achieves superior recognition results in multi-attribute classification of objects compared to network architectures including GoogLeNet, Residual Network (ResNet), EfficientNet, and Vision Transformer (ViT). The mean Average Precision (mAP) of Multi-YOLOv5 reaches 87.37% on the VeRi dataset, an improvement of 4.47 percentage points over the best-performing method mentioned above. Moreover, Multi-YOLOv5 exhibits better robustness than the original YOLOv5 model, thus providing reliable data for traffic object perception in dense environments.
Gliomas are the most common primary cranial tumors, arising from cancerous changes in the glia of the brain and spinal cord, with a high proportion of malignant cases and a significant mortality rate. Quantitative segmentation and grading of gliomas based on Magnetic Resonance Imaging (MRI) images is the main method for their diagnosis and treatment. To improve the segmentation accuracy and speed for gliomas, a 3D-Ghost Convolutional Neural Network (CNN)-based MRI image segmentation algorithm for gliomas, called 3D-GA-Unet, was proposed. 3D-GA-Unet was built based on 3D U-Net (3D U-shaped Network). A 3D-Ghost CNN block was designed to increase the useful output and reduce the redundant features of traditional CNNs by using linear operations. A Coordinate Attention (CA) block was added, which helped to obtain more image information favorable to segmentation accuracy. The model was trained and validated on the publicly available glioma dataset BraTS2018. The experimental results show that 3D-GA-Unet achieves average Dice Similarity Coefficients (DSCs) of 0.863 2, 0.847 3 and 0.803 6 and average sensitivities of 0.867 6, 0.949 2 and 0.831 5 for Whole Tumor (WT), Tumor Core (TC), and Enhanced Tumor (ET) in glioma segmentation. It is verified that 3D-GA-Unet can accurately segment glioma images and further improve the segmentation efficiency, which is of positive significance for the clinical diagnosis of gliomas.
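The 3D-Ghost block follows the GhostNet idea of generating part of the output channels with cheap operations. A minimal 3D PyTorch sketch of this idea is shown below; the normalization choice, ratio and kernel sizes are assumptions, not the exact 3D-GA-Unet configuration.

```python
import torch
import torch.nn as nn

class Ghost3D(nn.Module):
    """Ghost-style 3D conv: a small primary conv plus cheap depthwise ops,
    concatenated to form the full output (reduces redundant feature maps)."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel_size=3, cheap_kernel=3):
        super().__init__()
        primary_ch = out_ch // ratio
        cheap_ch = out_ch - primary_ch
        self.primary = nn.Sequential(
            nn.Conv3d(in_ch, primary_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.InstanceNorm3d(primary_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv3d(primary_ch, cheap_ch, cheap_kernel, padding=cheap_kernel // 2,
                      groups=primary_ch, bias=False),
            nn.InstanceNorm3d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)                            # "intrinsic" feature maps
        return torch.cat([y, self.cheap(y)], dim=1)    # append cheap "ghost" maps
```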
A Six-Degrees-of-freedom (6D) object pose estimation algorithm based on a filter learning network was proposed to address the accuracy and real-time performance of pose estimation for weakly textured objects in complex scenes. Firstly, standard convolutions were replaced with Blueprint Separable Convolutions (BSConv) to reduce the model parameters, and GeLU (Gaussian error Linear Unit) activation functions were used to better approximate the normal distribution, thereby improving the performance of the network model. Secondly, an Upsampling Filtering And Encoding information Module (UFAEM) was proposed to compensate for the loss of key upsampling information. Finally, a Global Attention Mechanism (GAM) was proposed to increase contextual information and more effectively extract information from the input feature maps. The experimental results on the publicly available LineMOD, YCB-Video, and Occlusion LineMOD datasets show that the proposed algorithm significantly reduces the network parameters while improving accuracy: the network parameter count is reduced by nearly three quarters. Using the ADD(-S) metric, the accuracy of the proposed algorithm is improved by about 1.2 percentage points compared to the Dual-Stream algorithm on the LineMOD dataset, by about 5.2 percentage points compared to the DenseFusion algorithm on the YCB-Video dataset, and by about 6.6 percentage points compared to the Pixel-wise Voting Network (PVNet) algorithm on the Occlusion LineMOD dataset. These results show that the proposed algorithm performs excellently in estimating the pose of weakly textured objects and has a certain degree of robustness for estimating the pose of occluded objects.
When using the slicing method to measure the point cloud volumes of irregular objects, the existing Polygon Splitting and Recombination (PSR) algorithm cannot correctly split contours that are close to each other, resulting in low calculation precision. Aiming at this problem, a multi-contour segmentation algorithm, the Improved Nearest Point Search (INPS) algorithm, was proposed. Firstly, multiple contours were segmented following the single-use principle of local points. Then, the Point Inclusion in Polygon (PIP) algorithm was adopted to judge the inclusion relationship between contours, thereby determining whether each contour area was positive or negative. Finally, the slice area was multiplied by the slice thickness and the results were accumulated to obtain the volume of the irregular object point cloud. Experimental results show that, on two public point cloud datasets and one point cloud dataset of chemical electron density isosurfaces, the proposed algorithm achieves high-accuracy boundary segmentation and has certain universality. The average relative error of volume measurement of the proposed algorithm is 0.043 6%, lower than the 0.062 7% of the PSR algorithm.
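To make the slicing-based volume computation concrete, the following Python sketch accumulates signed polygon areas (shoelace formula) over slices, using a ray-casting point-in-polygon test to decide whether a contour is a hole. Testing only one vertex per contour is a simplification, and the function names are illustrative rather than the INPS implementation.

```python
import numpy as np

def polygon_area(poly):
    """Shoelace formula for a closed 2D contour given as an (N, 2) array."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def point_in_polygon(pt, poly):
    """Ray-casting PIP test: count crossings of a horizontal ray from pt."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, np.roll(poly, -1, axis=0)):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def slice_volume(contours_per_slice, thickness):
    """Sum signed slice areas times thickness; a contour enclosed by another
    contour of the same slice is treated as a hole (negative area)."""
    volume = 0.0
    for contours in contours_per_slice:
        for i, c in enumerate(contours):
            enclosures = sum(point_in_polygon(c[0], other)
                             for j, other in enumerate(contours) if j != i)
            sign = -1.0 if enclosures % 2 else 1.0
            volume += sign * polygon_area(c) * thickness
    return volume
```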
Aiming at the problems of limited representation of spectrogram feature correlation information and unsatisfactory denoising effect in existing speech enhancement methods, a speech enhancement method based on a Double Complex Convolution and Attention Aggregating Recurrent Network (DCCARN) was proposed. Firstly, a double complex convolutional network was established to encode the two-branch information of the spectrogram features after the short-time Fourier transform. Secondly, the codes of the two branches were fed into inter- and intra-feature-block attention mechanisms respectively, and different speech feature information was re-labeled. Thirdly, the long-term sequence information was processed by a Long Short-Term Memory (LSTM) network, and the spectrogram features were restored and aggregated by two decoders. Finally, the target speech waveform was generated by the inverse short-time Fourier transform to suppress noise. Experiments were carried out on the public dataset VBD (Voice Bank+DMAND) and the noise-added dataset TIMIT. The results show that, compared with the phase-aware Deep Complex Convolution Recurrent Network (DCCRN), DCCARN increases the Perceptual Evaluation of Speech Quality (PESQ) by 0.150 and by 0.077 to 0.087, respectively. It is verified that the proposed method can capture the correlation information of spectrogram features more accurately, suppress noise more effectively, and improve speech intelligibility.
In view of the occlusion and lack of texture details of infrared targets in road scenes, which lead to false detections and missed detections, a lightweight YOLO (You Only Look Once) model for infrared road scene detection based on Multi-Scale and weighted Coordinate attention (MSC-YOLO) was proposed, with YOLOv7-tiny as the baseline model. Firstly, a multi-scale pyramid module PSA (Pyramid Split Attention) was introduced into different intermediate feature layers of MobileNetV3, and a lightweight backbone network MSM-Net (Multi-Scale Mobile Network) for multi-scale feature extraction was designed to solve the feature pollution caused by fixed-size convolution kernels and improve the fine-grained extraction ability for targets of different scales. Secondly, a Weighted Coordinate Attention (WCA) mechanism was integrated into the feature fusion network, and the target position information obtained from the vertical and horizontal spatial directions of the intermediate feature maps was superimposed to enhance the fusion of target features in different dimensions. Finally, the positioning loss function was replaced with the Efficient Intersection over Union (EIoU) loss, which calculates the width and height influence factors of the predicted box and the ground-truth box separately, accelerating convergence. Verification experiments were carried out on the FLIR dataset. Compared with the YOLOv7-tiny model, the number of parameters is reduced by 67.3%, the number of floating-point operations is reduced by 54.6%, and the model size is reduced by 60.5%, while mAP(IoU=0.5) (mean Average Precision (IoU=0.5)) is reduced by only 0.7 percentage points. The Frames Per Second (FPS) reaches 101 on an RTX 2080Ti, achieving a balance between detection performance and lightweight design, and meeting the real-time detection requirements of infrared road scenes.
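The EIoU loss referenced above is a published extension of IoU-based box regression that penalizes center distance and width/height differences separately. A minimal PyTorch sketch of the standard EIoU formulation is given below; it illustrates the loss itself, not the MSC-YOLO training code.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # intersection and union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # smallest enclosing box
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    cw, ch = ex2 - ex1, ey2 - ey1

    # center-distance, width and height penalty terms
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 + \
           ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])
    return 1 - iou + rho2 / (cw ** 2 + ch ** 2 + eps) \
           + dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps)
```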
Few-shot Text-To-Speech (TTS) aims to synthesize speech that closely resembles the original speaker using only a small amount of training data. However, this approach faces challenges in quickly adapting to new speakers and in improving the similarity between the generated speech and the speaker while ensuring high speech quality. Existing models often overlook changes in model features during different adaptation stages, leading to slow improvement of speech similarity. To address these issues, a meta-learning-guided model for adapting to new speakers was proposed. The model was guided by a meta-feature module during the adaptation process, ensuring the improvement of speech similarity while maintaining the quality of the generated speech during adaptation to new speakers. Furthermore, the differentiation of adaptation stages was achieved through a step encoder, thereby increasing the speed of model adaptation to new speakers. The proposed method was evaluated on the Libri-TTS and VCTK datasets using subjective and objective evaluation metrics. Experimental results show that the Dynamic Time Warping-Mel Cepstral Distortion (DTW-MCD) values of the proposed model are 7.450 2 and 6.524 3 on the two datasets, respectively. It surpasses other meta-learning methods in synthesized speech similarity and enables faster adaptation to new speakers.
A real-time object detection algorithm YOLO-C for complex construction environments was proposed to address the problems of cluttered backgrounds, occluded objects, large object scale ranges, and imbalanced positive and negative samples common in construction environments, as well as the insufficient real-time performance of existing detection algorithms. The extracted low-level features were fused with the high-level features to enhance the global sensing capability of the network, and a small object detection layer was designed to improve the detection accuracy for objects of different scales. A Channel-Spatial Attention (CSA) module was designed to enhance the object features and suppress the background features. In the loss function, VariFocal Loss was used to calculate the classification loss and address the positive-negative sample imbalance. GhostConv was used as the basic convolutional block to construct the GCSP (Ghost Cross Stage Partial) structure, reducing the number of parameters and the amount of computation. For complex construction environments, a concrete construction site object detection dataset was constructed, and comparison experiments of various algorithms were conducted on it. Experimental results demonstrate that YOLO-C has higher detection accuracy and fewer parameters, making it more suitable for object detection tasks in complex construction environments.
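VariFocal Loss, used here for the classification branch, weights positive samples by an IoU-aware quality target and down-weights easy negatives in a focal-style manner. The PyTorch sketch below follows the published formulation; the default alpha and gamma values are assumptions and may differ from YOLO-C's settings.

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_score, alpha=0.75, gamma=2.0):
    """Varifocal loss: positives (target_score > 0, an IoU-aware quality target)
    are weighted by the target score itself; negatives are focally down-weighted."""
    p = pred_logits.sigmoid()
    weight = torch.where(target_score > 0,
                         target_score,            # positive: weight by quality score
                         alpha * p.pow(gamma))    # negative: down-weight easy samples
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_score, reduction='none')
    return (weight * bce).mean()
```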
Speech emotion recognition has been widely used in multi-scenario intelligent systems in recent years, and it also makes it possible to realize intelligent analysis of teaching behaviors in smart classroom environments. Classroom speech emotion recognition technology can be used to automatically recognize the emotional states of teachers and students during classroom teaching, helping teachers understand their own teaching styles and grasp students' classroom learning status in a timely manner, thereby achieving the purpose of precise teaching. For the classroom speech emotion recognition task, firstly, classroom teaching videos were collected from primary and secondary schools, and the audio was extracted and then manually segmented and annotated to construct a primary and secondary school teaching speech emotion corpus containing six emotion categories. Secondly, based on a Temporal Convolutional Network (TCN) and a cross-gated mechanism, dual temporal convolution channels were designed to extract multi-scale cross-fusion features. Finally, a dynamic weight fusion strategy was adopted to adjust the contributions of features at different scales, reduce the interference of non-important features on the recognition results, and further enhance the representation and learning ability of the model. Experimental results show that the proposed method is superior to advanced models such as TIM-Net (Temporal-aware bI-direction Multi-scale Network), GM-TCNet (Gated Multi-scale Temporal Convolutional Network), and CTL-MTNet (CapsNet and Transfer Learning-based Mixed Task Net) on multiple public datasets, and its UAR (Unweighted Average Recall) and WAR (Weighted Average Recall) reach 90.58% and 90.45% respectively on the real classroom speech emotion recognition task.
At present, image super-resolution networks based on deep learning are mainly implemented by convolution. Compared with traditional Convolutional Neural Networks (CNNs), the main advantage of the Transformer in image super-resolution tasks is its long-distance dependency modeling ability. However, most Transformer-based image super-resolution models cannot establish global dependencies with small parameter counts and few network layers, which limits model performance. In order to establish global dependencies in the super-resolution network, an image Super-Resolution network based on a Global Dependency Transformer (GDTSR) was proposed. Its main component is the Residual Square Axial Window Block (RSAWB); in the Transformer residual layer, axial windows and self-attention were used to make each pixel globally dependent on the entire feature map. In addition, the super-resolution image reconstruction modules of most current image super-resolution models are composed of convolutions; in order to dynamically integrate the extracted feature information, Transformer and convolution were combined to jointly reconstruct super-resolution images. Experimental results show that the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) of GDTSR on five standard test sets, including Set5, Set14, B100, Urban100 and Manga109, are optimal at three scale factors (×2, ×3, ×4), and the performance improvement is especially obvious on the large-scale datasets Urban100 and Manga109.
To solve the problems of insufficient utilization of residual features and loss of details in existing residual networks, a deep neural network model combining a two-layer residual aggregation structure and a dual-attention mechanism with receptive field expansion was proposed for Single Image Super-Resolution (SISR) reconstruction. In this model, a two-layer nested residual aggregation network structure was constructed through skip connections to hierarchically aggregate and fuse the residual information extracted by each layer of the network, thereby reducing the loss of residual information containing image details. Meanwhile, a multi-scale receptive field expansion module was designed to capture a larger range of context-dependent information at different scales for effective extraction of deep residual features, and a spatial-channel dual attention mechanism was introduced to enhance the discriminative learning ability of the residual network, thus improving the quality of reconstructed images. Quantitative and qualitative assessments were performed on the benchmark datasets Set5, Set14, B100 and Urban100 for comparison with mainstream methods. The objective evaluation results indicate that the proposed method outperforms the comparison methods on all four datasets; compared with the classical SRCNN (Super-Resolution using Convolutional Neural Network) model and the second-best comparison model ISRN (Iterative Super-Resolution Network), the proposed model improves the average Peak Signal-to-Noise Ratio (PSNR) by 1.91, 1.71 and 1.61 dB and by 0.06, 0.04 and 0.04 dB, respectively, at magnification factors of 2, 3 and 4. Visual results show that the proposed model reconstructs clearer image details and textures.
The automatic segmentation of brain lesions provides a reliable basis for the timely diagnosis and treatment of stroke patients and the formulation of diagnosis and treatment plans, but obtaining large-scale labeled data is expensive and time-consuming. Semi-Supervised Learning (SSL) methods alleviate this problem by utilizing a large number of unlabeled images together with a limited number of labeled images. Aiming at the two problems of pseudo-label noise in SSL and the limited ability of existing Three-Dimensional (3D) networks to focus on small objects, a semi-supervised method for stroke lesion segmentation, RPE-CPS (Rectified Cross Pseudo Supervision with Project & Excite modules), was proposed. First, the data was input into two 3D U-Net segmentation networks with the same structure but different initializations, and the obtained pseudo-segmentation maps were used for cross-supervised training of the segmentation networks, making full use of the pseudo-label data to expand the training set and encouraging high similarity between the predictions of differently initialized networks for the same input image. Second, a correction strategy for the cross pseudo supervision approach based on uncertainty estimation was designed to reduce the impact of noise in the pseudo-labels. Finally, in the 3D U-Net segmentation networks, Project & Excite (PE) modules were added behind each encoder module, decoder module and bottleneck module to improve the segmentation performance for small object classes. To verify the effectiveness of the proposed method, evaluation experiments were carried out on the Acute Ischemic Stroke (AIS) dataset of a cooperating hospital and the Ischemic Stroke Lesion Segmentation Challenge (ISLES2022) dataset. The experimental results show that, when only 20% of the labeled data in the training set is used, the Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD95), and Average Surface Distance (ASD) on the public ISLES2022 dataset reach 73.87%, 6.08 mm and 1.31 mm, and on the AIS dataset they reach 67.74%, 15.38 mm and 1.05 mm, respectively. Compared with the state-of-the-art semi-supervised method Uncertainty Rectified Pyramid Consistency (URPC), DSC is improved by 2.19 and 3.43 percentage points, respectively. The proposed method can effectively utilize unlabeled data to improve segmentation accuracy, outperforms other semi-supervised methods, and is robust.
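Cross pseudo supervision, the basis of RPE-CPS, trains two differently initialized networks so that each is supervised by the hard pseudo-labels of the other on unlabeled data. The following simplified PyTorch sketch shows one such training step; the uncertainty-based rectification and the PE modules of RPE-CPS are omitted, and the function signature is illustrative.

```python
import torch
import torch.nn.functional as F

def cps_step(net_a, net_b, labeled_x, labels, unlabeled_x, cps_weight=1.0):
    """One simplified cross pseudo supervision step for two segmentation networks."""
    # supervised term on labeled data
    loss_sup = F.cross_entropy(net_a(labeled_x), labels) + \
               F.cross_entropy(net_b(labeled_x), labels)

    # pseudo-labels: argmax of the peer network's prediction (no gradient through labels)
    logits_a, logits_b = net_a(unlabeled_x), net_b(unlabeled_x)
    pseudo_a = logits_a.argmax(dim=1).detach()
    pseudo_b = logits_b.argmax(dim=1).detach()
    loss_cps = F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)

    return loss_sup + cps_weight * loss_cps
```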
Automatic identification of marine ships plays an important role in alleviating the pressure of marine traffic. To address the problem of low automatic ship identification rates, a ship identification model based on ResNet50 (Residual Network50) and an improved attention mechanism was proposed. Firstly, a ship dataset was constructed and divided into training, validation and test sets, which were augmented by blurring and adding noise. Secondly, an improved attention module, the Efficient Spatial Pyramid Attention Module (ESPAM), and a ship type recognition model ResNet50_ESPAM were designed. Finally, ResNet50_ESPAM was trained and validated on the ship dataset and compared with other commonly used neural network models. The experimental results show that on the validation set, the highest accuracy of ResNet50_ESPAM is 95.5% and the initial accuracy is 81.2%; compared with AlexNet (Alex Krizhevsky Network), GoogleNet (Google Inception Net), ResNet34 (Residual Network34), ResNet50 and ResNet50_CBAM (ResNet50_Convolutional Block Attention Module), the maximum validation accuracy of the model increases by 5.1, 4.9, 2.6, 1.6 and 1.4 percentage points respectively, and the initial validation accuracy increases by 49.4, 44.7, 27.7, 3.0 and 2.1 percentage points respectively, indicating that ResNet50_ESPAM has high accuracy in ship type recognition and that the improved attention module ESPAM is highly effective.
Existing learning-based single-image deraining networks mostly focus on the effect of rain streaks in rainy images on visual imaging, while ignoring the effect of fog caused by the increased air humidity in rainy environments, leading to problems such as low generation quality and blurred texture details in the derained images. To address these problems, an asymmetric unsupervised end-to-end image deraining network model was proposed. It mainly consists of a rain and fog removal network, a rain and fog feature extraction network and a rain and fog generation network, which form two data domain mapping and conversion modules: Rain-Clean-Rain and Clean-Rain-Clean. The three sub-networks constitute two parallel transformation paths: the rain removal path and the rain-fog feature extraction path. In the rain-fog feature extraction path, a rain-fog-aware extraction network based on global and local attention mechanisms was proposed to learn rain-fog related features by exploiting the global self-similarity and local discrepancy of rain-fog features. In the rain removal path, a rainy image degradation model and the extracted rain-fog related features were introduced as prior knowledge to enhance the ability of rain-fog image generation, so as to constrain the rain-fog removal network and improve its mapping conversion capability from the rain data domain to the rain-free data domain. Extensive experiments on different rain image datasets show that, compared with the state-of-the-art deraining method CycleDerain, the Peak Signal-to-Noise Ratio (PSNR) is improved by 31.55% on the synthetic rain-fog dataset HeavyRain. The proposed model can adapt to different rainy scenarios, has better generalization, and can better recover the details and texture information of images.
Deploying the YOLOv8L model on edge devices for road crack detection can achieve high accuracy, but it is difficult to guarantee real-time detection. To solve this problem, a target detection algorithm based on an improved YOLOv8 model that can be deployed on the edge computing device Jetson AGX Xavier was proposed. First, the Faster Block structure was designed using partial convolution to replace the Bottleneck structure in the YOLOv8 C2f module, and the improved C2f module was denoted C2f-Faster; second, an SE (Squeeze-and-Excitation) channel attention layer was connected after each C2f-Faster module in the YOLOv8 backbone network to further improve the detection accuracy. Experimental results on the open source road damage dataset RDD20 (Road Damage Detection 20) show that the average F1 score of the proposed method is 0.573, the number of detection Frames Per Second (FPS) is 47, and the model size is 55.5 MB. Compared with the SOTA (State-Of-The-Art) model of GRDDC2020 (Global Road Damage Detection Challenge 2020), the F1 score is increased by 0.8 percentage points, the FPS is increased by 291.7%, and the model size is reduced by 41.8%, realizing real-time and accurate detection of road cracks on edge devices.
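Both building blocks named above (partial convolution in the Faster Block and the SE attention layer) have standard published formulations; the PyTorch sketch below illustrates them in minimal form. The channel split ratio and reduction factor are assumed defaults, not necessarily the values used in the improved YOLOv8 model.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """FasterNet-style partial convolution: a 3x3 conv on only the first
    1/div of the channels; the remaining channels pass through unchanged."""
    def __init__(self, channels, div=4):
        super().__init__()
        self.conv_ch = channels // div
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel gate appended after each improved module."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        # per-channel weights broadcast over spatial dimensions
        return x * self.fc(x).view(x.size(0), -1, 1, 1)
```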
The quality of low-light images is poor, and Low-Light Image Enhancement (LLIE) aims to improve their visual quality. Most LLIE algorithms focus on enhancing luminance and contrast while neglecting details. To solve this issue, a Progressive Enhancement algorithm for low-light images based on Layer Guidance (PELG) was proposed, which enhances images to a suitable illumination level and reconstructs clear details. First, to reduce the task complexity and improve efficiency, the image was decomposed into several frequency components by Laplace Pyramid (LP) decomposition. Secondly, since different frequency components exhibit correlation, a Transformer-based fusion model and a lightweight fusion model were proposed for layer guidance: the Transformer-based model was applied between the low-frequency component and the lowest high-frequency component, and the lightweight model was applied between neighbouring high-frequency components, so that the components were enhanced in a coarse-to-fine manner. Finally, the LP was used to reconstruct the image with uniform brightness and clear details. The experimental results show that the proposed algorithm achieves a Peak Signal-to-Noise Ratio (PSNR) 2.3 dB higher than DSLR (Deep Stacked Laplacian Restorer) on LOL (LOw-Light dataset)-v1 and 0.55 dB higher than UNIE (Unsupervised Night Image Enhancement) on LOL-v2. Compared with other state-of-the-art LLIE algorithms, the proposed algorithm has a shorter runtime and achieves significant improvements in objective and subjective quality, making it more suitable for real scenes.
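Laplace Pyramid decomposition and reconstruction, the backbone of PELG's progressive scheme, can be sketched in a few lines of PyTorch; the blur kernel, number of levels and interpolation choices below are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(x):
    """Fixed 5x5 Gaussian kernel applied per channel (stride 1, padding 2)."""
    k1d = torch.tensor([1., 4., 6., 4., 1.])
    k2d = torch.outer(k1d, k1d)
    k2d = (k2d / k2d.sum()).to(x.dtype).to(x.device)
    kernel = k2d.expand(x.size(1), 1, 5, 5).contiguous()
    return F.conv2d(x, kernel, padding=2, groups=x.size(1))

def laplacian_pyramid(img, levels=3):
    """Decompose an image into high-frequency residuals plus a low-frequency base."""
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(gaussian_blur(current), 2)
        up = F.interpolate(down, size=current.shape[-2:], mode='bilinear', align_corners=False)
        pyramid.append(current - up)   # high-frequency residual at this scale
        current = down
    pyramid.append(current)            # low-frequency component
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition: upsample the base and add the residuals back."""
    img = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        img = F.interpolate(img, size=residual.shape[-2:], mode='bilinear',
                            align_corners=False) + residual
    return img
```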