Introducing attention mechanisms allows the backbone network to learn more discriminative feature representations. However, traditional attention mechanisms control attention complexity either by reducing the channel dimension or by reducing the number of channels while increasing the batch size, which leads to an excessive reduction in the number of channels and the loss of important feature information. To address this issue, a Channel Shuffle Attention (CSA) module was proposed. Firstly, group convolutions were used to learn the attention weights, which controls the complexity of CSA. Secondly, traditional channel shuffle and Deep Channel Shuffle (DCS) were used to enhance the exchange of channel feature information between different groups. Thirdly, inverse channel shuffle was used to restore the original order of the attention weights. Finally, the restored attention weights were multiplied with the original feature map to obtain a more expressive feature map. Experimental results show that on the CIFAR-100 dataset, ResNet50 with CSA has 2.3% fewer parameters and 0.57 percentage points higher Top-1 accuracy than ResNet50 with CA (Coordinate Attention), and 18.4% less computation and 0.27 percentage points higher Top-1 accuracy than ResNet50 with EMA (Efficient Multi-scale Attention). On the COCO2017 dataset, YOLOv5s with CSA improves the mean Average Precision (mAP@50) by 0.5 and 0.2 percentage points compared to YOLOv5s with CA and EMA, respectively. These results show that CSA balances the number of parameters against computational complexity while improving both the accuracy of image classification and the localization capability of object detection.
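The channel shuffle and inverse channel shuffle steps can be illustrated with a minimal sketch on channel indices. It assumes the standard ShuffleNet-style shuffle (view the C channels as a groups × (C/groups) grid, transpose, flatten) and represents each channel by a single scalar; the group convolutions that produce the weights and the Deep Channel Shuffle (DCS) variant are not modeled, and all function names are hypothetical.

```python
def channel_shuffle(channels, groups):
    """ShuffleNet-style shuffle: view C channels as a (groups, C // groups)
    grid, transpose it and flatten, so channels from each group interleave."""
    c = len(channels)
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    n = c // groups
    return [channels[i * n + j] for j in range(n) for i in range(groups)]


def inverse_channel_shuffle(channels, groups):
    """Undo channel_shuffle: position j * groups + i returns to i * n + j."""
    c = len(channels)
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    n = c // groups
    return [channels[j * groups + i] for i in range(groups) for j in range(n)]


def reweight(features, shuffled_weights, groups):
    """Restore attention weights to the original channel order, then scale
    each channel (one scalar per channel in this simplified sketch)."""
    weights = inverse_channel_shuffle(shuffled_weights, groups)
    return [f * w for f, w in zip(features, weights)]


if __name__ == "__main__":
    order = channel_shuffle(list(range(6)), groups=2)
    print(order)                                     # [0, 3, 1, 4, 2, 5]
    print(inverse_channel_shuffle(order, groups=2))  # [0, 1, 2, 3, 4, 5]
```

Because the attention weights are produced in shuffled order, the inverse shuffle is what guarantees that each weight is multiplied with the channel it was computed for.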
Large variations in object scale and aspect ratio make object detection in remote sensing images difficult. To improve detection precision under these conditions, EW-YOLO (Efficient Weighted-YOLO) was proposed as an improvement on the YOLO framework. Firstly, a multi-level feature fusion structure was introduced in the feature fusion part, in which a dual-branch residual module promotes the fusion of features at different scales. Through cascaded feature fusion modules and a cross-layer feature fusion design, the ability to extract objects at different scales was improved, further enhancing detection capability. Secondly, in the prediction part, a weighted detection head was proposed and Weighted Boxes Fusion (WBF) was introduced: each candidate box is weighted by its confidence score and the prediction boxes are generated by fusion, which improves the detection precision of objects with different aspect ratios. Finally, to handle excessively large images, an image resampling technique was proposed, in which images are resampled to appropriate sizes and added to network training, solving the low detection precision of large objects caused by cropping. Experimental results on the DOTA dataset show that the mean Average Precision (mAP) of the proposed method is 77.47%, 1.55 percentage points higher than that of the original YOLO-based method, and that the proposed method outperforms current mainstream methods. The effectiveness of the proposed method is also verified on the HRSC and UCAS-AOD datasets.
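Weighted Boxes Fusion can be sketched for the single-model case: candidate boxes are clustered by IoU in descending score order, and each cluster's coordinates are averaged with the confidence scores as weights. This is a simplified illustration under those assumptions, not EW-YOLO's weighted detection head; the 0.55 IoU threshold is a hypothetical default.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(cluster):
    """Confidence-weighted average of the box coordinates in one cluster."""
    total = sum(score for _, score in cluster)
    return [sum(box[k] * score for box, score in cluster) / total for k in range(4)]


def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Single-model WBF: overlapping boxes are fused instead of discarded."""
    clusters, fused = [], []
    for i in sorted(range(len(boxes)), key=lambda j: scores[j], reverse=True):
        for k, fused_box in enumerate(fused):
            if iou(boxes[i], fused_box) > iou_thr:
                clusters[k].append((boxes[i], scores[i]))
                fused[k] = fuse(clusters[k])
                break
        else:  # no matching cluster: start a new one
            clusters.append([(boxes[i], scores[i])])
            fused.append(list(boxes[i]))
    # Fused confidence: mean score of the boxes merged into each cluster.
    return [(fused[k], sum(s for _, s in c) / len(c)) for k, c in enumerate(clusters)]
```

Unlike NMS, which keeps only the top-scoring box of an overlapping group, WBF lets every overlapping candidate pull the final coordinates toward itself in proportion to its confidence.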
When multiple feature modalities are fused, their noise is superimposed, and the cascaded structures used to reduce the differences between modalities do not fully utilize inter-modal feature information. To address these issues, a cross-modal Dual-stream Alternating Interactive Network (DAINet) was proposed. Firstly, a Dual-stream Alternating Enhancement (DAE) module was constructed to fuse modal features in an interactive dual-branch manner. By learning the mapping relationships between modalities and applying bidirectional feedback adjustment along the InFrared-VISible-InFrared (IR-VIS-IR) and VISible-InFrared-VISible (VIS-IR-VIS) paths, inter-modal noise was cross-suppressed. Secondly, a Cross-Modal Feature Interaction (CMFI) module was constructed, and a residual structure was introduced to integrate low-level and high-level features within and between the infrared and visible modalities, thereby minimizing inter-modal differences and maximizing inter-modal feature utilization. Finally, the effectiveness of the DAE and CMFI modules was verified on a self-constructed infrared-visible multi-modal typhoon dataset and a publicly available RGB-NIR multi-modal dataset. Experimental results demonstrate that, compared to simple cascaded fusion on the self-constructed typhoon dataset, DAINet-based feature fusion improves the overall classification accuracy by 6.61 and 3.93 percentage points for the infrared and visible modalities, respectively, and increases the G-mean by 6.24 and 2.48 percentage points, respectively. These results highlight the generalizability of the proposed method to class-imbalanced classification tasks. On the RGB-NIR dataset, the proposed method improves the overall classification accuracy by 13.47 and 13.90 percentage points for the two test modalities, respectively.
At the same time, comparisons with IFCNN (general Image Fusion framework based on Convolutional Neural Network) and DenseFuse on the self-constructed typhoon dataset show that the proposed method improves the overall classification accuracy for the two test modalities by 9.82 and 6.02 percentage points, and by 17.38 and 1.68 percentage points.
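The G-mean is commonly defined as the geometric mean of per-class recalls, which is why it is informative on class-imbalanced data. The sketch below assumes that definition (the abstract does not state it) and uses hypothetical function names.

```python
import math


def per_class_recall(y_true, y_pred):
    """Recall of each class that appears in y_true."""
    recalls = {}
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls[c] = sum(1 for i in idx if y_pred[i] == c) / len(idx)
    return recalls


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; drops to 0 if any class is missed."""
    r = list(per_class_recall(y_true, y_pred).values())
    return math.prod(r) ** (1.0 / len(r))
```

For example, predicting the majority class everywhere on `y_true = [0, 0, 0, 1]` gives 75% overall accuracy but a G-mean of 0, because the minority class has zero recall.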
Foreign Function Interface (FFI) is a fundamental mechanism for invoking interfaces provided in other programming languages. To reduce the large amount of manual coding required when using FFI, an Automatic Foreign function Interface Generation (AFIG) method was proposed. A reverse source code analysis technique based on abstract syntax trees was employed by AFIG to accurately retrieve, from library binaries, a multilingual intermediate representation in which function interface information is described uniformly. Based on this representation, code generators for different platforms can use a multilingual conversion rule matrix to automatically generate FFI code for various platforms without handcrafting. To further reduce generation time, a dependency analysis-based task aggregation strategy was proposed, which consolidates tasks that have dependencies into monolithic tasks. As a result, blocking and deadlocks were eliminated, and load balancing and scalability on multi-core systems were achieved. Experimental results indicate that, compared with manual coding, AFIG reduces FFI development code by 98.14% and testing code by 41.95%; under the same task, AFIG further reduces development cost by 61.27% compared to SWIG (Simplified Wrapper and Interface Generator). In addition, the code generation efficiency of AFIG increases linearly with computing resources.
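One possible reading of dependency analysis-based task aggregation is to merge tasks connected by dependencies into a single unit (a connected component of the dependency graph), ordered topologically inside, so that independent units never wait on each other. The sketch below assumes that interpretation; `aggregate_tasks` and the task/dependency format are hypothetical, and it relies on `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def aggregate_tasks(tasks, deps):
    """tasks: list of task names; deps: task -> set of tasks it depends on.
    Returns one topologically ordered list per group of connected tasks."""
    parent = {t: t for t in tasks}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for t, ds in deps.items():
        for d in ds:
            parent[find(t)] = find(d)  # union tasks linked by a dependency

    groups = {}
    for t in tasks:
        groups.setdefault(find(t), []).append(t)

    result = []
    for members in groups.values():
        sub = {m: set(deps.get(m, ())) for m in members}
        # static_order() yields each task after all of its dependencies.
        result.append(list(TopologicalSorter(sub).static_order()))
    return result
```

Each returned group could then be submitted as one job to a worker pool such as `concurrent.futures.ProcessPoolExecutor`; since no dependency crosses group boundaries, workers never block on one another.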
A text classification method from Natural Language Processing (NLP) was introduced into criminal psychology to grade the violent tendencies of prisoners scientifically and intelligently. A Criminal semantic Convolutional Hierarchical Attention Network (CCHA-Net), which jointly models two channels, an improved HAN (Hierarchical Attention Network) and TextCNN (Text Convolutional Neural Network), was proposed to grade violent criminal temperament by separately mining the semantic information of crime facts and the basic information of prisoners. Firstly, Focal Loss replaced the cross-entropy loss in both channels to alleviate sample imbalance. Secondly, positional encoding was introduced into the input layers of both channels to improve the perception of positional information, and the HAN channel was improved by using max-pooling to construct salient vectors. Finally, global average pooling replaced the fully connected layers in all output layers to avoid overfitting. Experimental results show that, compared with 17 related baseline models such as AC-BiLSTM (Attention-based Bidirectional Long Short-Term Memory with Convolution layer) and Support Vector Machine (SVM), CCHA-Net achieves the best results on all indicators: its micro-average F1 (Micro_F1) is 99.57%, and its macro-average and micro-average Area Under the Curve (AUC) are 99.45% and 99.89%, respectively, which are 4.08, 5.59 and 0.74 percentage points higher than those of the second-best model, AC-BiLSTM. These results verify that CCHA-Net can effectively perform the violent criminal temperament grading task.
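Focal Loss, which CCHA-Net uses in place of cross-entropy, down-weights well-classified samples by the factor (1 - p_t)^γ. A minimal single-sample, multi-class sketch is shown below, assuming softmax outputs and hypothetical defaults γ = 2 and α = 1; with γ = 0 and α = 1 it reduces to ordinary cross-entropy.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t) for one sample."""
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

Confident correct predictions (p_t close to 1) contribute almost nothing, so training is dominated by hard or minority-class samples, which is the motivation for using it against sample imbalance.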
It is difficult for current object segmentation models to strike a good balance between segmentation performance and inference efficiency. To address this challenge, a self-distillation object segmentation method via scale-attention knowledge transfer was proposed. Firstly, an object segmentation network that uses only backbone features was constructed as the inference network to achieve efficient forward inference. Secondly, a self-distillation learning model via scale-attention knowledge was proposed. On the one hand, a scale-attention pyramid feature module was designed to adaptively capture context information at different semantic levels and extract more discriminative self-distillation knowledge. On the other hand, a distillation loss fusing cross entropy, KL (Kullback-Leibler) divergence and L2 distance was constructed to drive efficient transfer of the distillation knowledge into the segmentation network and improve its generalization. The method was verified on five public object segmentation datasets, including COD (Camouflaged Object Detection), DUT-O (Dalian University of Technology-OMRON) and SOC (Salient Objects in Clutter). Taking the proposed inference network as the baseline, the proposed self-distillation model increases segmentation performance by 3.01% on the Fβ metric, 1.00% higher than the Teacher-Free (TF) self-distillation model. Compared with the recent Residual learning Net (R2Net), the proposed object segmentation network reduces the number of parameters by 2.33×10⁶, improves the inference frame rate by 2.53%, decreases floating-point operations by 40.50%, and increases segmentation performance by 0.51%. Experimental results show that the proposed self-distillation segmentation method balances performance and efficiency and is suitable for scenarios with limited computing and storage resources.
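A distillation loss that fuses cross entropy, KL divergence and L2 distance can be sketched as a weighted sum of the three terms for one pixel or sample. The weights `lam_kl` and `lam_l2`, the direction KL(teacher || student) and the use of Euclidean distance between feature vectors are assumptions for illustration; the paper's exact formulation is not given here.

```python
import math


def cross_entropy(p_student, label):
    """Supervised term against the ground-truth label."""
    return -math.log(p_student[label])


def kl_divergence(p_teacher, p_student):
    """KL(teacher || student) for two discrete distributions."""
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)


def distillation_loss(p_student, p_teacher, label, feat_student, feat_teacher,
                      lam_kl=1.0, lam_l2=1.0):
    """Hypothetical weighted sum of the three terms named in the abstract."""
    return (cross_entropy(p_student, label)
            + lam_kl * kl_divergence(p_teacher, p_student)
            + lam_l2 * math.dist(feat_student, feat_teacher))
```

When the student already matches the teacher's distribution and features, the KL and L2 terms vanish and only the supervised cross-entropy term remains.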
As a classical problem in spatial databases, the Aggregate Nearest Neighbor (ANN) query is important for optimizing network link structures, selecting locations for logistics distribution points and providing car-sharing services, and can contribute to fields such as logistics, the mobile Internet industry and operations research. Existing research has two shortcomings: it lacks an efficient index structure for large-scale dynamic road network data, and its algorithms have low query efficiency when data points move in real time and network weights are updated dynamically. To address these problems, an ANN query algorithm for dynamic scenarios was proposed. Firstly, with G-tree adopted as the road network index, a pruning algorithm combining spatial index structures such as quadtrees and k-d trees with the Incremental Euclidean Restriction (IER) algorithm was proposed to solve ANN queries in static space. Then, to handle frequent updates of data point locations in dynamic scenarios, a time window and a safe zone update strategy were added to reduce the number of algorithm iterations; experimental results show that this improves efficiency by 8% to 85%. Finally, for ANN queries in which road network weights change, two correction-based continuous query algorithms were proposed, which derive the current query results from historical query results according to the increments of the weight changes. In certain scenarios, these algorithms can reduce errors by approximately 50%. Theoretical analysis and experimental results show that the proposed algorithms solve ANN query problems in dynamic scenarios efficiently and more accurately.
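The Euclidean-restriction idea behind IER can be shown with a small sketch. Assuming every edge weight is at least the straight-line length of the edge, the sum of Euclidean distances to the query points is a lower bound on the aggregate (sum) network distance, so candidates are examined in order of that bound and the scan stops once the bound reaches the best network cost found. G-tree indexing, quadtrees/k-d trees and the dynamic-update strategies are not modeled; the graph format and function names are hypothetical.

```python
import heapq
import math


def build_graph(edges):
    """Undirected adjacency list from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    return graph


def dijkstra(graph, src):
    """Shortest network distances from src to every reachable node."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist


def ann_ier(graph, coords, data_points, query_points):
    """Sum-ANN with Euclidean lower-bound pruning (IER-style)."""
    def euclid_bound(p):
        return sum(math.dist(coords[p], coords[q]) for q in query_points)

    best, best_cost = None, math.inf
    for p in sorted(data_points, key=euclid_bound):
        if euclid_bound(p) >= best_cost:
            break  # no remaining candidate can beat the current best
        dist = dijkstra(graph, p)  # expensive network search only when needed
        cost = sum(dist.get(q, math.inf) for q in query_points)
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost
```

The saving comes from skipping the shortest-path search for every candidate whose cheap Euclidean bound already rules it out.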
Moving object detection aims to separate the background and foreground of a video, but commonly used low-rank factorization methods often struggle to handle both dynamic background and intermittent motion. Considering that the skewed noise distribution after background subtraction has a potential background correction effect, a moving object detection model based on reliability low-rank factorization and generalized diversity difference was proposed. The model consists of three steps. Firstly, the peak position and the skewness of each pixel's distribution along the time dimension were used to select a sub-sequence without outlier pixels, and the median of this sub-sequence was taken as the static background. Secondly, the noise remaining after static background subtraction was modeled by an asymmetric Laplace distribution, and the spatially smoothed modeling results were used as reliability weights in low-rank factorization to model the comprehensive background (including dynamic background). Finally, temporal and spatial continuity constraints were applied in turn to extract the foreground. For temporal continuity, a generalized diversity difference constraint was proposed, which suppresses the expansion of foreground edges using the difference information between adjacent video frames. Experimental results show that, compared with six models, namely PCP (Principal Component Pursuit), DECOLOR (DEtecting Contiguous Outliers in the Low-Rank Representation), LSD (Low-rank and structured Sparse Decomposition), TVRPCA (Total Variation regularized Robust Principal Component Analysis), E-LSD (Extended LSD) and GSTO (Generalized Shrinkage Thresholding Operator), the proposed model achieves the highest F-measure, indicating that it effectively improves foreground detection accuracy in complex scenes with dynamic background and intermittent motion.
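The static background step (keep the part of each pixel's temporal sequence around its distribution peak, then take the median) can be approximated as follows. The histogram bin width and peak radius are hypothetical parameters standing in for the paper's skewness-based selection, frames are flattened lists of grey levels, and the asymmetric Laplace modeling is not reproduced.

```python
from statistics import median


def static_background_pixel(values, bin_width=8, radius=1):
    """Median of the samples near the temporal histogram peak of one pixel."""
    counts = {}
    for v in values:
        counts[v // bin_width] = counts.get(v // bin_width, 0) + 1
    peak = max(counts, key=counts.get)  # most frequent intensity bin
    # Discard samples far from the peak (e.g. passing foreground objects).
    kept = [v for v in values if abs(v // bin_width - peak) <= radius]
    return median(kept)


def static_background(frames, **kwargs):
    """frames: list of equally sized flattened frames (grey levels)."""
    return [static_background_pixel(list(seq), **kwargs) for seq in zip(*frames)]
```

For a pixel sequence such as `[100, 101, 200]`, where 200 comes from a passing object, the plain median would be 101 while this sketch returns 100.5, because the outlier is excluded before the median is taken.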
Deep neural networks are vulnerable to adversarial example attacks. Adversarial examples add human-imperceptible perturbations to original images and cause deep neural networks to misclassify them, which poses a security threat to these networks. Therefore, adversarial attack is an important method for evaluating model robustness before deep neural networks are deployed. However, in the black-box setting, the attack success rates of adversarial examples still need to be improved; that is, the transferability of adversarial examples needs to be increased. To address this issue, an adversarial example generation method based on image flipping transformation, FT-MI-FGSM (Flipping Transformation Momentum Iterative Fast Gradient Sign Method), was proposed. Firstly, from the perspective of data augmentation, the original input image was randomly flipped in each iteration of adversarial example generation. Then, the gradient of the transformed image was calculated. Finally, adversarial examples were generated from this gradient, which alleviates overfitting during adversarial example generation and improves transferability. In addition, attacking ensemble models was used to further enhance transferability. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of the proposed algorithm. Compared with I-FGSM (Iterative Fast Gradient Sign Method) and MI-FGSM (Momentum I-FGSM), FT-MI-FGSM improves the average black-box attack success rate against adversarially trained networks by 26.0 and 8.4 percentage points, respectively, in the ensemble-model attack setting.
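The core FT-MI-FGSM loop can be sketched as MI-FGSM with a random flip applied before each gradient computation. Because a flip only permutes pixels, the gradient with respect to the unflipped input is obtained by flipping the gradient back. The sketch assumes a flattened image row with values in [0, 1] and a caller-supplied `grad_fn` (hypothetical) that returns the loss gradient; the step size ε/T and the L1-normalized momentum follow the usual MI-FGSM formulation, and ensemble attacks are not shown.

```python
import random


def sign(v):
    return (v > 0) - (v < 0)


def ft_mi_fgsm(x, grad_fn, eps=0.03, steps=10, mu=1.0, flip_prob=0.5, seed=0):
    """x: flattened pixels in [0, 1]; grad_fn(v) -> dLoss/dv (assumed given)."""
    rng = random.Random(seed)
    alpha = eps / steps
    g = [0.0] * len(x)
    adv = list(x)
    for _ in range(steps):
        if rng.random() < flip_prob:
            # Gradient of L(flip(adv)) w.r.t. adv: flip the gradient back.
            grad = grad_fn(adv[::-1])[::-1]
        else:
            grad = grad_fn(adv)
        l1 = sum(abs(v) for v in grad) or 1.0
        g = [mu * gi + gr / l1 for gi, gr in zip(g, grad)]  # momentum update
        # Signed step, then project into the eps-ball and the valid pixel range.
        adv = [min(max(a + alpha * sign(gi), xi - eps, 0.0), xi + eps, 1.0)
               for a, gi, xi in zip(adv, g, x)]
    return adv
```

With `flip_prob=0` the function reduces to plain MI-FGSM, which makes the flip the only difference between the two attacks in this sketch.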
To address the slow detection and low recognition accuracy of road traffic signs in Chinese intelligent driving assistance systems, an improved road traffic sign detection algorithm based on YOLOv3 (You Only Look Once version 3) was proposed. Firstly, MobileNetv2 was introduced into YOLOv3 as the basic feature extraction network to construct the object detection module MN-YOLOv3 (MobileNetv2-YOLOv3), and two Down-up links were added to the backbone of MN-YOLOv3 for feature fusion, which reduces the model parameters and improves both the running speed of the detection module and the information fusion of multi-scale feature maps. Then, according to the shape characteristics of traffic signs, the K-Means++ algorithm was used to generate the initial cluster centers of the anchors, the DIOU (Distance Intersection Over Union) loss function was introduced for bounding box regression, and DIOU was combined with Non-Maximum Suppression (NMS). Finally, the Region Of Interest (ROI) and its context information were unified by ROI Align and merged to enhance object feature expression. Experimental results show that the proposed algorithm performs better, reaching a mean Average Precision (mAP) of 96.20% on the CSUST (ChangSha University of Science and Technology) Chinese Traffic Sign Detection Benchmark (CCTSDB). Compared with the Faster R-CNN (Region Convolutional Neural Network), YOLOv3 and Cascaded R-CNN detection algorithms, the proposed algorithm has better real-time performance and higher detection accuracy, and is more robust to various environmental changes.
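DIOU subtracts a normalized centre-distance penalty from IoU: DIoU = IoU - ρ²(b, b_gt)/c², where ρ is the distance between box centres and c is the diagonal of the smallest enclosing box. The regression loss is 1 - DIoU, and DIoU-NMS suppresses boxes by DIoU instead of plain IoU. A minimal sketch with a hypothetical NMS threshold of 0.5:

```python
def diou(a, b):
    """DIoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between box centres.
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - rho2 / c2 if c2 > 0 else iou


def diou_loss(pred, target):
    return 1.0 - diou(pred, target)


def diou_nms(boxes, scores, thr=0.5):
    """Keep the highest-scoring box, drop others whose DIoU with it reaches thr."""
    order = sorted(range(len(boxes)), key=lambda j: scores[j], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order if diou(boxes[best], boxes[j]) < thr]
    return keep
```

Unlike an IoU loss, DIoU still provides a useful signal when boxes do not overlap: two touching but disjoint boxes have IoU 0, yet their DIoU loss depends on how far apart their centres are.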
To address the low embedding capacity of Reversible Data Hiding (RDH) in encrypted videos, a high-capacity RDH scheme for encrypted videos based on histogram shifting was proposed. Firstly, the 4×4 luminance intra-prediction modes and the sign bits of Motion Vector Differences (MVDs) were encrypted with a stream cipher. Then, a two-dimensional histogram of MVDs was constructed, and a (0,0) symmetric histogram shifting algorithm was designed. Finally, the (0,0) symmetric histogram shifting algorithm was performed in the encrypted MVD domain to realize separable RDH in encrypted videos. Experimental results show that the embedding capacity of the proposed scheme is 263.3% higher on average than that of the comparison schemes, the average Peak Signal-to-Noise Ratio (PSNR) of the encrypted video is below 15.956 dB, and the average PSNR of the decrypted video containing the secret data exceeds 30 dB. The proposed scheme effectively improves embedding capacity and is applicable to more types of video sequences.
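The general principle of histogram shifting can be illustrated on a single MVD component with peak bin 0: positive values shift up by one to free bin 1, each 0 then carries one bit, and extraction restores the original values exactly. This is a one-dimensional simplification for illustration only; it is not the paper's two-dimensional (0,0) symmetric algorithm and ignores the stream-cipher encryption.

```python
def hs_embed(values, bits):
    """Peak bin 0: each 0 becomes the next payload bit; positives shift up by one."""
    out, it = [], iter(bits)
    for v in values:
        if v > 0:
            out.append(v + 1)        # make room at bin 1
        elif v == 0:
            out.append(next(it, 0))  # embed a bit; pad with 0 once payload ends
        else:
            out.append(v)            # negative values are left unchanged
    return out


def hs_extract(marked):
    """Recover the embedded bits and restore the original values exactly."""
    bits, restored = [], []
    for v in marked:
        if v in (0, 1):
            bits.append(v)
            restored.append(0)
        elif v > 1:
            restored.append(v - 1)
        else:
            restored.append(v)
    return bits, restored
```

Capacity in this sketch equals the number of zero values, which is why a histogram sharply peaked at zero, as MVDs typically are, suits histogram shifting.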
A Non-Orthogonal Multiple Access (NOMA)-based computation offloading and bandwidth allocation strategy was presented to address the insufficient computing capacity of mobile devices and the limited spectrum resources in 5G ultra-dense networks. Firstly, the system model was analyzed, and on this basis the research problem was formally defined with the objective of minimizing the computation cost of devices. Then, the problem was decomposed into three sub-problems, namely device computation offloading, system bandwidth allocation, and device grouping and matching, which were solved by simulated annealing, the interior point method and a greedy algorithm, respectively. Finally, a joint optimization algorithm was used to solve these sub-problems alternately, yielding the optimal computation offloading and bandwidth allocation strategy. Simulation results show that the proposed joint optimization strategy outperforms traditional Orthogonal Multiple Access (OMA) and achieves lower device computation cost than NOMA with average bandwidth allocation.
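The simulated-annealing sub-step for offloading decisions can be sketched under a hypothetical cost model in which offloading devices share bandwidth equally, so each offloader's transmission cost grows with the number of offloaders. The interior point bandwidth allocation and greedy grouping steps are not modeled, and all costs and parameters are illustrative.

```python
import math
import random


def total_cost(decision, local_cost, tx_cost):
    """Hypothetical cost: k offloaders share bandwidth, so each pays tx_cost * k."""
    k = sum(decision)
    return sum(tx_cost[i] * k if d else local_cost[i] for i, d in enumerate(decision))


def anneal_offloading(local_cost, tx_cost, t0=10.0, cooling=0.95, iters=500, seed=0):
    rng = random.Random(seed)
    cur = [0] * len(local_cost)  # start with every device computing locally
    cur_cost = total_cost(cur, local_cost, tx_cost)
    best, best_cost = cur[:], cur_cost
    t = t0
    for _ in range(iters):
        cand = cur[:]
        cand[rng.randrange(len(cand))] ^= 1  # flip one device's decision
        c = total_cost(cand, local_cost, tx_cost)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / t):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand[:], c
        t = max(t * cooling, 1e-9)  # geometric cooling schedule
    return best, best_cost
```

Occasionally accepting a worse decision at high temperature lets the search escape local minima created by the coupling between devices through the shared bandwidth.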
Authorship attribution is the task of determining the author of a given document. However, traditional authorship attribution methods are target-independent and consider no constraint when predicting authorship, which is inconsistent with practical problems. To address this issue, a Target-Dependent method for Authorship Attribution (TDAA) was proposed. Firstly, the product ID corresponding to each user review was chosen as the constraint information. Secondly, Bidirectional Encoder Representations from Transformers (BERT) was used to extract pre-trained review text features, making the text modeling process more universal. Thirdly, a Convolutional Neural Network (CNN) was used to extract deep features of the text. Finally, two fusion methods were proposed to fuse these two types of information. Experimental results on the Amazon Movie_and_TV and CDs_and_Vinyl_5 datasets show that the proposed method improves accuracy by 4%-5% over the comparison methods.
Heterogeneous network representation learning considers only the structural social relations and ignores semantics. To address this problem, a representation learning algorithm based on the topic-attention network was proposed, which combines the social relationships between users with users' preferences for topics. Firstly, according to the characteristics of the topic-attention network and the identical-discrepancy-contrary (determination and uncertainty) idea of set pair analysis theory, a transition probability model was given. Then, a random walk algorithm over two types of nodes was proposed based on this transition probability model to obtain relatively high-quality random walk sequences. Finally, the embedding vector space representation of the topic-attention network was obtained by modeling the two types of nodes in the sequences. Theoretical analysis and experimental results on the Douban dataset show that the random walk algorithm combined with the transition probability model analyzes the connections between nodes in the network more comprehensively. When the number of communities is 13, the modularity of the proposed algorithm is 0.6998, nearly 5% higher than that of the metapath2vec algorithm, indicating that it captures more detailed information in the network.
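Modularity, the metric reported above, is commonly computed as Q = Σ_c [L_c/m - (d_c/2m)²], where m is the number of edges, L_c is the number of edges inside community c and d_c is the total degree of its nodes. A minimal sketch for an unweighted, undirected graph follows; the edge-list and community-map formats are assumptions.

```python
def modularity(edges, community):
    """Newman modularity of a partition.
    edges: list of (u, v) pairs; community: dict node -> community id."""
    m = len(edges)
    internal, degree = {}, {}
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree[c] = degree.get(c, 0) + 1  # each edge adds 1 to both endpoints
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())
```

Two disjoint triangles split into their natural communities give Q = 0.5, whereas putting every node into one community gives Q = 0; higher values indicate denser intra-community connections than expected by chance.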
In multi-camera systems with non-overlapping fields of view, single-shot person identification methods cannot handle appearance and viewpoint changes well. Based on multiple frames acquired from surveillance cameras, a new technique combining a Hidden Markov Model (HMM) with appearance-based features was proposed. First, considering the structural constraints of the human body, the whole-body appearance of each individual was divided vertically into equal sub-images. Then, a multi-level threshold method was used to extract Segment Representative Color (SRC) and Segment Standard Variation (SSV) features. The feature dataset acquired from multiple frames was used to train continuous density HMMs, and final recognition was performed with these trained models. Extensive experiments on two public datasets show that the proposed method achieves a high recognition rate, improves robustness against viewpoint changes and low resolution, and is simple and easy to implement.
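Recognition with per-person continuous density HMMs amounts to scoring a frame sequence under each trained model with the forward algorithm and choosing the most likely one. The sketch below assumes scalar observations with one Gaussian per state for brevity (the SRC/SSV features would be vectors); model parameters are taken as given, training is not shown, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def forward_log_likelihood(obs, start, trans, means, variances):
    """Scaled forward algorithm for an HMM with one Gaussian per state."""
    n = len(start)
    alpha = [start[s] * gaussian_pdf(obs[0], means[s], variances[s]) for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n))
                     * gaussian_pdf(obs[t], means[s], variances[s]) for s in range(n)]
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def identify(obs, models):
    """models: person id -> (start, trans, means, variances); pick the best fit."""
    return max(models, key=lambda pid: forward_log_likelihood(obs, *models[pid]))
```

Using multiple frames means the decision rests on a whole sequence likelihood rather than a single snapshot, which is what gives the multi-frame approach its robustness to viewpoint changes.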
Images captured in hazy weather suffer from poor contrast and low visibility. This paper proposes a single image defogging algorithm that removes haze by exploiting the characteristics of the HSI color space. Firstly, the original image is converted from the RGB color space to the HSI color space. Then, based on the different effects of haze on hue, saturation and intensity, a defogging model is established. Finally, the weight range of the saturation model is obtained by analyzing the saturation of the original image, the weight range of the intensity model is then estimated, and the original image is defogged. Compared with other algorithms, the proposed method doubles running efficiency and effectively enhances clarity, so it is appropriate for single image defogging.
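The RGB-to-HSI conversion step can be written with the standard geometric formulas: I = (R + G + B)/3, S = 1 - min(R, G, B)/I, and H obtained from an arccos of the channel differences. This sketch covers only the conversion; the defogging weights for saturation and intensity are not specified in the abstract and are not modeled.

```python
import math


def rgb_to_hsi(r, g, b):
    """r, g, b in [0, 1]; returns (H in degrees, S, I)."""
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        h = 0.0  # hue is undefined for grey pixels; 0 by convention
    else:
        # Clamp to [-1, 1] to guard acos against rounding error.
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        h = theta if b <= g else 360.0 - theta
    return h, s, i
```

Separating intensity and saturation from hue is what allows a defogging model to adjust contrast-related components while leaving colour hue untouched.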
To address the location verification problem caused by collusion attacks in Vehicular Ad Hoc NETworks (VANET), a multi-round voting location verification mechanism based on weight and difference was proposed. In this mechanism, a static frame was introduced and the Beacon message format was redesigned to reduce the time delay of location verification. With a malicious vehicle filtering process, the position in a specific region is voted on by neighbors with different degrees of trust, which yields credible position verification. Experimental results show that under collusion attacks, the scheme achieves an accuracy of 93.4%, higher than that of the Minimum Mean Square Estimation (MMSE)-based location verification mechanism.
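A trust-weighted, multi-round vote can be sketched as follows: neighbours below a trust threshold are filtered out as suspected colluders, each round accepts the claimed position if the trusted weight in favour exceeds half, and the final decision is the majority over decided rounds. The trust values, threshold and majority rule are hypothetical; the "difference" component of the mechanism and the Beacon format are not modeled.

```python
def weighted_vote(votes, trust, min_trust=0.3):
    """votes: neighbour -> True/False; neighbours below min_trust are filtered out."""
    kept = [(ok, trust[n]) for n, ok in votes.items() if trust.get(n, 0.0) >= min_trust]
    total = sum(w for _, w in kept)
    if total == 0:
        return None  # no trustworthy voter in this round
    return sum(w for ok, w in kept if ok) / total > 0.5


def multi_round_verify(rounds, trust, min_trust=0.3):
    """Majority decision over the rounds that reached a verdict."""
    decided = [r for r in (weighted_vote(v, trust, min_trust) for v in rounds)
               if r is not None]
    if not decided:
        return None
    return sum(decided) > len(decided) / 2
```

Filtering low-trust voters before weighting means that colluding vehicles cannot overturn the verdict simply by outnumbering honest neighbours.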