Against the background of growing emphasis on data ownership and privacy protection, federated learning, as a new machine learning paradigm, can solve the problems of data silos and privacy protection without exposing the data of the participants. Since modeling methods based on federated learning have become mainstream and achieved good results, it is significant to summarize and analyze the concepts, technologies, applications and challenges of federated learning. Firstly, the development process of machine learning and the inevitability of the emergence of federated learning were elaborated, and the definition and classification of federated learning were given. Secondly, the three federated learning methods currently recognized by the industry (horizontal federated learning, vertical federated learning and federated transfer learning) were introduced and analyzed. Thirdly, concerning the privacy protection issue of federated learning, the existing common privacy protection technologies were generalized and summarized. In addition, the recent mainstream open-source frameworks were introduced and compared, and application scenarios of federated learning were given. Finally, the challenges and future research directions of federated learning were discussed.
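To make the horizontal federated learning workflow concrete, the following is a minimal federated averaging sketch in the FedAvg style; the model, client data loaders, and hyperparameters are illustrative assumptions, not a specific framework's API.

```python
import copy
import torch
import torch.nn as nn

def fedavg_round(global_model, client_loaders, lr=0.01, local_epochs=1):
    """One round of horizontal federated learning, FedAvg-style (sketch)."""
    client_states = []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)        # each client starts from the global model
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(local_epochs):              # local training on private data only
            for x, y in loader:
                opt.zero_grad()
                loss_fn(local(x), y).backward()
                opt.step()
        client_states.append(local.state_dict())   # only parameters leave the client
    # server averages parameters; raw data is never exchanged
    avg = {k: torch.stack([s[k].float() for s in client_states]).mean(0)
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```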
Federated learning was proposed to resolve the contradiction between the demand for data sharing and the requirements of privacy protection. As a form of distributed machine learning, federated learning requires a large number of model parameters to be exchanged between the participants and the central server, resulting in high communication overhead. At the same time, federated learning is increasingly deployed on mobile devices with limited communication bandwidth and limited power, and the limited network bandwidth and sharply growing number of clients make the communication bottleneck worse. To address the communication bottleneck of federated learning, the basic workflow of federated learning was analyzed first; then, from the perspective of methodology, three mainstream types of methods, based respectively on reducing the frequency of model updates, model compression and client selection, as well as special methods such as model partition, were introduced, and a deep comparative analysis of specific optimization schemes was carried out. Finally, research trends in reducing the communication overhead of federated learning were summarized and discussed.
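As a concrete instance of the model compression family of methods mentioned above, the following sketch shows top-k sparsification of a model update, where only the largest-magnitude entries (values plus indices) are transmitted instead of the dense tensor; the ratio parameter and tensor shapes are illustrative assumptions.

```python
import torch

def topk_sparsify(update, ratio=0.01):
    """Keep only the largest-magnitude fraction of an update tensor;
    sending (values, indices) instead of the dense tensor cuts upload cost."""
    flat = update.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)          # positions of the top-k magnitudes
    return flat[indices], indices, update.shape     # signed values at those positions

def topk_restore(values, indices, shape):
    """Server side: scatter the sparse update back into a dense tensor."""
    dense = torch.zeros(shape).reshape(-1)
    dense[indices] = values
    return dense.reshape(shape)

update = torch.randn(256, 128)                      # a hypothetical layer update
vals, idx, shape = topk_sparsify(update, ratio=0.01)
restored = topk_restore(vals, idx, shape)           # sparse approximation of the update
```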
Multi-modal medical images can provide clinicians with rich information about target areas (such as tumors, organs or tissues). However, effective fusion and segmentation of multi-modal images is still a challenging problem due to the independence and complementarity of the modalities. Traditional image fusion methods have difficulty addressing this problem, which has led to widespread research on deep learning-based multi-modal medical image segmentation algorithms. Multi-modal medical image segmentation based on deep learning was reviewed in terms of principles, techniques, problems, and prospects. Firstly, the general theory of deep learning and multi-modal medical image segmentation was introduced, including the basic principles and development of deep learning and the Convolutional Neural Network (CNN), as well as the importance of the multi-modal medical image segmentation task. Secondly, the key concepts of multi-modal medical image segmentation were described, including data dimension, preprocessing, data augmentation, loss functions, and post-processing. Thirdly, multi-modal segmentation networks based on different fusion strategies were summarized and analyzed. Finally, several common problems in medical image segmentation were discussed, and a summary and prospects for future research were given.
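To illustrate the simplest of the fusion strategies such surveys discuss, input-level (early) fusion, here is a minimal sketch that stacks four hypothetical MRI modalities as channels before a segmentation network; the modality names, image size, and the tiny stand-in network are assumptions.

```python
import torch
import torch.nn as nn

# Early fusion: four MRI modalities (e.g. T1, T1c, T2, FLAIR) are stacked
# as input channels and processed by a single segmentation network.
t1, t1c, t2, flair = (torch.randn(1, 1, 128, 128) for _ in range(4))
fused = torch.cat([t1, t1c, t2, flair], dim=1)      # (N, 4, H, W)

head = nn.Sequential(                               # stand-in for a full U-Net-style model
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 1),                            # 2 classes: background / lesion
)
logits = head(fused)                                # (N, 2, H, W) per-pixel class scores
```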
The multi-scale features of time series contain abundant category information, and these features have different importance for classification. However, existing univariate time series classification models conventionally extract series features by convolutions with a fixed kernel size, and are therefore unable to acquire and focus on important multi-scale features effectively. To solve this problem, a Multi-scale Convolution and Attention mechanism (MCA) based Long Short-Term Memory (LSTM) model (MCA-LSTM) was proposed, which is capable of concentrating on and fusing important multi-scale features to achieve more accurate classification. In this structure, LSTM controls the transmission of series information through memory cells and a gate mechanism, fully extracting the correlation information of the time series; the Multi-scale Convolution Module (MCM) extracts the multi-scale features of the series through Convolutional Neural Networks (CNNs) with different kernel sizes; the Attention Module (AM) fuses channel information to obtain the importance of features and assign attention weights, enabling the network to focus on important time series features. Experimental results on 65 univariate time series datasets from the UCR archive show that, compared with the state-of-the-art time series classification methods Unsupervised Scalable Representation Learning-FordA (USRL-FordA), Unsupervised Scalable Representation Learning-Combined (1-Nearest Neighbor) (USRL-Combined (1-NN)), Omni-Scale Convolutional Neural Network (OS-CNN), InceptionTime and Robust Temporal Feature Network for time series classification (RTFN), MCA-LSTM reduces the Mean Error (ME) by 7.48, 9.92, 2.43, 2.09 and 0.82 percentage points respectively, and achieves the best Arithmetic Mean Rank (AMR) and Geometric Mean Rank (GMR) of 2.14 and 3.23 respectively. These results fully demonstrate the effectiveness of MCA-LSTM in the classification of univariate time series.
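A minimal sketch of the described architecture is given below, combining an LSTM branch, parallel convolutions with kernel sizes 3, 5 and 7, and an SE-style channel attention module; the channel counts and the attention design are assumptions where the abstract leaves details open.

```python
import torch
import torch.nn as nn

class MCALSTMSketch(nn.Module):
    """Hypothetical reconstruction: LSTM branch plus multi-scale convolutions
    whose channels are re-weighted by an SE-style attention module."""
    def __init__(self, channels=32, n_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(1, channels, batch_first=True)
        self.convs = nn.ModuleList(
            nn.Conv1d(1, channels, k, padding=k // 2) for k in (3, 5, 7))
        self.attn = nn.Sequential(                  # channel attention weights
            nn.Linear(3 * channels, 3 * channels // 4), nn.ReLU(),
            nn.Linear(3 * channels // 4, 3 * channels), nn.Sigmoid())
        self.fc = nn.Linear(4 * channels, n_classes)

    def forward(self, x):                           # x: (N, T) univariate series
        h, _ = self.lstm(x.unsqueeze(-1))           # temporal correlation features
        feats = torch.cat([c(x.unsqueeze(1)) for c in self.convs], dim=1)
        pooled = feats.mean(dim=2)                  # (N, 3*channels) multi-scale summary
        pooled = pooled * self.attn(pooled)         # focus on important scales
        return self.fc(torch.cat([h[:, -1], pooled], dim=1))

logits = MCALSTMSketch()(torch.randn(8, 200))       # 8 series of length 200
```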
By using complex pre-training objectives and a large number of model parameters, a Pre-Training Model (PTM) can effectively capture rich knowledge from unlabeled data. However, the development of multimodal PTMs is still in its infancy. According to the differences between modalities, most current multimodal PTMs were divided into image-text PTMs and video-text PTMs. According to the different data fusion methods, multimodal PTMs were further divided into single-stream models and two-stream models. Firstly, common pre-training tasks and the downstream tasks used in validation experiments were summarized. Secondly, common models in the area of multimodal pre-training were reviewed, and the downstream tasks of each model as well as the performance and experimental data of these models were listed in tables for comparison. Thirdly, the application scenarios of the M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer), Cross-modal Prompt Tuning (CPT), VideoBERT (Video Bidirectional Encoder Representations from Transformers), and AliceMind (Alibaba's collection of encoder-decoders from Mind) models in specific downstream tasks were introduced. Finally, the challenges and future research directions of multimodal PTM work were summarized.
With the widespread application of deep learning, human beings rely increasingly on a large number of complex systems that adopt deep learning techniques. However, the black-box nature of deep learning models poses challenges to the use of these models in mission-critical applications and raises ethical and legal concerns. Therefore, making deep learning models interpretable is the first problem to be solved in making them trustworthy. As a result, research in the field of interpretable artificial intelligence has emerged, mainly focusing on explaining model decisions or behaviors explicitly to human observers. A review of interpretability for deep learning was performed to build a good foundation for further in-depth research and the establishment of more efficient and interpretable deep learning models. Firstly, the interpretability of deep learning was outlined, and the requirements and definitions of interpretability research were clarified. Then, several typical models and algorithms of interpretability research were introduced from three aspects: explaining the logic rules, the decision attribution and the internal structure representation of deep learning models. In addition, three common methods for constructing intrinsically interpretable models were pointed out. Finally, the four evaluation criteria of fidelity, accuracy, robustness and comprehensibility were introduced briefly, and possible future development directions of deep learning interpretability were discussed.
To address the problem that traditional single-factor methods cannot make full use of the relevant information of time series and therefore have poor prediction accuracy and reliability, a time series prediction model based on multimodal information fusion, namely Skip-Fusion, was proposed to fuse the text data and numerical data in multimodal data. Firstly, different types of text data were encoded by a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and one-hot encoding. Then, a single vector representation fusing multiple text features was obtained by using a pre-trained model based on a global attention mechanism. After that, the obtained single vector representation was aligned with the numerical data in time order. Finally, the fusion of text and numerical features was realized through a Temporal Convolutional Network (TCN) model, and the shallow and deep features of the multimodal data were fused again through skip connections. In experiments on a stock price series dataset, the Skip-Fusion model obtained results of 0.492 and 0.930 on Root Mean Square Error (RMSE) and daily Return (R) respectively, better than the results of existing single-modal and multimodal fusion models. Experimental results show that the Skip-Fusion model obtains a goodness of fit of 0.955 on R-squared, indicating that it can effectively carry out multimodal information fusion and achieve high prediction accuracy and reliability.
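The following is a minimal sketch of the described fusion pipeline, assuming per-day text vectors (e.g. BERT outputs) already aligned with the numerical features; the dilated temporal convolutions and the shallow-to-deep skip connection follow the abstract's description, while dimensions and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class SkipFusionSketch(nn.Module):
    """Hypothetical sketch: a pre-computed per-day text vector is concatenated
    with numerical features and passed through dilated temporal convolutions;
    a skip connection re-fuses shallow and deep features before prediction."""
    def __init__(self, text_dim=768, num_dim=5, hidden=64):
        super().__init__()
        in_dim = text_dim + num_dim
        self.shallow = nn.Conv1d(in_dim, hidden, 3, padding=2, dilation=2)
        self.deep = nn.Conv1d(hidden, hidden, 3, padding=4, dilation=4)
        self.out = nn.Linear(2 * hidden, 1)             # next-step value prediction

    def forward(self, text_vecs, numeric):              # (N, T, text_dim), (N, T, num_dim)
        x = torch.cat([text_vecs, numeric], dim=-1).transpose(1, 2)
        s = torch.relu(self.shallow(x))                 # shallow temporal features
        d = torch.relu(self.deep(s))                    # deeper, wider receptive field
        fused = torch.cat([s[:, :, -1], d[:, :, -1]], dim=1)  # skip connection
        return self.out(fused).squeeze(-1)

pred = SkipFusionSketch()(torch.randn(4, 30, 768), torch.randn(4, 30, 5))
```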
Traditional stock prediction methods are mostly based on time-series models, which ignore the complex relations among stocks; these relations often go beyond pairwise connections, such as stocks in the same industry or multiple stocks held by the same fund. To solve this problem, a stock trend prediction method based on a temporal HyperGraph Convolutional neural Network (HGCN) was proposed, and a hypergraph model based on financial investment facts was constructed to fit the multiple relations among stocks. The model was composed of two major components: a Gated Recurrent Unit (GRU) network and an HGCN. The GRU network performed time-series modeling on historical data to capture long-term dependencies, while the HGCN modeled high-order relations among stocks to learn their intrinsic relation attributes, introducing the multiple relation information among stocks into traditional time-series modeling for end-to-end trend prediction. Experiments on a real dataset of the China A-share market show that, compared with existing stock prediction methods, the proposed model improves prediction performance: for example, compared with the GRU network, it achieves relative increases in ACC and F1_score of 9.74% and 8.13% respectively, and is more stable. In addition, simulation back-testing results show that the trading strategy based on the proposed model is more profitable, with an annual return of 11.30%, which is 5 percentage points higher than that of the Long Short-Term Memory (LSTM) network.
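A minimal sketch of one hypergraph convolution step is shown below, using the standard normalized propagation rule X' = sigma(D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X Theta) with unit edge weights; the stock count, feature size, and random incidence matrix are illustrative assumptions.

```python
import torch
import torch.nn as nn

def hypergraph_conv(X, H, theta):
    """One hypergraph convolution step. X: (n_stocks, d) node features;
    H: (n_stocks, n_edges) incidence matrix where a hyperedge groups stocks
    sharing an industry or a common fund holder; theta: a learnable nn.Linear."""
    Dv = H.sum(1).clamp(min=1)                       # node degrees
    De = H.sum(0).clamp(min=1)                       # hyperedge degrees
    Dv_inv_sqrt = Dv.pow(-0.5).diag()
    De_inv = De.pow(-1.0).diag()
    A = Dv_inv_sqrt @ H @ De_inv @ H.t() @ Dv_inv_sqrt   # normalized propagation
    return torch.relu(A @ theta(X))

# usage: features from a GRU over each stock's history, then relational smoothing
X = torch.randn(100, 32)                             # 100 stocks, 32-dim temporal features
H = (torch.rand(100, 12) > 0.8).float()              # 12 hypothetical relations
out = hypergraph_conv(X, H, nn.Linear(32, 32, bias=False))
```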
Multi-Label Text Classification (MLTC) is one of the important subtasks in the field of Natural Language Processing (NLP). To address the complex correlations among multiple labels, an MLTC method, TLA-BERT, was proposed by incorporating Bidirectional Encoder Representations from Transformers (BERT) and label semantic attention. Firstly, the contextual vector representation of the input text was learned by fine-tuning the autoencoding pre-trained model. Secondly, the labels were encoded individually by using a Long Short-Term Memory (LSTM) network. Finally, the contribution of the text to each label was explicitly highlighted by an attention mechanism in order to predict multi-label sequences. Experimental results show that, compared with the Sequence Generation Model (SGM) algorithm, the proposed method improves the F value by 2.8 and 1.5 percentage points on the Arxiv Academic Paper Dataset (AAPD) and the Reuters Corpus Volume I (RCV1)-v2 public dataset respectively.
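A minimal sketch of the label semantic attention step is given below, assuming token states from a fine-tuned BERT encoder and label states from an LSTM over label words; the scoring and classification heads are simplified assumptions.

```python
import torch

def label_attention(token_states, label_states):
    """Hypothetical sketch of label semantic attention: each label attends
    over the text tokens, yielding a label-specific text representation."""
    # token_states: (N, T, d) from a fine-tuned BERT encoder
    # label_states: (L, d), e.g. last hidden states of an LSTM over label words
    scores = torch.einsum('ntd,ld->nlt', token_states, label_states)
    alpha = scores.softmax(dim=-1)                   # attention over tokens, per label
    return torch.einsum('nlt,ntd->nld', alpha, token_states)

token_states = torch.randn(2, 50, 768)               # batch of 2 texts, 50 tokens each
label_states = torch.randn(10, 768)                  # 10 candidate labels
label_repr = label_attention(token_states, label_states)   # (2, 10, 768)
logits = (label_repr * label_states).sum(-1)          # per-label relevance score
probs = torch.sigmoid(logits)                         # independent multi-label outputs
```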
Concerning the characteristics of breast cancer in Magnetic Resonance Imaging (MRI), such as variable shapes and sizes and fuzzy boundaries, an algorithm based on a multiscale residual U Network (UNet) with attention mechanism was proposed in order to avoid mis-segmentation and improve segmentation accuracy. Firstly, multiscale residual units were used to replace two adjacent convolution blocks in the down-sampling process of UNet, so that the network could pay more attention to differences of shape and size. Then, in the up-sampling stage, cross-layer attention was used to guide the network to focus on key regions, avoiding mis-segmentation of healthy tissues. Finally, in order to enhance the ability to represent lesions, atrous spatial pyramid pooling was introduced into the network as a bridging module. Compared with UNet, the proposed algorithm improved the Dice coefficient, Intersection over Union (IoU), SPecificity (SP) and ACCuracy (ACC) by 2.26, 2.11, 4.16 and 0.05 percentage points, respectively. The experimental results show that the algorithm can improve the segmentation accuracy of lesions and effectively reduce the false positive rate of imaging diagnosis.
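A minimal sketch of the atrous spatial pyramid pooling bridge is shown below; the dilation rates and channel sizes are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Minimal atrous spatial pyramid pooling, used here as a bridging module
    between the UNet encoder and decoder (rates and widths are assumptions)."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.project = nn.Conv2d(len(rates) * out_ch, out_ch, 1)

    def forward(self, x):
        # parallel dilated convolutions capture lesions of different sizes
        feats = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return self.project(feats)

bridge = ASPPSketch(256, 64)
y = bridge(torch.randn(1, 256, 32, 32))              # (1, 64, 32, 32)
```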
To address the problems of insufficient interpretability and long-sequence dependency in deep knowledge tracing models based on Recurrent Neural Network (RNN), a model named Temporal Convolutional Knowledge Tracing with Attention mechanism (ATCKT) was proposed. Firstly, embedded representations of students' historical interactions were learned in the training process. Then, an exercise-based attention mechanism was used to learn a specific weight matrix to identify and strengthen the influence of students' historical interactions on the knowledge state at each moment. Finally, the student knowledge states were extracted by a Temporal Convolutional Network (TCN), in which dilated convolutions and a deep network were used to expand the scope of sequence learning and alleviate the long-sequence dependency problem. Experimental results show that, compared with four models such as Deep Knowledge Tracing (DKT) and Convolutional Knowledge Tracing (CKT) on four datasets (ASSISTments2009, ASSISTments2015, Statics2011 and Synthetic-5), the ATCKT model has significantly improved Area Under the Curve (AUC) and Accuracy (ACC), especially on the ASSISTments2015 dataset, with increases of 6.83 to 20.14 percentage points and 7.52 to 11.22 percentage points respectively; at the same time, the training time of the proposed model is 26% less than that of the DKT model. In summary, this model can accurately capture student knowledge states and efficiently predict students' future performance.
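To show how dilated convolutions extend the scope of sequence learning in a TCN, here is a minimal causal block sketch; the channel width, kernel size, and dilation schedule are assumptions.

```python
import torch
import torch.nn as nn

class CausalBlockSketch(nn.Module):
    """One causal dilated convolution block; stacking blocks with dilations
    1, 2, 4, ... lets the receptive field cover long interaction histories."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.pad = (3 - 1) * dilation               # left-pad only: no future leakage
        self.conv = nn.Conv1d(ch, ch, 3, dilation=dilation)

    def forward(self, x):                           # x: (N, ch, T) weighted interactions
        return torch.relu(self.conv(nn.functional.pad(x, (self.pad, 0))))

tcn = nn.Sequential(*[CausalBlockSketch(64, d) for d in (1, 2, 4, 8)])
states = tcn(torch.randn(8, 64, 200))               # knowledge state per time step
```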
In recent years, deep learning has been widely used in many fields. However, due to the highly nonlinear operations in deep neural network models, their interpretability is poor; they are often referred to as "black box" models and cannot be applied in some key fields with high performance requirements. Therefore, it is very necessary to study the interpretability of deep learning. Firstly, deep learning was introduced briefly. Then, around the interpretability of deep learning, the existing research work was analyzed from eight aspects, including hidden layer visualization, Class Activation Mapping (CAM), sensitivity analysis, frequency principle, robust disturbance testing, information theory, interpretable modules and optimization methods. At the same time, the applications of deep learning in the fields of network security, recommender systems, medicine and social networks were demonstrated. Finally, the existing problems and future development directions of research on deep learning interpretability were discussed.
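Of the eight aspects listed, Class Activation Mapping (CAM) is the easiest to make concrete; the sketch below computes a classic CAM for a CNN that ends in global average pooling followed by a single linear layer, with feature and class dimensions as illustrative assumptions.

```python
import torch

def class_activation_map(feature_maps, fc_weight, class_idx):
    """Classic CAM: for a CNN ending in global average pooling and one linear
    layer, the map for a class is the weighted sum of the last conv features."""
    # feature_maps: (C, H, W) from the last conv layer; fc_weight: (n_classes, C)
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], feature_maps)
    cam = torch.relu(cam)                            # keep positive evidence only
    return cam / (cam.max() + 1e-8)                  # normalize to [0, 1] for display

fmaps = torch.randn(512, 7, 7)                       # hypothetical conv output
w = torch.randn(1000, 512)                           # hypothetical classifier weights
heatmap = class_activation_map(fmaps, w, class_idx=281)
```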
In the field of deep learning, a large number of correctly labeled samples are essential for model training. However, in practical applications, labeling data incurs a high cost, and the quality of labeled samples is affected by subjective factors and by the tools and techniques of manual labeling, which inevitably introduces label noise into the annotation process. Therefore, the training data available in practical applications is subject to a certain amount of label noise, and how to train effectively on data with label noise has become a research hotspot. Focusing on label noise learning algorithms based on deep learning, firstly, the sources, types and impact of label noise were elaborated; secondly, four categories of label noise learning strategies, based on data, loss function, model and training method respectively, were analyzed according to the different elements of machine learning; then, a basic framework for learning with label noise in various application scenarios was provided; finally, some optimization ideas were given, and the challenges and future development directions of label noise learning algorithms were discussed.
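As one example of the loss-function-based strategies mentioned above, the following sketch implements generalized cross entropy (Zhang and Sabuncu, 2018), a well-known noise-robust loss; its use here is illustrative, not a method proposed in the surveyed work.

```python
import torch

def generalized_cross_entropy(logits, targets, q=0.7):
    """Generalized cross entropy: interpolates between standard CE (q -> 0)
    and the noise-tolerant MAE (q = 1), down-weighting hard samples that are
    likely to be mislabeled."""
    probs = logits.softmax(dim=1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # prob of labeled class
    return ((1.0 - p_y.pow(q)) / q).mean()

loss = generalized_cross_entropy(torch.randn(16, 10), torch.randint(0, 10, (16,)))
```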
Single object tracking is an important research direction in the field of computer vision and has a wide range of applications in video surveillance, autonomous driving and other fields. Although single object tracking algorithms have been surveyed extensively, most surveys focus on correlation filter-based or deep learning-based methods. In recent years, Siamese network-based tracking algorithms have received extensive attention from researchers for their balance between accuracy and speed, but there are relatively few surveys of this type of algorithm, and they lack systematic analysis at the architectural level. In order to deeply understand single object tracking algorithms based on Siamese networks, a large number of related literatures were organized and analyzed. Firstly, the structures and applications of the Siamese network were expounded, and each tracking algorithm was introduced according to the composition of its Siamese tracking architecture. Then, the commonly used datasets and evaluation metrics in the field of single object tracking were listed, the overall and per-attribute performance of 25 mainstream tracking algorithms was compared and analyzed on the OTB 2015 (Object Tracking Benchmark) dataset, and the performance and inference speed of 23 Siamese network-based tracking algorithms on the LaSOT (Large-scale Single Object Tracking) and GOT-10K (Generic Object Tracking) test sets were listed. Finally, the research on Siamese network-based tracking algorithms was summarized, and possible future research directions of this type of algorithm were discussed.
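The common core of the surveyed Siamese trackers is a cross-correlation between template and search-region features, as in SiamFC; the sketch below shows this operation with illustrative feature shapes.

```python
import torch
import torch.nn.functional as F

def siamese_response(template_feat, search_feat):
    """Core of SiamFC-style trackers: the template feature acts as a
    convolution kernel slid over the search-region feature; the peak of the
    response map locates the target."""
    # template_feat: (1, C, 6, 6), search_feat: (1, C, 22, 22) -- example sizes
    return F.conv2d(search_feat, template_feat)      # (1, 1, 17, 17) response map

z = torch.randn(1, 256, 6, 6)                        # embedded target template
x = torch.randn(1, 256, 22, 22)                      # embedded search region
response = siamese_response(z, x)
peak = response.flatten().argmax()                   # estimated target location
```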
Event extraction is the task of extracting events that users are interested in from unstructured information and presenting them to users in a structured form. It has a wide range of applications in information collection, information retrieval, document synthesis, and question answering. From an overall perspective, event extraction algorithms can be divided into four categories: pattern matching algorithms, trigger-word-based methods, ontology-based algorithms, and cutting-edge joint model methods. In the research process, different evaluation methods and datasets can be used according to the related needs, and different event representation methods are also relevant to event extraction research. Distinguished by task type, meta-event extraction and subject event extraction are the two basic tasks of event extraction. Among them, meta-event extraction has three kinds of methods, based on pattern matching, machine learning and neural networks respectively, while there are two ways to extract subject events: based on the event framework and based on ontology. Event extraction research has achieved excellent results in single languages such as Chinese and English, but cross-language event extraction still faces many problems. Finally, the related works of event extraction were summarized and future research directions were discussed in order to provide guidance for subsequent research.
To address the shortcomings of the Sparrow Search Algorithm (SSA), namely easily falling into local optima and slow convergence, a Sparrow Search Algorithm based on Sobol sequence and Crisscross strategy (SSASC) was proposed. Firstly, the Sobol sequence was introduced in the initialization stage to enhance the diversity and ergodicity of the population. Secondly, a nonlinear inertia weight in exponential form was proposed to improve the convergence efficiency of the algorithm. Finally, the crisscross strategy was applied to improve the algorithm: horizontal crossover was used to enhance the global search ability, while vertical crossover was used to maintain the diversity of the population and prevent the algorithm from falling into local optima. Thirteen benchmark functions were selected for simulation experiments, and the performance of the algorithm was evaluated by the Wilcoxon rank-sum test and the Friedman test. In comparison experiments with other metaheuristic algorithms, the mean and standard deviation obtained by SSASC are consistently better than those of the other algorithms as the benchmark functions are extended from 10 to 100 dimensions. Experimental results show that SSASC achieves superiority in both convergence speed and solution accuracy.
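A minimal sketch of the two ingredients named in the title is given below, using SciPy's Sobol generator for initialization and an arithmetic horizontal crossover; the crossover coefficients and population size are assumptions.

```python
import numpy as np
from scipy.stats import qmc

def sobol_init(pop_size, dim, lb, ub):
    """Sobol-sequence initialization: low-discrepancy points cover the search
    space more evenly than uniform random sampling."""
    sample = qmc.Sobol(d=dim, scramble=True).random(pop_size)   # in [0, 1)^dim
    return lb + sample * (ub - lb)

def horizontal_crossover(pop):
    """Crisscross strategy, horizontal part: arithmetic crossover between
    randomly paired individuals to strengthen global search."""
    idx = np.random.permutation(len(pop))
    out = pop.copy()
    for i, j in zip(idx[0::2], idx[1::2]):
        r1, r2 = np.random.rand(pop.shape[1]), np.random.rand(pop.shape[1])
        c = np.random.uniform(-1, 1, pop.shape[1])
        out[i] = r1 * pop[i] + (1 - r1) * pop[j] + c * (pop[i] - pop[j])
        out[j] = r2 * pop[j] + (1 - r2) * pop[i] + c * (pop[j] - pop[i])
    return out

pop = sobol_init(32, 10, lb=-100.0, ub=100.0)        # 32 sparrows, 10 dimensions
pop = horizontal_crossover(pop)                      # candidates then compete with parents
```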
In view of the problems that classroom teaching scenes suffer from severe occlusion and contain many students, that current video action recognition algorithms are not suitable for classroom teaching scenes, and that there is no public dataset of student classroom actions, a classroom teaching video library and a student classroom action library were constructed, and a real-time multi-student classroom action recognition algorithm based on a deep spatiotemporal residual convolutional neural network was proposed. Firstly, real-time object detection and tracking were combined to obtain a real-time image stream of each student; then the deep spatiotemporal residual convolutional neural network was used to learn the spatiotemporal features of each student's actions, so as to realize real-time recognition of classroom actions for multiple students in classroom teaching scenes. In addition, an intelligent teaching evaluation model was constructed, and an intelligent teaching evaluation system based on the recognition of students' classroom actions was designed and implemented, which can help improve teaching quality and realize intelligent education. Experimental comparison and analysis on the classroom teaching video dataset verify that the proposed real-time classroom action recognition model for multiple students achieves a high accuracy of 88.5%, and the intelligent teaching evaluation system based on classroom action recognition also achieves good results on the classroom teaching video dataset.
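A hedged sketch of the recognition stage is given below, assuming detection and tracking have already produced a cropped frame clip per student; torchvision's r3d_18 stands in for the paper's deep spatiotemporal residual network, and the class count is an assumption.

```python
import torch
from torchvision.models.video import r3d_18

# Hypothetical sketch: after detection and tracking have produced a short clip
# of cropped frames for one student, a 3D residual network classifies the clip
# into a classroom action (the 6 action classes are an assumption).
model = r3d_18(weights=None, num_classes=6)
model.eval()

clip = torch.randn(1, 3, 16, 112, 112)               # (N, C, T, H, W) student crop
with torch.no_grad():
    action = model(clip).argmax(dim=1)                # predicted action index
```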