Knowledge Tracing (KT) is a fundamental and challenging task in online education: it involves building a model of a learner's knowledge state from the learning history, by which learners can better understand their own knowledge states and teachers can better understand learners' learning situations. KT research for learners in online education was surveyed. Firstly, the main tasks and historical progress of KT were introduced. Subsequently, traditional KT models and deep learning-based KT models were explained. Furthermore, relevant datasets and evaluation metrics were summarized, alongside a compilation of KT applications. In conclusion, the current status of KT was summarized, and the limitations and future prospects of KT were discussed.
Concerning the inability to make full use of existing business resources in the current software project development process, which leads to low development efficiency and weak development capability, a cognitive graph based on the software development process was proposed by studying the interrelations among business resources. First, a method for building a knowledge hierarchy by extracting business knowledge from formal documents was developed and refined. Second, a network representation model for software code was constructed through code feature extraction and code entity similarity analysis. Finally, the model was tested on real business data and compared with three other methods: the Vector Space Model (VSM), a diverse ranking method, and a deep learning method. Experimental results show that the proposed cognitive graph method based on the business process is superior to current text matching and deep learning algorithms in code retrieval: it improves precision@5, mean Average Precision (mAP) and α-NDCG (α-Normalized Discounted Cumulative Gain) by 4.30, 0.38 and 2.74 percentage points respectively compared with an effective ranking-based code search method, solving problems such as potential business vocabulary identification and business cognitive reasoning representation, and improving code retrieval effectiveness and business resource utilization.
In view of the rapid development of Large Language Model (LLM) technology, a comprehensive analysis of its application prospects and risk challenges was conducted, which has great reference value for the development and governance of Artificial General Intelligence (AGI). Firstly, taking representative language models such as Multi-BERT (Multilingual Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer) and ChatGPT (Chat Generative Pre-trained Transformer) as examples, the development process, key technologies and evaluation systems of LLMs were reviewed. Then, the technical limitations and security risks of LLMs were analyzed in detail. Finally, suggestions were put forward for technical improvement and policy follow-up of LLMs. The analysis indicates that current LLMs, still at a developing stage, produce non-truthful and biased output, lack real-time autonomous learning ability, require huge computing power, rely heavily on data quality and quantity, and tend towards a monotonous language style. They carry security risks related to data privacy, information security, ethics, and other aspects. Their future development can continue to improve technically, from “large-scale” to “lightweight”, from “single-modal” to “multi-modal”, and from “general-purpose” to “vertical”; for real-time policy follow-up, their applications and development should be regulated by targeted regulatory measures.
Observation Point Classifier (OPC) is a supervised learning model that transforms a multi-dimensional linearly inseparable problem in the original data space into a one-dimensional linearly separable problem in a projective distance space, and it is good at high-dimensional data classification. To alleviate the high training complexity when applying OPC to big data classification, a Random Sample Partition (RSP)-based Distributed OPC (DOPC) for big data was designed under the Spark framework. First, RSP data blocks were generated and transformed into Resilient Distributed Datasets (RDDs) in the distributed computing environment. Second, a set of OPCs was trained collaboratively on the RSP data blocks with high Spark parallelism. Finally, the different OPCs were fused into a DOPC to predict the final label of an unknown sample. Extensive experiments on eight big datasets were conducted to validate the feasibility, rationality and effectiveness of the designed DOPC. Experimental results show that DOPC trained on multiple computing nodes achieves higher testing accuracy with less time consumption than OPC trained on a single computing node; meanwhile, compared with the RSP-based Neural Network (NN), Decision Tree (DT), Naive Bayesian (NB), and K-Nearest Neighbor (KNN) classifiers under the Spark framework, DOPC obtains stronger generalization capability. The superior testing performance demonstrates that DOPC is a highly effective and low-consumption supervised learning algorithm for big data classification problems.
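The partition-train-fuse pipeline described above can be illustrated, outside the Spark framework, by a minimal plain-Python sketch in which base learners trained on disjoint random partitions are fused by majority vote (the helper names and the toy majority-class base learner are illustrative, not the actual OPC):

```python
import random
from collections import Counter

def random_sample_partition(data, n_blocks, seed=0):
    """Shuffle the dataset and split it into n_blocks disjoint RSP blocks."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]

def train_base(block):
    """Toy stand-in for one OPC: predict the majority label of its block."""
    majority = Counter(label for _, label in block).most_common(1)[0][0]
    return lambda x: majority

def majority_vote(predictions):
    """Fuse base-classifier predictions for one sample by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

data = [(i, i % 2) for i in range(100)]            # (feature, label) pairs
blocks = random_sample_partition(data, n_blocks=4)  # RSP data blocks
bases = [train_base(b) for b in blocks]             # one base learner per block
fused = majority_vote([clf(42) for clf in bases])   # DOPC-style fusion
```

In the real system each block would be one RDD partition and the base learners would be trained in parallel; the fusion step is the same.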
Aiming at the problem that most current Named Entity Recognition (NER) models use only character-level information encoding and lack text hierarchical information extraction, a Chinese NER (CNER) model incorporating Multi-granularity linguistic knowledge and Hierarchical information (CMH) was proposed. First, the text was encoded using a model pre-trained with multi-granularity linguistic knowledge, so that the model could capture both fine-grained and coarse-grained linguistic information of the text and thus better characterize the corpus. Second, hierarchical information was extracted using the ON-LSTM (Ordered Neurons Long Short-Term Memory network) model, in order to utilize the hierarchical structural information of the text itself and enhance the temporal relationships between encodings. Finally, at the decoding end of the model, incorporating the word segmentation information of the text, the entity recognition problem was transformed into a table filling problem, so as to better solve the entity overlapping problem and obtain more accurate entity recognition results. Meanwhile, to address the poor transferability of current models across domains, the concept of universal entity recognition was proposed, and a universal NER dataset, MDNER (Multi-Domain NER dataset), was constructed by filtering the universal entity types in multiple domains to enhance the generalization ability of the model across domains. To validate the effectiveness of the proposed model, experiments were conducted on the Resume, Weibo, and MSRA datasets, where the F1 scores were improved by 0.94, 4.95 and 1.58 percentage points respectively compared to the MECT (Multi-metadata Embedding based Cross-Transformer) model. To verify the proposed model's entity recognition effect in multiple domains, experiments were conducted on MDNER, and the F1 score reached 95.29%.
The experimental results show that the pre-training with multi-granularity linguistic knowledge, the extraction of the text's hierarchical structural information, and the efficient pointer decoder are crucial to the performance improvement of the model.
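The table-filling formulation mentioned above can be sketched as follows: an n×n table whose cell (i, j) holds the entity type of the span tokens[i..j], so overlapping entities occupy distinct cells and never collide (the function names and the toy example are illustrative):

```python
def spans_to_table(tokens, entities):
    """Fill an n x n table: cell (i, j) holds the type of span tokens[i..j].
    Overlapping entities land in different cells, so no labels collide."""
    n = len(tokens)
    table = [[None] * n for _ in range(n)]
    for start, end, etype in entities:   # end index is inclusive
        table[start][end] = etype
    return table

def table_to_spans(table):
    """Decode every labelled cell back into an entity span."""
    return [(i, j, t) for i, row in enumerate(table)
            for j, t in enumerate(row) if t is not None]

tokens = ["北", "京", "大", "学"]
# "北京" (LOC) overlaps with "北京大学" (ORG) -- both fit in the table.
entities = [(0, 1, "LOC"), (0, 3, "ORG")]
table = spans_to_table(tokens, entities)
```

A model then only has to classify each table cell; decoding back to spans is a straightforward scan of the labelled cells.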
To tackle the difficulty of mining the semantics of entity relations and the bias in relation prediction in Relation Extraction (RE) tasks, an RE method based on Mask prompts and Gated Memory Network Calibration (MGMNC) was proposed. First, the latent semantics between entities in the Pre-trained Language Model (PLM) semantic space were learned through masks in prompts. By constructing a mask attention weight matrix, the discrete masked semantic spaces were interconnected. Then, gated calibration networks were used to integrate the masked representations containing entity and relation semantics into the global semantics of the sentence. These calibrated representations then served as prompts to adjust the relation information, and the final calibrated sentence representation was mapped to the corresponding relation class. In this way, the potential of the PLM was fully exploited by harnessing masks in prompts while retaining the advantages of traditional fine-tuning methods. The experimental results highlight the effectiveness of the proposed method. On the SemEval (SemEval-2010 Task 8) dataset, the F1 score reached an impressive 91.4%, outperforming the RELA (Relation Extraction with Label Augmentation) generative method by 1.0 percentage point. The F1 scores on the SciERC (Entities, Relations, and Coreference for Scientific knowledge graph construction) and CLTC (Chinese Literature Text Corpus) datasets reached 91.0% and 82.8% respectively. The proposed method consistently outperformed all comparative methods on the three datasets, and achieved better extraction performance than generative methods.
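The general idea of using masks in prompts for RE can be sketched as a cloze-style template plus a verbalizer that maps the word predicted at the mask position to a relation class (the template wording and the tiny verbalizer below are illustrative assumptions; the actual PLM forward pass and the gated calibration are omitted):

```python
def build_mask_prompt(sentence, head, tail, mask_token="[MASK]"):
    """Append a cloze-style template so a PLM can fill the relation slot."""
    return f"{sentence} In this sentence, {head} is the {mask_token} of {tail}."

# Toy verbalizer: maps a predicted mask word to a relation class.
VERBALIZER = {"cause": "Cause-Effect", "part": "Component-Whole"}

def decode_relation(predicted_word):
    """Map the PLM's mask prediction to a relation label, else 'Other'."""
    return VERBALIZER.get(predicted_word, "Other")

prompt = build_mask_prompt("The burst was caused by the pressure.",
                           "burst", "pressure")
```

In the full method the masked representation is not decoded directly like this but is first calibrated against the sentence's global semantics by the gated network.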
Aiming at the poor extraction of words that appear infrequently but better express the theme of the text in keyword extraction from scientific text, a keyword extraction method based on improved TextRank was proposed. Firstly, the Term Frequency-Inverse Document Frequency (TF-IDF) statistical features and positional features of words were used to optimize the probability transfer matrix between words in the co-occurrence graph, and the initial scores of the words were obtained through iterative computation. Then, the K-Core decomposition algorithm was used to mine K-Core subgraphs to obtain the hierarchical features of the words, and the average information entropy feature was used to measure the thematic representation ability of the words. Finally, on the basis of the initial word scores, the hierarchical feature and the average information entropy feature were fused to determine the keywords. Experimental results show that, on the public dataset, compared with the TextRank method and the OTextRank (Optimized TextRank) method, the proposed method increases the average F1 score by 6.5 and 3.3 percentage points respectively when extracting different numbers of keywords; on the science and technology service project dataset, compared with the TextRank and OTextRank methods, the proposed method increases the average F1 score by 7.4 and 3.2 percentage points respectively. These results verify the effectiveness of the proposed method for extracting low-frequency keywords that better express the theme of the text.
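The first step, biasing TextRank's probability transfer matrix with per-word features such as TF-IDF and position, can be sketched as a weighted power iteration over the co-occurrence graph (the matrix construction and the toy three-word graph are illustrative):

```python
import numpy as np

def weighted_textrank(adj, weights, d=0.85, iters=50):
    """TextRank whose transition probabilities are biased by per-word
    weights (e.g. TF-IDF x position) instead of being uniform per edge."""
    n = len(weights)
    # Bias each edge (i -> j) by the feature weight of the target word j.
    M = adj * weights[np.newaxis, :]
    row_sums = M.sum(axis=1, keepdims=True)
    # Row-normalise into transition probabilities (dangling rows -> uniform).
    M = np.divide(M, row_sums, out=np.full_like(M, 1.0 / n),
                  where=row_sums > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):  # damped power iteration, as in PageRank
        scores = (1 - d) / n + d * (M.T @ scores)
    return scores

# Fully connected 3-word co-occurrence graph; word 2 gets a higher
# TF-IDF/position weight, so random walks are pulled towards it.
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
weights = np.array([1.0, 1.0, 3.0])
scores = weighted_textrank(adj, weights)
```

With equal weights this reduces to plain TextRank; the feature weights are what let a low-frequency but thematically important word accumulate score.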
Existing Data Quality Assessment (DQA) methods often only analyze the basic concept of a specific Data Quality Dimension (DQD), ignoring the influence on assessment results of fine-grained sub-dimensions that reflect key information of Data Quality (DQ). To address this problem, an Industrial Multivariate Time Series Data Quality Assessment (IMTSDQA) method was proposed. Firstly, the DQDs to be evaluated, such as completeness, normativeness, consistency, uniqueness, and accuracy, were divided at a fine granularity, and the correlations of sub-dimensions within the same DQD or between different DQDs were considered to determine the measurements of these sub-dimensions. Secondly, the sub-dimensions of attribute completeness, record completeness, numerical completeness, type normativeness, precision normativeness, sequential consistency, logical consistency, attribute uniqueness, record uniqueness, range accuracy, and numerical accuracy were weighted to fully mine the deep-level information of DQDs, so as to obtain assessment results reflecting the details of DQ. Experimental results show that, compared with existing approaches based on qualitative analysis of frameworks and on models constructed from the basic definitions of DQDs, the proposed method can assess DQ more effectively and comprehensively, and its assessment results for different DQDs reflect DQ problems more objectively and accurately.
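The weighting step above amounts to aggregating fine-grained sub-dimension scores into one DQD score; a minimal sketch, with made-up sub-dimension weights and scores for illustration:

```python
def score_dimension(sub_scores, sub_weights):
    """Aggregate fine-grained sub-dimension scores into one DQD score
    as a weighted sum; weights must sum to 1."""
    assert abs(sum(sub_weights.values()) - 1.0) < 1e-9
    return sum(sub_scores[k] * sub_weights[k] for k in sub_scores)

# Illustrative sub-dimensions of the completeness DQD (values are made up).
completeness_scores = {"attribute": 0.98, "record": 0.95, "numerical": 0.90}
completeness_weights = {"attribute": 0.3, "record": 0.3, "numerical": 0.4}
dqd = score_dimension(completeness_scores, completeness_weights)
# -> 0.98*0.3 + 0.95*0.3 + 0.90*0.4 = 0.939
```

The same aggregation can be applied once more across DQDs to obtain an overall DQ score, with the sub-dimension breakdown preserved for diagnosis.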
Aiming at the problem that existing deep clustering methods cannot effectively divide event types without considering event information and its structural characteristics, a Deep Event Clustering method based on Event Representation and Contrastive Learning (DEC_ERCL) was proposed. Firstly, information recognition was utilized to identify structured event information from unstructured text, thereby avoiding the impact of redundant information on event semantics. Secondly, the structural information of events was integrated into an autoencoder to learn low-dimensional dense event representations, which served as the basis for downstream clustering. Finally, in order to effectively model the subtle differences between events, a contrastive loss with multiple positive examples was added to the feature learning process. Experimental results on the DuEE, FewFC, Military and ACE2005 datasets show that the proposed method performs better than other deep clustering methods in terms of accuracy and Normalized Mutual Information (NMI). Compared with the suboptimal method, the accuracy of DEC_ERCL is increased by 17.85%, 9.26%, 7.36% and 33.54% respectively, indicating that DEC_ERCL has a better event clustering effect.
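A contrastive loss with multiple positives can be sketched in NumPy as a supervised-contrastive-style objective in which every sample sharing a cluster label with the anchor counts as a positive (a minimal sketch; the temperature value and toy data are illustrative, and the paper's exact loss may differ):

```python
import numpy as np

def multi_positive_contrastive_loss(z, labels, tau=0.5):
    """Contrastive loss where all same-cluster samples are positives,
    not just one augmented view of the anchor."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalise
    sim = z @ z.T / tau                                # temperature-scaled
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    # Row-wise log-softmax over all other samples.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    losses = []
    for i in range(len(z)):
        pos = [j for j in range(len(z)) if j != i and labels[j] == labels[i]]
        if pos:  # average the log-probability over all positives of anchor i
            losses.append(-log_prob[i, pos].mean())
    return float(np.mean(losses))

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))          # toy event representations
labels = [0, 0, 0, 1, 1, 1]          # two event clusters
loss = multi_positive_contrastive_loss(z, labels)
```

Minimising this pulls all events of the same type together while pushing different types apart, which is what sharpens the subtle inter-event differences before clustering.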
Multi-view clustering has recently been a hot topic in graph data mining. However, due to the limitations of data collection technology or human factors, multi-view data often suffers from missing views or samples. Reducing the impact of incomplete views on clustering performance is a major challenge currently faced by multi-view clustering. To better understand the development of Incomplete Multi-view Clustering (IMC) in recent years, a comprehensive review is of great theoretical significance and practical value. Firstly, the missing types of incomplete multi-view data were summarized and analyzed. Secondly, four types of IMC methods, based on Multiple Kernel Learning (MKL), Matrix Factorization (MF) learning, deep learning, and graph learning, were compared, and the technical characteristics and differences among these methods were analyzed. Thirdly, twenty-two public incomplete multi-view datasets were summarized from the perspectives of dataset type, numbers of views and categories, and application field. Then, the evaluation metrics were outlined, and the performance of existing IMC methods on homogeneous and heterogeneous datasets was evaluated. Finally, the existing problems, future research directions, and current application fields of IMC were discussed.
Federated learning is a distributed learning approach for solving the data sharing and privacy protection problems in machine learning, in which multiple parties jointly train a machine learning model while protecting data privacy. However, federated learning carries inherent security threats, which pose great challenges to its practical applications. Therefore, analyzing the attacks faced by federated learning and the corresponding defensive measures is crucial for its development and application. First, the definition, process and classification of federated learning were introduced, along with the attacker model in federated learning. Then, the possible attacks on both the robustness and the privacy of federated learning systems were described, together with the corresponding defense measures, and the shortcomings of these defense schemes were pointed out. Finally, a secure federated learning system was envisioned.
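The joint training process mentioned above is commonly instantiated by federated averaging, where the server aggregates client model parameters weighted by local dataset sizes, without ever seeing the raw data; a minimal sketch (the function name and toy values are illustrative):

```python
def fed_avg(client_weights, client_sizes):
    """One federated-averaging round: the server combines client parameter
    vectors, weighting each client by its local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [sum(w[k] * s for w, s in zip(client_weights, client_sizes)) / total
            for k in range(n_params)]

# Three clients, each with a local 2-parameter model and a dataset size.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [10, 20, 70]
global_model = fed_avg(clients, sizes)  # weighted towards the largest client
```

It is exactly this aggregation step that the surveyed attacks target, e.g. a malicious client submitting poisoned parameters, or the server inferring private data from the updates, motivating the defenses the abstract discusses.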