Chinese Spelling Correction (CSC) is an important research task in Natural Language Processing (NLP). Existing CSC methods based on Large Language Models (LLMs) may generate semantic discrepancies between the corrected results and the original content. Therefore, a multi-input CSC method based on an LLM was proposed. The method consists of two stages: multi-input candidate set construction and LLM correction. In the first stage, a multi-input candidate set was constructed from the correction results of several small models. In the second stage, LoRA (Low-Rank Adaptation) was employed to fine-tune the LLM, so that, with the aid of the reasoning capability of the LLM, sentences without spelling errors were deduced from the multi-input candidate set and used as the final correction results. Experimental results on the public datasets SIGHAN13, SIGHAN14, SIGHAN15, and revised SIGHAN15 show that the proposed method improves the correction F1 value by 9.6, 24.9, 27.9, and 34.2 percentage points respectively compared with Prompt-GEN-1, a method that generates correction results directly with an LLM. Compared with the sub-optimal small correction model, the proposed method improves the correction F1 value by 1.0, 1.1, 0.4, and 2.4 percentage points respectively, verifying that the proposed method enhances performance on CSC tasks.
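The two-stage idea can be sketched as follows. The snippet below is a minimal, illustrative sketch rather than the authors' implementation: candidate corrections from several small CSC models are packed into one prompt, and LoRA adapters are attached to a causal LLM with the peft library. The base model name, prompt wording, and LoRA hyperparameters are assumptions.

```python
# Minimal sketch of multi-input candidate prompting plus a LoRA fine-tuning setup.
# Base model, prompt template, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

def build_prompt(source: str, candidates: list[str]) -> str:
    # Pack the original sentence and the small models' corrections into one input.
    lines = [f"原句: {source}"]
    lines += [f"候选{i + 1}: {c}" for i, c in enumerate(candidates)]
    lines.append("请根据候选结果输出没有拼写错误的句子:")
    return "\n".join(lines)

model_name = "Qwen/Qwen2-1.5B-Instruct"            # assumed base LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA: only the low-rank adapter matrices on the attention projections are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

print(build_prompt("我今天非常高心", ["我今天非常高兴", "我今天非常开心"]))
```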
To address the insufficient extraction of semantic feature information at different scales and the lack of focus on crucial information when Convolutional Neural Network (CNN)-based relation extraction obtains sentence semantic information, a relation extraction model based on a multi-scale hybrid attention CNN was proposed. Firstly, relation extraction was modeled as label prediction with two-dimensional representation. Secondly, finer-grained multi-scale spatial information was obtained by extracting and fusing multi-scale feature information. Thirdly, through the combination of attention and convolution, the feature maps were refined adaptively so that the model concentrates on important contextual information. Finally, two predictors were used jointly to predict the relation labels between entity pairs. Experimental results demonstrate that the multi-scale hybrid convolutional attention model can capture multi-scale semantic feature information, and that the channel attention and spatial attention capture key information in channels and spatial locations by assigning appropriate weights, thereby improving the performance of relation extraction. The proposed model achieves F1 scores of 90.32% on the SemEval (SemEval-2010 task 8) dataset, 70.74% on TACRED (TAC Relation Extraction Dataset), 85.71% on Re-TACRED (Revised-TACRED), and 89.66% on SciERC (Entities, Relations, and Coreference for Scientific knowledge graph construction).
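The hybrid attention step, channel attention followed by spatial attention refining the convolutional feature maps, can be sketched as a generic CBAM-style module. The PyTorch code below is an illustrative sketch rather than the paper's exact architecture; the layer sizes and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Channel attention followed by spatial attention over a conv feature map.
    A generic CBAM-style sketch, not the paper's exact module."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                   # (B, C) average-pooled descriptor
        mx = x.amax(dim=(2, 3))                    # (B, C) max-pooled descriptor
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1)                # channel-refined features
        sa = torch.sigmoid(self.spatial_conv(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa                              # spatially refined features

feats = torch.randn(2, 64, 32, 32)
print(HybridAttention(64)(feats).shape)            # torch.Size([2, 64, 32, 32])
```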
Since existing answer-acquisition methods based on pre-trained language models may predict answer boundaries inaccurately, a boundary-aware approach for span-extraction Machine Reading Comprehension (MRC) was proposed to mitigate this issue. Firstly, special characters were introduced to mark the question boundary during the question input stage, enhancing the semantic information of the question to improve boundary perception. Secondly, during the answer prediction stage, an answer boundary regressor was constructed to facilitate semantic interaction between the perceived question boundary and the output of the predicted answer boundary. Lastly, the biased predicted answer boundary was further adjusted based on the post-interaction semantic information to calibrate the predicted answers. Experimental results demonstrate that, compared with SpanBERT (Span-based Bidirectional Encoder Representation from Transformers), the proposed method improves the F1 value by 0.2 percentage points and the Exact Match (EM) value by 0.9 percentage points on the public dataset SQuAD (Stanford Question Answering Dataset) 1.1, achieves improvements of 0.7 percentage points in both F1 and EM values on the HotpotQA (Hotpot Question Answering) dataset, and improves the F1 value by 2.8 percentage points and the EM value by 3.3 percentage points on the NewsQA (News Question Answering) dataset. The effectiveness of this method is rooted in its capacity to enhance the model's perception of question boundary information and to calibrate the predicted answer boundary. Consequently, it improves system accuracy in applications such as intelligent question answering and intelligent customer service when comprehending and analyzing text data.
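A minimal sketch of the answer boundary regressor idea is given below, assuming the boundary representations come from an encoder such as SpanBERT. The attention-based interaction and offset head are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BoundaryRegressor(nn.Module):
    """Illustrative answer-boundary calibration head: interact the question-boundary
    representation with the predicted answer-boundary representation, then regress
    an offset that adjusts the start/end positions."""
    def __init__(self, hidden: int):
        super().__init__()
        self.interact = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.offset_head = nn.Linear(hidden, 2)          # start offset, end offset

    def forward(self, answer_boundary_repr, question_boundary_repr):
        # answer_boundary_repr: (B, 2, H); question_boundary_repr: (B, Q, H)
        fused, _ = self.interact(answer_boundary_repr, question_boundary_repr,
                                 question_boundary_repr)
        offsets = self.offset_head(fused).mean(dim=1)    # (B, 2) real-valued shifts
        return offsets

regressor = BoundaryRegressor(hidden=768)
ans, ques = torch.randn(4, 2, 768), torch.randn(4, 6, 768)
print(regressor(ans, ques).shape)                        # torch.Size([4, 2])
```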
Aiming at the problem that existing deep clustering methods cannot divide event types efficiently because they do not consider event information and its structural characteristics, a Deep Event Clustering method based on Event Representation and Contrastive Learning (DEC_ERCL) was proposed. Firstly, information recognition was utilized to identify structured event information from unstructured text, thus avoiding the impact of redundant information on event semantics. Secondly, the structural information of the event was integrated into an autoencoder to learn a low-dimensional dense event representation, which was used as the basis for downstream clustering. Finally, in order to effectively model the subtle differences between events, a contrastive loss with multiple positive examples was added to the feature learning process. Experimental results on the datasets DuEE, FewFC, Military, and ACE2005 show that the proposed method performs better than other deep clustering methods in accuracy and Normalized Mutual Information (NMI). Compared with the suboptimal method, the accuracy of DEC_ERCL is increased by 17.85%, 9.26%, 7.36%, and 33.54% respectively, indicating that DEC_ERCL has a better event clustering effect.
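The contrastive loss with multiple positive examples can be sketched in a SupCon-style form, where all events sharing a (pseudo-)label act as positives for an anchor. The exact formulation in DEC_ERCL may differ, and the temperature value below is an assumption.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(z, labels, temperature=0.5):
    """Contrastive loss allowing several positives per anchor (SupCon-style sketch).
    z: (N, d) event embeddings, labels: (N,) cluster / pseudo-label ids."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                        # (N, N) similarities
    n = z.size(0)
    mask_self = torch.eye(n, dtype=torch.bool)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    # Log-probability of each pair, excluding self-similarity from the denominator.
    log_prob = sim - torch.logsumexp(sim.masked_fill(mask_self, float("-inf")),
                                     dim=1, keepdim=True)
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

z = torch.randn(8, 16, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(multi_positive_contrastive_loss(z, labels))
```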
To tackle the difficulty of semantic mining of entity relations and biased relation prediction in Relation Extraction (RE) tasks, an RE method based on Mask prompt and Gated Memory Network Calibration (MGMNC) was proposed. First, the latent semantics between entities within the Pre-trained Language Model (PLM) semantic space were learned through the utilization of masks in prompts, and the discrete masked semantic spaces were interconnected by constructing a mask attention weight matrix. Then, gated calibration networks were used to integrate the masked representations containing entity and relation semantics into the global semantics of the sentence. These calibrated representations then served as prompts to adjust the relation information, and the final representation of the calibrated sentence was mapped to the corresponding relation class. In this way, the potential of the PLM was fully exploited by harnessing masks in prompts while retaining the advantages of traditional fine-tuning. The experimental results highlight the effectiveness of the proposed method. On the SemEval (SemEval-2010 Task 8) dataset, the F1 score reached 91.4%, outperforming the RELA (Relation Extraction with Label Augmentation) generative method by 1.0 percentage point. The F1 scores on the SciERC (Entities, Relations, and Coreference for Scientific knowledge graph construction) and CLTC (Chinese Literature Text Corpus) datasets reached 91.0% and 82.8% respectively. The proposed method outperformed the comparative methods on all three datasets and achieved better extraction performance than the generative methods.
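The gated calibration step, which fuses a masked representation carrying entity and relation semantics into the global sentence representation, can be sketched as a simple learned gate. This is an illustrative sketch; the hidden size and gating form are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedCalibration(nn.Module):
    """Illustrative gated calibration: fuse a [MASK] representation carrying
    entity/relation semantics into the global sentence representation through
    a learned gate."""
    def __init__(self, hidden: int):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, sent_repr, mask_repr):             # both (B, H)
        g = torch.sigmoid(self.gate(torch.cat([sent_repr, mask_repr], dim=-1)))
        # Gate decides how much masked semantics to write into the sentence vector.
        return g * self.proj(mask_repr) + (1 - g) * sent_repr

calib = GatedCalibration(hidden=768)
print(calib(torch.randn(4, 768), torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```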
With the application of artificial intelligence technology in the judicial field, charge prediction based on case descriptions has become an important research topic. It aims at predicting the charges according to the case description. The terminology of case contents is professional, and the descriptions are concise and rigorous. However, existing methods often rely on text features alone, ignore the differences between relevant elements, and make no effective use of the action-word elements in diverse cases. To solve the above problems, a multi-task learning model of charge prediction based on action words was proposed. Firstly, the spans of action words were generated by a boundary identifier, and then the core contents of the case were extracted. Secondly, the subordinate charge was predicted by constructing the structural features of action words. Finally, the identification of action words and charge prediction were modeled uniformly, which enhanced the generalization of the model by sharing parameters. A multi-task dataset with action word identification and charge prediction was constructed for model verification. The experimental results show that the proposed model achieves an F value of 83.27% for the action word identification task and an F value of 84.29% for the charge prediction task; compared with BERT-CNN, the F values increase by 0.57% and 2.61% respectively, which verifies the advantage of the proposed model in action word identification and charge prediction.
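The multi-task design, in which action word identification and charge prediction share parameters, can be sketched as one shared encoder with two heads. The encoder choice, vocabulary size, pooling, and head dimensions below are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class ActionWordMultiTask(nn.Module):
    """Sketch of the multi-task idea: a shared encoder, a boundary head that scores
    each token as an action-word start/end, and a charge classifier over pooled
    features. Training would sum the two task losses so parameters are shared."""
    def __init__(self, vocab_size=21128, hidden=256, num_charges=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
        self.boundary_head = nn.Linear(hidden, 2)        # start / end logits per token
        self.charge_head = nn.Linear(hidden, num_charges)

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))       # (B, L, H) shared representation
        boundary_logits = self.boundary_head(h)          # action-word identification task
        charge_logits = self.charge_head(h.mean(dim=1))  # charge prediction task
        return boundary_logits, charge_logits

model = ActionWordMultiTask()
b, c = model(torch.randint(0, 21128, (2, 50)))
print(b.shape, c.shape)   # torch.Size([2, 50, 2]) torch.Size([2, 100])
```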
Extracting fine-grained scholar information, such as research directions and education experience, from scholar homepages is important in fields such as the construction of large-scale professional talent pools. To address the problem that existing scholar fine-grained information extraction methods cannot effectively use contextual semantic associations, a scholar fine-grained information extraction method incorporating local semantic features was proposed, which uses semantic associations in the local text to extract fine-grained information from scholar homepages. Firstly, general semantic representations were learned by the whole-word-masking Chinese pre-trained model RoBERTa-wwm-ext. Subsequently, the representation vector of the target sentence and the representation vectors of its locally adjacent text, taken from the general semantic embeddings, were jointly fed into a CNN (Convolutional Neural Network) for local semantic fusion, thereby obtaining a higher-dimensional representation vector of the target sentence. Finally, the representation vector of the target sentence was mapped from the high-dimensional space to the low-dimensional label space to extract the fine-grained information from the scholar homepage. Experimental results show that the micro-average F1 score of the proposed method reaches 93.43%, which is 8.60 percentage points higher than that of the RoBERTa-wwm-ext-TextCNN method without local semantic fusion, verifying the effectiveness of the proposed method on the scholar fine-grained information extraction task.
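The local semantic fusion step, which feeds the target sentence vector together with its locally adjacent sentence vectors into a CNN, can be sketched as follows. The window size, hidden size, and label space are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocalSemanticFusion(nn.Module):
    """Illustrative sketch: fuse the target sentence vector with the vectors of its
    locally adjacent sentences through a 1-D convolution, then map the fused vector
    to the label space."""
    def __init__(self, hidden=768, window=3, num_labels=10):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=window)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, neighborhood):                     # (B, window, H): prev, target, next
        fused = torch.relu(self.conv(neighborhood.transpose(1, 2))).squeeze(-1)  # (B, H)
        return self.classifier(fused)

model = LocalSemanticFusion()
print(model(torch.randn(4, 3, 768)).shape)               # torch.Size([4, 10])
```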
Hot news events develop in rich ways, and each stage of their development has its own unique narrative; as events develop, the storyline evolves hierarchically. Aiming at the poor interpretability and insufficient hierarchy of storylines in existing storyline generation methods, a Hierarchical Storyline Generation Method (HSGM) for hot news events was proposed. First, an improved hotword algorithm was used to select the main seed events to construct the trunk. Second, the hotwords of branch events were selected to enhance branch interpretability. Third, within each branch, a storyline coherence selection strategy fusing hotword relevance and a dynamic time penalty was used to strengthen the connection between parent and child events, thereby building hierarchical hotwords and then a multi-level storyline. In addition, considering the incubation period of hot news events, a hatchery was added to the storyline construction process to avoid neglecting initial events due to insufficient hotness. Experimental results on two real self-constructed datasets show that, in the event tracking process, compared with the methods based on singlePass and k-means, HSGM improves the F score by 4.51% and 6.41%, and by 20.71% and 13.01%, respectively; in the storyline construction process, HSGM performs well in accuracy, comprehensibility, and integrity on both self-constructed datasets compared with Story Forest and Story Graph.
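The storyline coherence selection strategy, which fuses hotword relevance with a dynamic time penalty when linking parent and child events, might look like the following sketch. The Jaccard relevance measure and exponential decay are assumptions standing in for the paper's exact formulas.

```python
import math

def coherence_score(parent_hotwords, child_hotwords, parent_time, child_time,
                    decay=0.1):
    """Illustrative coherence score for linking a child event to a parent event:
    hotword-set overlap weighted by an exponential time penalty."""
    overlap = len(set(parent_hotwords) & set(child_hotwords))
    union = len(set(parent_hotwords) | set(child_hotwords)) or 1
    relevance = overlap / union                       # Jaccard hotword relevance
    time_penalty = math.exp(-decay * abs(child_time - parent_time))  # days apart
    return relevance * time_penalty

print(coherence_score(["earthquake", "rescue"], ["rescue", "donation"], 0, 2))
```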
The rapid development of the Internet has led to the explosive growth of news data. How to capture the topic evolution process of current popular events from massive news data has become a hot research topic in the field of document analysis. However, the commonly used traditional dynamic clustering models are inflexible and inefficient when dealing with large-scale datasets, while existing deep document clustering models lack a general method to capture the topic evolution process of time-series data. To address these problems, a Deep Dynamic Document Clustering (DDDC) model was designed. In this model, based on existing deep variational inference algorithms, the topic distribution of each time slice, incorporating the content of previous time slices, was captured, and the evolution process of event topics was obtained by clustering these distributions. Experimental results on real news datasets show that, compared with the Dynamic Topic Model (DTM), Variational Deep Embedding (VaDE), and other algorithms, the DDDC model improves clustering accuracy by at least 4 percentage points on average and Normalized Mutual Information (NMI) by at least 3 percentage points in each time slice on different datasets, verifying the effectiveness of the DDDC model.
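The clustering-over-time idea can be illustrated with a minimal evaluation loop: per-time-slice topic distributions (here random placeholders standing in for the output of the deep variational model) are clustered slice by slice and scored with NMI. Cluster counts and data sizes are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Illustrative per-slice clustering loop; Dirichlet samples stand in for the topic
# distributions that a deep variational topic model would produce for each slice.
rng = np.random.default_rng(0)
for t in range(3):                                     # three time slices
    topic_dist = rng.dirichlet(np.ones(20), size=200)  # (docs, topics) for slice t
    gold = rng.integers(0, 5, size=200)                # placeholder gold event labels
    pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(topic_dist)
    print(f"slice {t}: NMI = {normalized_mutual_info_score(gold, pred):.3f}")
```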
In recent years, owing to the advantages of the structural information of Graph Neural Networks (GNNs) in machine learning, GNNs have begun to be combined with deep text clustering. Current deep text clustering algorithms combined with GNNs ignore the important role of the decoder in semantic complementation when fusing text semantic information, resulting in a lack of semantic information in the data generation part. In response to this problem, a Structured Deep text Clustering Model based on multi-layer Semantic fusion (SDCMS) was proposed. In this model, a GNN was utilized to integrate structural information into the decoder, the representation of text data was enhanced through layer-by-layer semantic complementation, and better network parameters were obtained through a triple self-supervision mechanism. Results of experiments carried out on five real datasets, Citeseer, Acm, Reuters, Dblp, and Abstract, show that, compared with the current best Attention-driven Graph Clustering Network (AGCN) model, SDCMS improves accuracy, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) by up to 5.853%, 9.922%, and 8.142% respectively.
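Integrating structural information into the decoder can be sketched as a decoder layer that adds a GCN-style propagation term to the ordinary reconstruction path. The layer below is an illustrative sketch with a placeholder normalized adjacency matrix, not the SDCMS architecture itself.

```python
import torch
import torch.nn as nn

class GraphDecoderLayer(nn.Module):
    """Illustrative decoder layer with structural complement: a GCN-style propagation
    over the document graph is added to the reconstruction path so the decoder also
    sees structural information."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.graph_linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj_norm):                    # h: (N, in_dim), adj_norm: (N, N)
        return torch.relu(self.linear(h) + adj_norm @ self.graph_linear(h))

n, d = 6, 16
h = torch.randn(n, d)
adj = torch.eye(n)                                     # placeholder normalized adjacency
print(GraphDecoderLayer(d, 8)(h, adj).shape)           # torch.Size([6, 8])
```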
Aiming at the poor recognition of sentencing circumstances in adjudication documents caused by the lack of labeled data, low labeling quality, and the strong logicality of the judicial field, a sentencing circumstance recognition model based on abductive learning, named ABL-CON (ABductive Learning in CONfidence), was proposed. Firstly, a neural network was combined with domain logic inference in a semi-supervised manner, and a confidence learning method was used to characterize the confidence of circumstance recognition. Then, the illogical circumstances generated by the neural network on the unlabeled data were corrected, and the recognition model was retrained to improve recognition accuracy. Experimental results on a self-constructed judicial dataset show that the ABL-CON model using 50% labeled data and 50% unlabeled data achieves 90.35% Macro_F1 and 90.58% Micro_F1, which is better than BERT (Bidirectional Encoder Representations from Transformers) and SS-ABL (Semi-Supervised ABductive Learning) under the same conditions and also surpasses the BERT model using 100% labeled data. By correcting illogical labels through logical abduction, the ABL-CON model can effectively improve both the logical rationality of labels and the label recognition ability.
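The correction of illogical circumstances can be sketched with a toy rule set of mutually exclusive labels: low-confidence labels that violate a rule are removed, a simple stand-in for logical abduction, before retraining on the corrected labels. The rules, label names, and threshold below are illustrative assumptions.

```python
def violates(labels, exclusive_pairs):
    """A circumstance set is illogical if it contains mutually exclusive labels."""
    return any(a in labels and b in labels for a, b in exclusive_pairs)

def abductive_correction(pred_labels, label_conf, exclusive_pairs, threshold=0.9):
    """Illustrative ABL-style correction step: drop the least confident label from
    each illogical prediction; the corrected set would then be used for retraining."""
    corrected = []
    for labels, conf in zip(pred_labels, label_conf):
        labels = list(labels)
        while violates(labels, exclusive_pairs) and labels:
            weakest = min(labels, key=lambda l: conf.get(l, 0.0))
            if conf.get(weakest, 0.0) >= threshold:
                break                                  # all conflicting labels are confident
            labels.remove(weakest)
        corrected.append(labels)
    return corrected

preds = [["surrender", "refuse_to_confess"], ["confession"]]
confs = [{"surrender": 0.95, "refuse_to_confess": 0.40}, {"confession": 0.88}]
rules = [("surrender", "refuse_to_confess")]
print(abductive_correction(preds, confs, rules))       # [['surrender'], ['confession']]
```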
Relation extraction aims to extract the semantic relationships between entities from text. As the upstream task of relation extraction, entity recognition generates errors and propagates them to relation extraction, resulting in cascading errors. Compared with entities, entity boundaries are finer-grained and less ambiguous, making them easier to recognize. Therefore, a relation extraction method based on entity boundary combination was proposed to perform relation extraction by skipping complete entities and combining entity boundaries in pairs. Since boundary performance is higher than entity performance, the problem of error propagation was alleviated; in addition, performance was further improved by adding the type features and location features of entities through feature combination, which reduced the impact of error propagation. Experimental results on the ACE 2005 English dataset show that the proposed method outperforms the table-sequence encoders method by 8.61 percentage points in Macro-averaged F1 score.
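The boundary combination idea, pairing boundary representations and enriching them with entity type features before relation classification, can be sketched as follows. The feature choices and dimensions are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from itertools import product

class BoundaryPairClassifier(nn.Module):
    """Illustrative sketch: pair the boundary representations of two candidate
    entities, concatenate type features, and classify the relation between them."""
    def __init__(self, hidden=768, type_dim=32, num_types=8, num_relations=7):
        super().__init__()
        self.type_embed = nn.Embedding(num_types, type_dim)
        self.classifier = nn.Linear(2 * hidden + 2 * type_dim, num_relations)

    def forward(self, boundary_repr, type_ids, pairs):
        # boundary_repr: (N, H), one vector per detected boundary; type_ids: (N,)
        feats = [torch.cat([boundary_repr[i], boundary_repr[j],
                            self.type_embed(type_ids[i]), self.type_embed(type_ids[j])])
                 for i, j in pairs]
        return self.classifier(torch.stack(feats))      # (num_pairs, num_relations)

reprs = torch.randn(4, 768)
types = torch.tensor([1, 2, 1, 3])
pairs = [(i, j) for i, j in product(range(4), range(4)) if i != j]
model = BoundaryPairClassifier()
print(model(reprs, types, pairs).shape)                  # torch.Size([12, 7])
```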
Many methods for liver tumor segmentation in Computed Tomography (CT) images have difficulty achieving accurate segmentation because of inhomogeneous gray levels and fuzzy edges. To obtain precise segmentation results, a method using multi-scale morphology was proposed to eliminate local minima. Firstly, a morphological area operation was used to remove small burrs and irregular edges of the image so as to avoid boundary migration. Secondly, local minima in the gradient image were distinguished using combined knowledge of statistical characteristics and morphological properties, including depth and scale. After partitioning, a functional relationship was established between multi-scale structuring elements and local minima. Then, in order to filter noise with large structuring elements and preserve major objects with small structuring elements, a morphological closing operation was employed to adaptively modify the image. Finally, the standard watershed transform was utilized to segment the liver tumor. Experimental results show that this method can effectively reduce over-segmentation and accurately segment liver tumors while locating object boundaries precisely.
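The overall pipeline (morphological cleanup, suppression of local minima in the gradient image, then the watershed transform) can be sketched with scikit-image. A plain grayscale opening stands in for the area operation, a fixed-size closing stands in for the adaptive multi-scale modification, and the structuring-element sizes, marker threshold, and random array standing in for a CT slice are all assumptions.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import filters, morphology, segmentation

# Illustrative sketch of the pipeline: cleanup, gradient, minima suppression, watershed.
image = np.random.rand(128, 128)                       # placeholder CT slice
denoised = morphology.opening(image, morphology.disk(2))   # remove small burrs/edges
gradient = filters.sobel(denoised)

# Close shallow minima with a structuring element; the paper adapts the element
# size to the estimated depth/scale of each local minimum.
modified = morphology.closing(gradient, morphology.disk(3))

markers, _ = ndi.label(modified < np.percentile(modified, 10))
labels = segmentation.watershed(modified, markers)     # standard watershed transform
print(labels.shape, labels.max())
```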
Focusing on the key exchange problem of how to achieve higher security for neural cryptography within a short synchronization time, a new hybrid algorithm combining the features of "Do not Trust My Partner" (DTMP) and the fast learning rule was proposed. The algorithm sends erroneous output bits over the public channel to disrupt the attacker's eavesdropping on the exchanged bits and reduce the success rate of passive attacks. Meanwhile, the proposed algorithm estimates the synchronization level from the probability of unequal outputs and then adjusts the weight changes according to this level to speed up synchronization. The simulation results show that the proposed algorithm outperforms the original DTMP in the time needed for the partners to synchronize, and is more secure than the original DTMP when the partners do not send erroneous output bits at the same time. The proposed algorithm also clearly outperforms the feedback algorithm in both synchronization time and security. The experimental results show that the proposed algorithm can obtain the key with a high level of security in a shorter synchronization time.
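For context, the basic tree parity machine synchronization underlying DTMP-style neural key exchange can be sketched as below. The DTMP trick of deliberately sending erroneous output bits and the fast learning rule are omitted, and the network sizes K, N, L are assumptions.

```python
import numpy as np

# Illustrative tree parity machine (TPM) synchronization for neural key exchange.
K, N, L = 3, 4, 3                                   # hidden units, inputs per unit, weight bound
rng = np.random.default_rng(42)

def tpm_output(weights, x):
    sigma = np.sign(np.sum(weights * x, axis=1))    # hidden unit outputs
    sigma[sigma == 0] = -1
    return sigma, int(np.prod(sigma))               # overall output bit tau

def hebbian_update(weights, x, sigma, tau):
    # Only hidden units that agree with the machine's output are updated.
    for k in range(K):
        if sigma[k] == tau:
            weights[k] = np.clip(weights[k] + x[k] * tau, -L, L)

w_a = rng.integers(-L, L + 1, size=(K, N))          # partner A's weights
w_b = rng.integers(-L, L + 1, size=(K, N))          # partner B's weights
steps = 0
while not np.array_equal(w_a, w_b) and steps < 100_000:
    x = rng.choice([-1, 1], size=(K, N))            # public random input
    s_a, tau_a = tpm_output(w_a, x)
    s_b, tau_b = tpm_output(w_b, x)
    if tau_a == tau_b:                              # learn only when outputs match
        hebbian_update(w_a, x, s_a, tau_a)
        hebbian_update(w_b, x, s_b, tau_b)
    steps += 1
print("steps:", steps, "synchronized:", np.array_equal(w_a, w_b))
```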
Since the clustering by data competition algorithm performs poorly on complex datasets, a density-sensitive clustering by data competition algorithm was proposed. Firstly, a local distance was defined based on a density-sensitive distance measure to describe the local consistency of the data distribution. Secondly, the global distance was calculated based on the local distance to describe the global consistency of the data distribution and mine the spatial distribution information of the data, which makes up for the inability of the Euclidean distance to describe this global consistency. Finally, the global distance was used in the clustering by data competition algorithm. Comparison experiments between the proposed algorithm and the original clustering by data competition based on Euclidean distance were conducted on synthetic and real-life datasets. The simulation results show that the proposed algorithm achieves better clustering accuracy and overcomes the difficulty that the clustering by data competition algorithm has in handling complex datasets.
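A common form of density-sensitive distance stretches each pairwise Euclidean length with an exponential flexing factor and takes the global distance as the shortest path over the resulting graph, so that paths passing through dense regions are preferred. The sketch below follows that standard construction; the flexing factor rho is an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def density_sensitive_distances(X, rho=2.0):
    """Illustrative sketch: local distance rho^d(x,y) - 1 stretches Euclidean length,
    and the global distance is the shortest path over the fully connected graph."""
    local = np.power(rho, cdist(X, X)) - 1.0           # density-sensitive local distance
    return shortest_path(local, method="D", directed=False)

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
D = density_sensitive_distances(X)
print(D.shape)                                          # (40, 40) global distance matrix
```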