Table of Contents

    10 March 2025, Volume 45 Issue 3
    Frontier research and typical applications of large models
    Survey and prospect of large language models
    Xiaolin QIN, Xu GU, Dicheng LI, Haiwen XU
    2025, 45(3):  685-696.  DOI: 10.11772/j.issn.1001-9081.2025010128

    Large Language Models (LLMs) are a class of language models composed of artificial neural networks with a vast number of parameters (typically billions of weights or more). They are trained on large amounts of unlabeled text using self-supervised or semi-supervised learning and are the core of current generative Artificial Intelligence (AI) technologies. Compared to traditional language models, LLMs demonstrate stronger language understanding and generation capabilities, supported by substantial computational power, extensive parameters, and large-scale data. They are widely applied, with good performance, in tasks such as machine translation, question answering systems, and dialogue generation. Most existing surveys focus on the theoretical construction and training techniques of LLMs, while systematic exploration of LLMs’ industry-level application practices and of the evolution of the technological ecosystem remains insufficient. Therefore, after introducing the foundational architecture, training techniques, and development history of LLMs, the current general key technologies of LLMs and the advanced technologies that integrate with LLM bases were analyzed. Then, by summarizing the existing research, the challenges faced by LLMs in practical applications were elaborated, including data bias, model hallucination, and computational resource consumption, and an outlook on the ongoing development trends of LLMs was provided.

    Bias challenges of large language models: identification, evaluation, and mitigation
    Yuemei XU, Yuqi YE, Xueyi HE
    2025, 45(3):  697-708.  DOI: 10.11772/j.issn.1001-9081.2024091350

    To address the safety and controllability problems caused by biases in the output of Large Language Models (LLMs), the research status, techniques, and limitations related to biases in existing LLMs were reviewed and analyzed in depth from three aspects: bias identification, evaluation, and mitigation. Firstly, three key techniques of LLMs were summarized to examine the root causes of their inevitable intrinsic biases. Secondly, biases in LLMs were categorized into three types, namely linguistic bias, demographic bias, and evaluation bias, and the characteristics and causes of each were explored. Thirdly, the existing LLM bias evaluation benchmarks were reviewed systematically, and the strengths and weaknesses of the general-purpose, language-specific, and task-specific benchmarks were discussed. Finally, current LLM bias mitigation techniques were analyzed in depth from both the model bias mitigation and data bias mitigation perspectives, and directions for their future refinement were pointed out. In addition, future research directions for biases in LLMs were identified: multi-cultural attribute evaluation of bias, lightweight bias mitigation techniques, and enhancement of the interpretability of biases.

    Recognition and optimization of hallucination phenomena in large language models
    Jing HE, Yang SHEN, Runfeng XIE
    2025, 45(3):  709-714.  DOI: 10.11772/j.issn.1001-9081.2024081190

    Large Language Models (LLMs) may generate hallucinations, which makes them difficult to apply fully in real-world fields, especially the medical field, and there is neither a high-quality LLM hallucination evaluation dataset nor a corresponding measure of the degree of LLM hallucination. To address these problems, a method for identifying and optimizing LLM hallucinations in the medical question answering field was proposed. Firstly, based on the publicly available dataset Huatuo, an LLM hallucination evaluation dataset for the medical question answering field was constructed by combining GPT-4 generated question answers with manual annotation. Secondly, based on the constructed dataset, the concept of “hallucination rate” was defined. By designing prompts that require the models under test to answer “yes” or “no”, the degree of hallucination of each LLM was tested and quantified, and the “YES MAN” hallucination phenomenon of LLMs was discovered. Thirdly, GPT-4, an LLM with a low hallucination rate, was used as LeaderAI to provide prior knowledge that assists LLMs with high hallucination rates in making judgments. Finally, to explore whether multiple different LLMs make mistakes on the same problem, the concept of “hallucination collision” was defined, and the hallucination collisions of different LLMs in the medical question answering field were revealed using a probability statistical method. Experimental results show that introducing LeaderAI improves the performance of LLMs with high hallucination rates, so that they can handle the “YES MAN” hallucination phenomenon in medical question answering with a low hallucination rate. Moreover, different LLMs currently have a low probability of hallucinating on the same question (a collision).
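
    The abstract names a “hallucination rate” and a “hallucination collision” without giving formulas. The sketch below is only one plausible reading, under stated assumptions: the hallucination rate is taken as the fraction of yes/no questions a model answers incorrectly, and a collision as a question that every tested model answers incorrectly. The function names (`hallucination_rate`, `yes_share`, `collision_rate`) are hypothetical and do not come from the paper.

```python
from typing import Dict, List


def _norm(answer: str) -> str:
    """Normalize a yes/no answer for comparison."""
    return answer.strip().lower()


def hallucination_rate(predictions: List[str], gold: List[str]) -> float:
    """Fraction of yes/no questions answered incorrectly (assumed definition)."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    wrong = sum(_norm(p) != _norm(g) for p, g in zip(predictions, gold))
    return wrong / len(gold)


def yes_share(predictions: List[str]) -> float:
    """Share of 'yes' answers; a value near 1.0 hints at a "YES MAN" pattern."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(_norm(p) == "yes" for p in predictions) / len(predictions)


def collision_rate(model_predictions: Dict[str, List[str]], gold: List[str]) -> float:
    """Fraction of questions that every model answers incorrectly (assumed definition)."""
    if not model_predictions or not gold:
        raise ValueError("need at least one model and one question")
    if any(len(preds) != len(gold) for preds in model_predictions.values()):
        raise ValueError("every model needs one prediction per question")
    collisions = sum(
        all(_norm(preds[i]) != _norm(g) for preds in model_predictions.values())
        for i, g in enumerate(gold)
    )
    return collisions / len(gold)
```

    With four questions whose gold answers are yes, no, no, yes, a model that always answers “yes” would get a hallucination rate of 0.5 and a yes share of 1.0 under these assumptions.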

    Federated parameter-efficient fine-tuning technology for large model based on pruning
    Hui ZENG, Shiyu XIONG, Yongzheng DI, Hongzhou SHI
    2025, 45(3):  715-724.  DOI: 10.11772/j.issn.1001-9081.2024030322

    With the continuously increasing importance of data privacy, fine-tuning a Pre-trained Foundational Model (PFM) for downstream tasks has become increasingly challenging, leading to the emergence of federated learning research based on PFMs. However, PFMs pose significant challenges to federated learning systems, especially in terms of local computation and communication. Therefore, solutions were proposed for the two main stages of federated learning, local computing and aggregation communication: the local efficient fine-tuning mode and the ring-shaped local aggregation mode, respectively. In the first mode, a model pruning algorithm based on Parameter-Efficient Fine-Tuning (PEFT) was employed to reduce local computation and communication costs. In the second mode, the centralized aggregation method was replaced with a distributed local aggregation scheme to enhance communication efficiency during the aggregation stage. Experimental results demonstrate that the proposed federated parameter-efficient fine-tuning framework for large models performs well in terms of both final performance and efficiency.
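
    The ring-shaped local aggregation mode can be pictured with a small sketch. The abstract does not specify the exchange rule, so this assumes that in each round every client averages its trainable-parameter vector with the one held by its predecessor on the ring, with no central server. `ring_aggregate` and the flat-list parameter format are hypothetical, chosen only for illustration.

```python
from typing import List


def ring_aggregate(client_params: List[List[float]], rounds: int = 1) -> List[List[float]]:
    """Decentralized averaging on a ring (illustrative assumption, not the paper's rule).

    In each round, client i replaces its vector with the element-wise mean of
    its own vector and the vector of client (i - 1) mod n.
    """
    n = len(client_params)
    if n == 0:
        return []
    params = [list(p) for p in client_params]  # copy so the input is not mutated
    for _ in range(rounds):
        # Build the new state from the previous round's state for all clients at once.
        params = [
            [(own + prev) / 2.0 for own, prev in zip(params[i], params[(i - 1) % n])]
            for i in range(n)
        ]
    return params
```

    Under this rule each client communicates with only one neighbor per round, the global sum of parameters is preserved, and repeated rounds pull every client toward the global mean.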

    Efficient fine-tuning method of large language models for test case generation
    Peng CAO, Guangqi WEN, Jinzhu YANG, Gang CHEN, Xinyi LIU, Xuechun JI
    2025, 45(3):  725-731.  DOI: 10.11772/j.issn.1001-9081.2024111598

    Data-driven automated generation of unit test cases suffers from low coverage and poor readability, and struggles to meet the increasing demand for testing. Recently, Large Language Models (LLMs) have shown great potential in code generation tasks. However, due to the differences in functional and coding styles of code data, LLMs face the challenges of catastrophic forgetting and resource constraints. To address these problems, a transfer learning idea of fine-tuning coding and functional styles simultaneously was proposed, and an efficient fine-tuning method was developed for LLMs that generate unit test cases. Firstly, widely used instruction datasets were adopted to align the LLM with instructions, and the instruction sets were divided by task type. At the same time, the weight increments with task-specific features were extracted and stored. Secondly, an adaptive style extraction module was designed to handle various coding styles, with noise-resistant learning and coding style backtracking learning in the module. Finally, the functional and coding style increments were trained jointly on the target domain, thereby realizing efficient adaptation and fine-tuning on target domains with limited resources. Experimental results of test case generation on the SF110 Corpus of Classes dataset indicate that the proposed method outperforms the compared methods. Compared to the mainstream code generation LLMs Codex, Code Llama, and DeepSeek-Coder, the proposed method increases the compilation rate by 0.8%, 43.5%, and 33.8%, respectively; the branch coverage by 3.1%, 1.0%, and 17.2%, respectively; and the line coverage by 4.1%, 6.5%, and 15.5%, respectively, verifying the superiority of the proposed method in code generation tasks.

    Commonsense question answering model based on cross-modal contrastive learning
    Yuanlong WANG, Tinghua LIU, Hu ZHANG
    2025, 45(3):  732-738.  DOI: 10.11772/j.issn.1001-9081.2024081139

    Commonsense Question Answering (CQA), a task in the intelligent question answering field, aims to use commonsense knowledge to automatically and accurately answer questions described in natural language. Typically, this task demands background commonsense knowledge to enhance the problem-solving capability of the model. Most related methods rely on extracting and utilizing commonsense from textual data; however, commonsense is often implicit and not always represented directly in the text, which limits the application range and effectiveness of these methods. Therefore, a CQA model based on cross-modal contrastive learning was proposed to fully utilize cross-modal information for enriching the expression of commonsense knowledge. Firstly, a cross-modal commonsense representation module was designed to integrate commonsense bases with a cross-modal large model, thereby obtaining a cross-modal commonsense representation. Secondly, to enhance the ability of the model to distinguish among different options, contrastive learning was carried out on the cross-modal representations of questions and options. Finally, a softmax layer was used to generate relevance scores for the question-option pairs, and the option with the highest score was taken as the final predicted answer. Experimental results on the public datasets CommonSenseQA (CSQA) and OpenBookQA (OBQA) show that compared to DEKCOR (DEscriptive Knowledge for COmmonsense question answeRing), the proposed model improves accuracy by 1.46 and 0.71 percentage points, respectively.
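
    The final prediction step described above (softmax over question-option scores, then picking the highest) can be sketched as follows. This is a minimal illustration that assumes the model has already produced one raw score per option; how those scores are computed from the cross-modal representations is not covered here, and `predict_option` is a hypothetical name.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_option(option_scores: List[float]) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring option and all relevance scores."""
    probs = softmax(option_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

    For a five-option question, `predict_option([1.0, 3.0, 0.5, 2.0, -1.0])` would select index 1, the option with the largest raw score.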

    Visual question answering model based on association and fusion of multiple semantic features
    Hao ZHOU, Chao WANG, Guoheng CUI, Tingjin LUO
    2025, 45(3):  739-745.  DOI: 10.11772/j.issn.1001-9081.2024050660

    Bridging the semantic gap between visual images and text-based questions is the key to improving the reasoning accuracy of Visual Question Answering (VQA) models. However, most existing models rely on extracting low-level image features and using attention mechanisms to reason about and obtain answers, while ignoring the important role that high-level image semantic features, such as relationship features and attribute features, play in visual reasoning. To solve this problem, a VQA model based on multi-semantic association and fusion was proposed to establish semantic associations between questions and images. Firstly, based on a scene graph generation framework, multiple semantic features in images were extracted and refined as the feature input of the VQA model to fully explore the information in visual scenes. Secondly, to enhance the semantic value of image features, an information filter was designed to remove noise and redundant information from the image features. Finally, a multi-layer attention fusion and reasoning module was designed to fuse each of the multiple image semantics with question features and to strengthen the semantic association between the important regions of images and the questions. Experimental results show that compared with the Bilinear Attention Network (BAN) and Coarse-to-Fine Reasoning (CFR) models, the proposed model increases the accuracy on the VQA2.0 test set by 2.9 and 0.4 percentage points, respectively, and the accuracy on the GQA test set by 17.2 and 0.3 percentage points, respectively, demonstrating that the proposed model can better understand the semantics in image scenes and answer compositional visual questions.

    Multi-strategy retrieval-augmented generation method for military domain knowledge question answering systems
    Yanping ZHANG, Meifang CHEN, Changhai TIAN, Zibo YI, Wenpeng HU, Wei LUO, Zhunchen LUO
    2025, 45(3):  746-754.  DOI: 10.11772/j.issn.1001-9081.2024060833

    The military domain knowledge question answering system based on Retrieval-Augmented Generation (RAG) has gradually become an important tool for modern intelligence personnel to collect and analyze intelligence. To address the poor portability of current RAG application strategies in hybrid retrieval, as well as the semantic drift easily caused by unnecessary query rewriting, a Multi-Strategy Retrieval-Augmented Generation (MSRAG) method was proposed. Firstly, the retrieval model was matched adaptively, based on the query characteristics of the user input, to recall relevant text. Secondly, a text filter was utilized to extract the key text fragments that can answer the question. Thirdly, the content validity was assessed by the text filter to trigger query rewriting based on synonym expansion, and the initial query was merged with the rewritten information and used as the input of the retrieval controller for more targeted re-retrieval. Finally, the key text fragments that can answer the question were merged with the question, the merged input was passed through prompt engineering to the answer generation model, and the response generated by the model was returned to the user. Experimental results show that compared to the convex linear combination RAG method, the MSRAG method improves ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest common subsequence) by 14.35 percentage points on the military domain dataset (Military) and by 5.83 percentage points on the Medical dataset. It can be seen that the MSRAG method has strong universality and portability, reduces the semantic drift caused by unnecessary query rewriting, and effectively helps large language models generate more accurate answers.
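
    ROUGE-L, the metric reported above, scores a generated answer by its longest common subsequence (LCS) with a reference answer. The sketch below computes the usual LCS-based precision, recall, and F-measure. It assumes whitespace tokenization and a default `beta` of 1.0; the tokenization and weighting actually used in the paper are not stated in the abstract, and Chinese text would typically need character-level or segmenter-based tokens instead.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure between a candidate and a reference string."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

    A gain measured in percentage points, as in the abstract, corresponds to the difference between two such scores multiplied by 100.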

    ScholatGPT: a large language model for academic social networks and its intelligent applications
    Chengzhe YUAN, Guohua CHEN, Dingding LI, Yuan ZHU, Ronghua LIN, Hao ZHONG, Yong TANG
    2025, 45(3):  755-764.  DOI: 10.11772/j.issn.1001-9081.2024101477

    To address the limitations of existing Large Language Models (LLMs) in processing cross-domain knowledge, updating real-time academic information, and ensuring output quality, ScholatGPT, a scholar LLM based on Academic Social Networks (ASNs), was proposed. In ScholatGPT, the abilities of precise semantic retrieval and dynamic knowledge update were enhanced by integrating Knowledge-Graph Augmented Generation (KGAG) and Retrieval-Augmented Generation (RAG), and optimization and fine-tuning were used to improve the generation quality of academic text. Firstly, a scholar knowledge graph was constructed based on relational data from SCHOLAT, with LLMs employed to enrich the graph semantically. Then, a KGAG-based retrieval model was introduced and combined with RAG to realize multi-path hybrid retrieval, thereby enhancing the model’s search precision. Finally, fine-tuning techniques were applied to optimize the model’s generation quality in academic fields. Experimental results demonstrate that ScholatGPT achieves a precision of 83.2% in academic question answering tasks, outperforming GPT-4o and AMiner AI by 69.4 and 11.5 percentage points, respectively, and performs well in tasks such as scholar profiling, representative work identification, and research field classification. Furthermore, ScholatGPT obtains stable and competitive results in answer relevance, coherence, and readability, achieving a good balance between specialization and readability. Additionally, ScholatGPT-based intelligent applications such as a scholar think tank and an academic information recommendation system effectively improve the efficiency of academic resource acquisition.

    Design and practice of intelligent tutoring algorithm based on personalized student capability perception
    Yanmin DONG, Jiajia LIN, Zheng ZHANG, Cheng CHENG, Jinze WU, Shijin WANG, Zhenya HUANG, Qi LIU, Enhong CHEN
    2025, 45(3):  765-772.  DOI: 10.11772/j.issn.1001-9081.2024101550

    With the rapid development of Large Language Models (LLMs), LLM-based dialogue assistants have emerged as a new way for students to learn. These assistants generate answers through interactive Q&A, helping students solve problems and improve learning efficiency. However, existing conversational assistants ignore students’ personalized needs and fail to provide the personalized answers required for “tailored instruction”. To address this, a personalized conversational assistant framework based on student capability perception was proposed, which consists of two main modules: a capability perception module that analyzes students’ exercise records to estimate their knowledge proficiency, and a personalized answer generation module that creates personalized answers based on students’ capabilities. Three implementation paradigms (instruction-based, data-driven, and agent-based) were designed to explore the framework’s practical effects. In the instruction-based assistant, the inference capabilities of LLMs were used to infer students’ knowledge proficiency from their exercise records to help generate personalized answers; in the data-driven assistant, which is driven by a small model, a Deep Knowledge Tracing (DKT) model was employed to estimate students’ knowledge proficiency; in the agent-based assistant, tools such as student capability perception, personalized detection, and answer correction were integrated using the LLM agent method to assist answer generation. Comparison experiments using Chat General Language Model (ChatGLM) and GPT4o_mini demonstrate that LLMs applying any of the three paradigms can provide personalized answers for students, and that the agent-based paradigm achieves higher accuracy, indicating its superior student capability perception and personalized answer generation.

    Personalized learning recommendation in collaboration of knowledge graph and large language model
    Xuefei ZHANG, Liping ZHANG, Sheng YAN, Min HOU, Yubo ZHAO
    2025, 45(3):  773-784.  DOI: 10.11772/j.issn.1001-9081.2024070971

    As an important research topic in the field of smart education, personalized learning recommendation has the core goal of using recommendation algorithms and models to provide learners with effective learning resources that match their individual learning needs, interests, abilities, and histories, so as to improve their learning outcomes. Current recommendation methods suffer from problems such as cold start, data sparsity, poor interpretability, and over-personalization, and the combination of knowledge graphs and Large Language Models (LLMs) provides strong support for solving these problems. Firstly, the concepts and current research status of personalized learning recommendation were reviewed. Secondly, the concepts of knowledge graphs and LLMs and their specific applications in personalized learning recommendation were discussed separately. Thirdly, the collaborative application methods of knowledge graphs and LLMs in personalized learning recommendation were summarized. Finally, the future development directions of knowledge graphs and LLMs in personalized learning recommendation were discussed to provide reference and inspiration for continuous development and innovative practice in this field.

    Construction of digital twin water conservancy knowledge graph integrating large language model and prompt learning
    Yan YANG, Feng YE, Dong XU, Xuejie ZHANG, Jin XU
    2025, 45(3):  785-793.  DOI: 10.11772/j.issn.1001-9081.2024050570

    Constructing a knowledge graph of digital twin water conservancy construction to mine the potential relationships between water conservancy construction objects can help the relevant personnel optimize the design scheme and decision-making process of water conservancy construction. To address the interdisciplinary and complex knowledge structure of digital twin water conservancy construction, as well as the insufficient learning and low extraction accuracy of general knowledge extraction models in the water conservancy domain, a Digital Twin water conservancy construction Knowledge Extraction method based on Large Language Model (DTKE-LLM) was proposed to improve the accuracy of knowledge extraction. In this method, a local Large Language Model (LLM) was deployed through LangChain and integrated with digital twin water conservancy domain knowledge, and prompt learning was used to fine-tune the LLM. The semantic understanding and generation capabilities of the LLM were utilized to extract knowledge. At the same time, a heterogeneous entity alignment strategy was designed to optimize the entity extraction results. Comparison experiments and ablation experiments were carried out on the water conservancy domain corpus to verify the effectiveness of DTKE-LLM. Results of the comparison experiments demonstrate that DTKE-LLM outperforms the deep learning-based named entity recognition model BiLSTM-CRF (Bidirectional Long Short-Term Memory Conditional Random Field) and the general information extraction model UIE (Universal Information Extraction) in precision. Results of the ablation experiments show that compared with ChatGLM2-6B (Chat General Language Model 2, 6 billion parameters), DTKE-LLM improves the F1 scores of entity extraction and relation extraction by 5.5 and 3.2 percentage points, respectively. It can be seen that the proposed method realizes the construction of the digital twin water conservancy construction knowledge graph while ensuring the quality of knowledge graph construction.
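
    The precision and F1 figures quoted above are standard extraction metrics. As a rough illustration, the sketch below computes them by exact set matching of predicted items against gold items. The (text, type) tuple format, the `extraction_prf1` name, and the example entities are assumptions made for illustration, not the paper's evaluation protocol.

```python
from typing import Set, Tuple

Item = Tuple[str, str]  # e.g. (entity text, entity type); hypothetical format


def extraction_prf1(predicted: Set[Item], gold: Set[Item]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 under exact matching of extracted items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

    The same function applies to relation extraction if each item is a (head, relation, tail) triple instead of a pair, with the type alias adjusted accordingly.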

    Synaesthesia metaphor analysis based on large language model and data augmentation
    Kun SHENG, Zhongqing WANG
    2025, 45(3):  794-800.  DOI: 10.11772/j.issn.1001-9081.2024091251

    Chinese synaesthesia metaphor analysis is a specific subtask in the metaphor domain. The uneven distribution of sensory words in synaesthesia corpora leads to data sparsity in Chinese synaesthesia metaphor datasets. To address this issue, sparse sensory word data from the real training data were used as prompts, and additional synthetic samples were generated by a large language model for data augmentation. To prevent the noise introduced by synthetic data from affecting model performance, a data augmentation framework based on a large language model was constructed. In addition, a scoring mechanism and a label error optimization mechanism were applied to reduce the distribution differences between synthetic and real data. Experimental results show that the proposed framework can generate high-quality synthetic data to expand the dataset, and achieves an overall F1 value of 68.5% in sensory word extraction and sensory domain classification tasks, an improvement of 2.7 percentage points over the baseline model T5 (Text-To-Text Transfer Transformer) trained only on real training data.

    Large language model prompt generation method for engineering drawing understanding
    Chenwei SUN, Junli HOU, Xianggen LIU, Jiancheng LYU
    2025, 45(3):  801-807.  DOI: 10.11772/j.issn.1001-9081.2024101537

    In recent years, Large Language Models (LLMs) have demonstrated excellent language understanding and dialogue capabilities in fields such as natural language processing and computer vision. However, in professional fields they can produce inference results that are inconsistent with the correct answers, which poses significant challenges to applying LLMs in decision-making tasks that demand precision and accuracy. To solve this problem, a rule-guided Post Prompt of Large Language Model (PP-LLM) generation method was proposed. In this method, by generating post prompts, the original problem was transformed into two sub-problems that are easier to solve, thereby introducing expert knowledge and reducing the difficulty of task learning. Specifically, knowledge-guided rules were used to transform the output part of the supervised dataset into a combination of post prompts and the output portion. The PP-LLM method does not change the training and inference processes of the model, and does not add computational cost. Experimental results show that the PP-LLM method significantly improves the accuracy of inference results and narrows the gap between model predictions and actual answers. Compared with the results obtained without the proposed method, the F1 value and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score of the PP-LLM method are significantly improved. It can be seen that this work improves the reliability of LLMs in professional applications and provides new ideas for LLM generation technology.

    Text-based person retrieval method based on multi-granularity shared semantic center association
    Bin KANG, Bin CHEN, Junjie WANG, Yulin LI, Junzhi ZHAO, Weizhi XIAN
    2025, 45(3):  808-814.  DOI: 10.11772/j.issn.1001-9081.2024101434

    Text-based person retrieval aims to identify a specific person using textual descriptions as queries. Existing state-of-the-art methods typically design multiple alignment mechanisms to achieve correspondence among cross-modal data at both global and local levels, but they neglect the mutual influence among these mechanisms. To address this, a multi-granularity shared semantic center association mechanism was proposed to explore the promoting and inhibiting effects between global and local alignments. Firstly, a multi-granularity cross-alignment module was introduced to enhance image-sentence and local region-word interactions, achieving multi-level alignment of the cross-modal data in a joint embedding space. Then, a shared semantic center was established to serve as a learnable semantic hub, and associations among global and local features were used to enhance semantic consistency among different alignment mechanisms and promote the collaborative effect of global and local features. In the shared semantic center, the local and global cross-modal similarity relationships between image and text features were calculated, providing a complementary measure from both global and local perspectives and maximizing positive effects among multiple alignment mechanisms. Finally, experiments were carried out on the CUHK-PEDES dataset. Results show that compared to the baseline method, the proposed method significantly improves Rank-1 by 8.69 percentage points and mean Average Precision (mAP) by 6.85 percentage points. The proposed method also achieves excellent performance on the ICFG-PEDES and RSTPReid datasets, significantly surpassing all the compared methods.
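
    Rank-1 and mAP, the two retrieval metrics reported above, can be computed from ranked gallery lists as in the following sketch. It assumes each text query comes with the gallery identities already sorted by descending similarity; the `rank1_and_map` name and the input format are hypothetical, and dataset-specific evaluation details are not modeled.

```python
from typing import List, Tuple


def rank1_and_map(rankings: List[List[str]], query_ids: List[str]) -> Tuple[float, float]:
    """Rank-1 accuracy and mean Average Precision over a set of queries.

    rankings[i] lists gallery identities for query i, best match first.
    query_ids[i] is the true identity described by query i.
    """
    if not query_ids or len(rankings) != len(query_ids):
        raise ValueError("rankings and query_ids must be non-empty and equal in length")
    rank1_hits = 0
    ap_sum = 0.0
    for ranking, qid in zip(rankings, query_ids):
        if ranking and ranking[0] == qid:
            rank1_hits += 1
        found = 0
        precisions = []
        for rank, gid in enumerate(ranking, start=1):
            if gid == qid:
                found += 1
                precisions.append(found / rank)  # precision at each correct hit
        ap_sum += sum(precisions) / len(precisions) if precisions else 0.0
    n = len(query_ids)
    return rank1_hits / n, ap_sum / n
```

    Rank-1 only looks at the top result, while mAP rewards placing every image of the queried person near the top, which is why the two are usually reported together.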

    Speaker-emotion voice conversion method with limited corpus based on large language model and pre-trained model
    Chaofeng LU, Ye TAO, Lianqing WEN, Fei MENG, Xiugong QIN, Yongjie DU, Yunlong TIAN
    2025, 45(3):  815-822.  DOI: 10.11772/j.issn.1001-9081.2024010013

    To address the problems that few studies combine speaker conversion with emotional voice conversion, and that the emotional corpus of a target speaker in real scenarios is usually too small to train a model with strong generalization from scratch, a Speaker-Emotion Voice Conversion with Limited corpus (LSEVC) method was proposed, which fuses a large language model with a pre-trained emotional speech synthesis model. Firstly, a large language model was used to generate text with the required emotion tags. Secondly, a pre-trained emotional speech synthesis model was fine-tuned on the target speaker corpus to embed the target speaker. Thirdly, emotional speech was synthesized from the generated text for data augmentation. Fourthly, the synthesized speech and source target speech were used to co-train the speaker-emotion voice conversion model. Finally, to further enhance the speaker similarity and emotional similarity of the converted speech, the model was fine-tuned using the source target speaker’s emotional speech. Experiments were conducted on publicly available corpora and a Chinese fiction corpus. Experimental results show that the proposed method outperforms CycleGAN-EVC, Seq2Seq-EVC-WA2, SMAL-ET2, and other methods on the following evaluation indicators: Emotional similarity Mean Opinion Score (EMOS), Speaker similarity Mean Opinion Score (SMOS), Mel Cepstral Distortion (MCD), and Word Error Rate (WER).

    Vision foundation model-driven pixel-level image anomaly detection method
    Zhenhua XUE, Qiang LI, Chao HUANG
    2025, 45(3):  823-831.  DOI: 10.11772/j.issn.1001-9081.2024091398

    Although previous anomaly detection methods have achieved high-precision detection in specific scenarios, their applicability is constrained by a lack of generalizability and automation. Thus, a Vision Foundation Model (VFM)-driven pixel-level image anomaly detection method, namely SSMOD-Net (State Space Model driven-Omni Dimensional Net), was proposed with the aim of achieving more accurate industrial defect detection. Unlike existing methods, SSMOD-Net prompts SAM (Segment Anything Model) automatically without fine-tuning SAM, making it particularly suitable for scenarios that require processing large-scale industrial visual data. The core of SSMOD-Net is a novel prompt encoder driven by a state space model, which generates prompts dynamically based on the input image of SAM. With this design, the model can introduce additional guidance information through the prompt encoder while preserving SAM’s architecture, thereby enhancing detection accuracy. A residual multi-scale module, constructed on the state space model, is integrated in the prompt encoder and uses multi-scale and global information comprehensively. Through iterative search, the module finds optimal prompts in the prompt space and provides them to SAM as high-dimensional tensors, thereby strengthening the model’s ability to recognize industrial anomalies. Moreover, the proposed method requires no modifications to SAM, thereby avoiding complex fine-tuning of training schedules. Experimental results on several datasets show that the proposed method performs excellently, achieving better results in mean E-measure (mE), Mean Absolute Error (MAE), Dice, and Intersection over Union (IoU) than methods such as AutoSAM and SAM-EG (SAM with Edge Guidance framework for efficient polyp segmentation).
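
    Three of the pixel-level metrics named above (MAE, Dice, and IoU) are easy to illustrate on a small mask. The sketch below assumes the prediction is a 2-D grid of probabilities in [0, 1], the ground truth is a binary mask of the same shape, and a threshold of 0.5 binarizes the prediction for Dice and IoU; these choices and the `segmentation_metrics` name are assumptions. The E-measure is left out because its definition is more involved.

```python
from typing import Dict, List


def segmentation_metrics(
    pred: List[List[float]], target: List[List[int]], threshold: float = 0.5
) -> Dict[str, float]:
    """MAE on raw probabilities, plus Dice and IoU on the thresholded mask."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if not flat_target or len(flat_pred) != len(flat_target):
        raise ValueError("pred and target must be non-empty and have the same size")

    mae = sum(abs(p - t) for p, t in zip(flat_pred, flat_target)) / len(flat_target)

    binary = [1 if p >= threshold else 0 for p in flat_pred]
    intersection = sum(b * t for b, t in zip(binary, flat_target))
    union = sum(1 for b, t in zip(binary, flat_target) if b or t)
    total = sum(binary) + sum(flat_target)

    # Two empty masks agree perfectly, so both overlap scores are set to 1.0.
    dice = 2 * intersection / total if total else 1.0
    iou = intersection / union if union else 1.0
    return {"mae": mae, "dice": dice, "iou": iou}
```

    Lower is better for MAE, while higher is better for Dice and IoU.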

    Privacy preserving localization of surveillance images based on large vision models
    Qiang LI, Shaoxiong BAI, Yuan XIONG, Wei YUAN
    2025, 45(3):  832-839.  DOI: 10.11772/j.issn.1001-9081.2024101538

    Visual localization of surveillance images is an important technology in industrial intelligence. The existing visual localization algorithms lack the protection of the privacy information in the image and may lead to the leakage of sensitive content during data transmission. To address the problem, a localization method of surveillance images based on Large Vision Models (LVMs) was proposed. Firstly, the architecture of LVM privacy preserving-based visual localization was designed to transfer the style of input images by using a few prompts and reference images. Then, a feature matching algorithm for the image with style transfer was designed to estimate the camera pose. Experimental results on public datasets show that the localization error of the proposed algorithm is relatively small, demonstrating that the algorithm reduces the privacy leakage significantly while ensuring the localization accuracy.

    Crop disease recognition method based on multi-modal data fusion
    Wei CHEN, Changyong SHI, Chuanxiang MA
    2025, 45(3):  840-848.  DOI: 10.11772/j.issn.1001-9081.2024091297

    Current deep learning-based methods for crop disease recognition rely on specific image datasets of crop diseases for image representation learning, and do not consider the importance of text features in assisting image feature learning. To enhance feature extraction and disease recognition capabilities of the model for crop disease images more effectively, a Crop Disease Recognition method through multi-modal data fusion based on Contrastive Language-Image Pre-training (CDR-CLIP) was proposed. Firstly, high-quality disease recognition image-text pair datasets were constructed to enhance image feature representation through textual information. Then, a multi-modal fusion strategy was applied to integrate text and image features effectively, which strengthened the model capability of distinguishing diseases. Finally, specialized pre-training and fine-tuning strategies were designed to optimize the model’s performance in specific crop disease recognition tasks. Experimental results demonstrate that CDR-CLIP achieves the disease recognition accuracies of 99.31% and 87.66% with F1 values of 99.04% and 87.56%, respectively, on PlantVillage and AI Challenger 2018 crop disease datasets. On PlantDoc dataset, CDR-CLIP achieves the mean Average Precision mAP@0.5 of 51.10%, showing the strong performance advantage of CDR-CLIP.

    Chinese spelling correction method based on LLM with multiple inputs
    Can MA, Ruizhang HUANG, Lina REN, Ruina BAI, Yaoyao WU
    2025, 45(3):  849-855.  DOI: 10.11772/j.issn.1001-9081.2024091325

    Chinese Spelling Correction (CSC) is an important research task in Natural Language Processing (NLP). The existing CSC methods based on Large Language Models (LLMs) may generate semantic discrepancies between the corrected results and the original content. Therefore, a CSC method based on LLM with multiple inputs was proposed. The method consists of two stages: multi-input candidate set construction and LLM correction. In the first stage, a multi-input candidate set was constructed using error correction results of several small models. In the second stage, LoRA (Low-Rank Adaptation) was employed to fine-tune the LLM, which means that with the aid of reasoning capabilities of the LLM, sentences without spelling errors were deduced from the multi-input candidate set and used as the final error correction results. Experimental results on the public datasets SIGHAN13, SIGHAN14, SIGHAN15 and revised SIGHAN15 show that the proposed method has the correction F1 value improved by 9.6, 24.9, 27.9, and 34.2 percentage points, respectively, compared to the method Prompt-GEN-1, which generates error correction results directly using an LLM. Compared with the sub-optimal error correction small model, the proposed method has the correction F1 value improved by 1.0, 1.1, 0.4, and 2.4 percentage points, respectively, verifying the proposed method’s ability to enhance the effect of CSC tasks.

    Cyber security
    Lazy client identification method in federated learning based on proof-of-work
    Haili LIN, Jing LI
    2025, 45(3):  856-863.  DOI: 10.11772/j.issn.1001-9081.2024030296

    In today’s society with the growing demand for privacy protection, federated learning is receiving widespread attention. However, in federated learning, it is difficult for the server to supervise behaviors of clients, so that the existence of lazy clients poses a potential threat to the performance and fairness of federated learning. Aiming at the problem of how to identify lazy clients efficiently and accurately, a dual-task proof-of-work method based on backdoor was proposed, namely FedBD (FedBackDoor). In FedBD, additional backdoor tasks that are easier to detect were allocated by the server for the clients participating in federated learning, the backdoor tasks were trained by the clients based on the original training tasks, and the clients’ behaviors were supervised by the server indirectly through training status of the backdoor tasks. Experimental results show that FedBD has certain advantages over the classic federated averaging algorithm FedAvg and the advanced algorithm GTG-Shapley (Guided Truncation Gradient Shapley) on datasets such as MNIST and CIFAR10. On CIFAR10 dataset, when the proportion of lazy clients is 15%, FedBD improves the accuracy by more than 10 percentage points compared with FedAvg, and increases the accuracy by 2 percentage points compared with GTG-Shapley. Moreover, the average training time of FedBD is only 11.8% of that of GTG-Shapley, and the accuracy of FedBD in identifying lazy clients can exceed 99% when the proportion of lazy clients is 10%. It can be seen that FedBD can solve the problem of lazy clients being difficult to supervise.

    Stacking ensemble adversarial defense method for encrypted malicious traffic detection model
    Ruilong CHEN, Tao HU, Youjun BU, Peng YI, Xianjun HU, Wei QIAO
    2025, 45(3):  864-871.  DOI: 10.11772/j.issn.1001-9081.2024030327

    Currently, deep learning-based traffic classification models are used widely for encrypted malicious traffic classification. However, adversarial attack samples faced by deep learning models severely impact the detection accuracy and availability of these models. Therefore, an adversarial defense method for encrypted malicious traffic detection models was proposed, namely D-SE (Detector-Stacking Ensemble). D-SE employed a stacking ensemble learning framework, which was divided into an adversarial defense layer and a decision layer. The former was used to detect potential adversarial traffic samples, including three classifiers — Residual Network (ResNet), CNN-LSTM, and Vision Transformer (ViT), and a multilayer perceptron as an adversarial attack detector. Based on the predicted probability distribution of the classifiers, the existence of adversarial attack was detected by the multilayer perceptron. To improve the detection performance of the detector for adversarial samples, the detector was enhanced via adversarial training. In the decision layer, a joint decision module based on voting and weight mechanism was designed, and through a majority rule decision mechanism and a high-weight-preference mechanism, excessive dependence on some classifiers was alleviated in the final prediction. The performance of D-SE was tested on USTC-TFC2016 dataset, and the results show that the accuracy of D-SE is over 96% in the non-adversarial environment, and the accuracy of D-SE is more than 89% in the white-box attack environment. It can be seen that D-SE has certain ability of adversarial defense.
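
    The decision layer described above can be illustrated with a minimal sketch. The abstract does not give the exact voting or weighting rules, so the function name `joint_decision`, the strict-majority threshold, and the per-classifier weights below are all assumptions made for illustration only.

```python
from collections import Counter


def joint_decision(predictions, weights):
    """Combine classifier outputs (hypothetical rule, not the paper's exact one).

    predictions: one predicted label per classifier.
    weights:     one trust weight per classifier (assumed to be given).
    """
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    # Majority rule: accept a label backed by more than half of the classifiers.
    if count > len(predictions) / 2:
        return label
    # High-weight preference: otherwise pick the label with the largest total weight.
    totals = {}
    for pred, weight in zip(predictions, weights):
        totals[pred] = totals.get(pred, 0.0) + weight
    return max(totals, key=totals.get)


majority = joint_decision(["malicious", "malicious", "benign"], [0.2, 0.3, 0.5])
fallback = joint_decision(["a", "b", "c"], [0.2, 0.3, 0.5])
```

    In this sketch a strict majority always wins, and the weights only matter when the classifiers disagree completely, which is one simple way to limit dependence on any single classifier.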

    Encrypted traffic classification method based on Attention-1DCNN-CE
    Haijun GENG, Yun DONG, Zhiguo HU, Haotian CHI, Jing YANG, Xia YIN
    2025, 45(3):  872-882.  DOI: 10.11772/j.issn.1001-9081.2024030325

    To address the problems of low multi-classification accuracy, poor generalization, and easy privacy invasion in traditional encrypted traffic identification methods, a multi-classification deep learning model that combines Attention mechanism (Attention) with one-Dimensional Convolutional Neural Network (1DCNN) was proposed, namely Attention-1DCNN-CE. This model consists of three core components: 1) in the dataset preprocessing stage, the spatial relationship among packets in the original data stream was retained, and a cost-sensitive matrix was constructed on the basis of the sample distribution; 2) based on the preliminary extraction of encrypted traffic features, the Attention and 1DCNN models were used to mine deeply and compress the global and local features of the traffic; 3) in response to the challenge of data imbalance, by combining the cost-sensitive matrix with the Cross Entropy (CE) loss function, the sample classification accuracy of minority class was improved significantly, thereby optimizing the overall performance of the model. Experimental results show that on BOT-IOT and TON-IOT datasets, the overall identification accuracy of this model is higher than 97%. Additionally, on public datasets ISCX-VPN and USTC-TFC, this model performs excellently, and achieves performance similar to that of ET-BERT (Encrypted Traffic BERT) without the need for pre-training. Compared to Payload Encoding Representation from Transformer (PERT) on ISCX-VPN dataset, this model improves the F1 score in application type detection by 29.9 percentage points. The above validates the effectiveness of this model, so that this model provides a solution for encrypted traffic identification and malicious traffic detection.
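
    The idea of combining class costs with the cross entropy loss can be sketched as follows. The abstract does not specify how the cost-sensitive matrix is built, so this example simplifies it to one inverse-frequency weight per class; `class_weights` and `weighted_cross_entropy` are hypothetical names, not the authors' code.

```python
import math


def class_weights(counts):
    """Inverse-frequency class weights, scaled to average 1 (an assumed scheme)."""
    total = sum(counts)
    raw = [total / c for c in counts]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def weighted_cross_entropy(logits, label, weights):
    """Cross entropy for one sample, scaled by the weight of its true class."""
    probs = softmax(logits)
    return -weights[label] * math.log(probs[label])


# Three classes with a strongly imbalanced sample distribution.
weights = class_weights([900, 90, 10])
loss_major = weighted_cross_entropy([1.0, 1.0, 1.0], 0, weights)
loss_minor = weighted_cross_entropy([1.0, 1.0, 1.0], 2, weights)
```

    With identical predictions, a mistake on the minority class contributes a larger loss than one on the majority class, which pushes training toward better minority-class accuracy.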

    Image adversarial example generation method based on multi-space probability enhancement
    Huahua WANG, Zijian FAN, Ze LIU
    2025, 45(3):  883-890.  DOI: 10.11772/j.issn.1001-9081.2024040495

    Adversarial examples can evaluate the robustness and safety of deep neural networks effectively. Aiming at the problem of low success rate of adversarial attacks in black-box scenarios and to improve the transferability of adversarial examples, a Multi-space Probability Enhancement Adversarial example generation Method (MPEAM) was proposed. The transferability of the adversarial examples was improved by introducing two random data enhancement branches into the adversarial example generation method. One branch implemented random image Cropping and Padding (CP) in the pixel space, and the other implemented random Color Changing (CC) in HSV color space. At the same time, the returned image examples were controlled by constructing a probability model, which increased the diversity of the original examples while decreasing the dependence of the adversarial examples on the original dataset, thereby enhancing the transferability of adversarial examples. On this basis, the proposed method was introduced into the integration model to further improve the success rate of the adversarial example attack in black-box scenarios. Extensive experiments on ImageNet dataset show that the proposed method improves the black-box attack success rate by 28.72 and 8.44 percentage points on average, respectively, compared to the benchmark methods Iterative Fast Gradient Sign Method (IFGSM) and Momentum Iterative Fast Gradient Sign Method (MIFGSM), and improves the black-box attack success rate by up to 6.81 percentage points compared to the attack methods based on single-space probability enhancement. The above indicates that the proposed method can improve the transferability of adversarial examples at a small cost of complexity and achieve effective attacks in black-box scenarios.
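
    The two enhancement branches can be sketched roughly as below. The abstract does not describe the probability model or the branch parameters, so the independent probabilities `p_cp` and `p_cc`, the crop range, and the hue shift are assumptions; images are represented as nested lists of RGB tuples in [0, 1] to keep the example dependency-free.

```python
import colorsys
import random


def crop_and_pad(img, rng, max_crop=2):
    """Crop a random top/left margin, then pad with black back to the original size."""
    width = len(img[0])
    top = rng.randint(0, max_crop)
    left = rng.randint(0, max_crop)
    black = (0.0, 0.0, 0.0)
    cropped = [row[left:] for row in img[top:]]
    padded = [row + [black] * left for row in cropped]
    padded += [[black] * width for _ in range(top)]
    return padded


def color_change(img, rng, max_shift=0.1):
    """Shift the hue of every pixel by one random offset in HSV space."""
    shift = rng.uniform(-max_shift, max_shift)
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            new_row.append(colorsys.hsv_to_rgb((h + shift) % 1.0, s, v))
        out.append(new_row)
    return out


def augment(img, rng, p_cp=0.5, p_cc=0.5):
    """Apply each branch with its own probability (assumed, independent draws)."""
    if rng.random() < p_cp:
        img = crop_and_pad(img, rng)
    if rng.random() < p_cc:
        img = color_change(img, rng)
    return img


rng = random.Random(0)
image = [[(0.2, 0.4, 0.6)] * 4 for _ in range(4)]
augmented = augment(image, rng, p_cp=1.0, p_cc=1.0)
```

    In an iterative attack, a function like `augment` would be applied to the input before each gradient computation, so that the perturbation does not overfit to one view of the image.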

    Data tamper-proof batch auditing scheme based on industrial cloud storage systems
    Xiaojun ZHANG, Yunpu HAO, Lei LI, Chenyang LI, Ziyu ZHOU
    2025, 45(3):  891-895.  DOI: 10.11772/j.issn.1001-9081.2024030349

    To address the issue of active network attacks such as tampering with industrial cloud storage system data, to achieve the goal of secure sharing of industrial data in cloud storage, and to ensure the confidentiality, integrity, and availability of industrial data transmission and storage processes, a data tamper-proof batch auditing scheme based on industrial cloud storage systems was proposed. In this scheme, a homomorphic digital signature algorithm based on bilinear pairing mapping was proposed, enabling a third-party auditor to achieve batch tamper-proof integrity detection of industrial cloud storage system data and to feed back the tamper-proof integrity auditing results to engineering service end users in a timely manner. Besides, the computational burden on engineering service end users was reduced by adding auditors, while ensuring the integrity of industrial encrypted data during transmission and storage processes. Security analysis and performance comparison results demonstrate that, through the design of tamper-proof detection vectors, the proposed scheme reduces the third-party auditor’s computational cost significantly, from O(n) bilinear pairing operations to a constant O(1) number of bilinear pairing operations. It can be seen that the proposed scheme is suitable for lightweight batch auditing scenarios that require tamper-proof detection of a large number of core data files of industrial cloud storage systems.

    Image watermarking algorithm based on improved singular value decomposition and Haar wavelet transform
    Hailin XIAO, Xiangting KONG, Yu WANG, Di ZHOU, Xiaoming DAI
    2025, 45(3):  896-903.  DOI: 10.11772/j.issn.1001-9081.2024030304

    To improve the limited robustness and transparency of traditional watermarking algorithms facing different kinds of attacks, an image watermarking algorithm based on improved Singular Value Decomposition (SVD) and two-dimensional discrete Haar wavelet transform was proposed. Firstly, a maximum segmentation Arnold transform was utilized to scramble the watermark image in order to ensure a uniform energy distribution in the image, thereby enhancing the stability and anti-attack ability of the watermark and making the watermark robust against potential threats. Secondly, Haar wavelet transform was introduced to perform multi-scale image analysis for strengthening the encryption process, and an improved economical SVD method was presented to further improve the security and stability of the algorithm. Finally, the image watermark was restored and generated through inverse transformation. The proposed algorithm was reversible and easily operable, which ensured the visual quality of the image. Numerical simulation results show that all of the Peak Signal-to-Noise Ratio (PSNR) and Structural SIMilarity (SSIM) values of the 5 classic host images without attacks are over 42.4481 dB and 0.9994, respectively, representing a good degree of transparency. The Normalized Correlation coefficient (NC) values of the proposed algorithm exceed 0.99 when the algorithm faces different attacks such as Gaussian noise, salt-and-pepper noise, and JPEG compression, demonstrating that the proposed algorithm outperforms the image watermarking algorithms Discrete Wavelet Transform + SVD (DWT+SVD), Integer Wavelet Transform + Heisenberg Matrix Decomposition (HMD) + SVD (IWT+HMD+SVD), and Integer Wavelet Transform + SVD (IWT+SVD). Even in the face of other attacks such as sharpening, motion blur, and speckle noise, the NC values of the proposed algorithm remain above 0.968 under the same conditions, verifying the robustness and transparency of the proposed algorithm in resisting various attacks.

    Active defense against face forgery based on attention mask and feature extraction
    Yu WANG, Xianjin FANG, Gaoming YANG, Yifeng DING, Xinlu YANG
    2025, 45(3):  904-910.  DOI: 10.11772/j.issn.1001-9081.2024030364

    To address the issue of unauthorized forgery or tampering of facial images, an active defense method based on attention mask and feature extraction was proposed. This method was designed to interfere with forgery models actively by adding adversarial perturbations to the image, so that forgery of the image was prevented at the source while the visual quality of the protected image was enhanced. Firstly, an improved gradient descent method was employed to generate adversarial perturbations and add them to the original image, so that forgery processing of the protected image produced a blurred fake image. At the same time, the attention mask was incorporated into the generator to enhance key feature channels, thereby reducing the influence of complex backgrounds and lighting. Additionally, the VGG16 pre-trained network was utilized to extract image features, thereby improving the visual quality of adversarial images at feature map level. Experimental results on CelebFaces Attributes (CelebA) dataset and Radboud Faces Database (RaFD) dataset show that, for StarGAN, the defense success rates of the proposed model are 99.80% and 99.63%, respectively. Compared with the baseline method based on spread-spectrum adversarial attack, the proposed method improves the Structure Similarity Index Measure (SSIM) of the generated adversarial images by 30.86% and 26.63%, respectively, and the Peak Signal-to-Noise Ratio (PSNR) by 34.80% and 36.15%, respectively. The above indicates that the proposed method defends against face image forgery effectively while enhancing the visual quality of adversarial images.

    Advanced computing
    Physics-informed neural network based on Lobatto method and Legendre polynomials for solving differential-algebraic equations
    Shuai LAI, Juan TANG, Kun LIANG, Jiasheng CHEN
    2025, 45(3):  911-919.  DOI: 10.11772/j.issn.1001-9081.2024030313

    Current neural network methods for solving Differential-Algebraic Equations (DAEs) basically adopt data-driven strategies and require a large number of datasets. As a result, there are problems such as sensitivity to the structure and parameter selection of neural networks, low solution accuracy, and poor stability. In response to these issues, a Physics-Informed Neural Network based on Lobatto method and Legendre polynomials (LL-PINN) was proposed. Firstly, based on the discrete Physics-Informed Neural Network (PINN) computing framework, combined with the advantages of high accuracy and high stability of Lobatto IIIA method for solving DAEs, the physical information of DAEs was embedded in the Lobatto IIIA time iteration format, and PINN was used to solve the approximate numerical value of this time iteration. Secondly, a neural network structure with a single hidden layer was utilized, and Legendre polynomials were applied as activation functions by virtue of their approximation capability, which simplified the process of adjusting the network model. Finally, a time domain decomposition scheme was employed to construct the network model, in which a differential neural network and an algebraic neural network were used for each equally divided sub-time domain one by one, enabling high-precision continuous-time prediction of DAEs. Results of numerical examples demonstrate that the LL-PINN based on Legendre polynomials and the 4th-order Lobatto method achieves high-precision solutions for DAEs. Compared to the Theory of Functional Connections (TFC) trial solution method and PINN model, LL-PINN significantly reduces the absolute error between the predicted and exact solutions of differential variables and algebraic variables, and improves accuracy by one or two orders of magnitude. Therefore, the proposed solution model exhibits good computational accuracy for solving DAE problems, providing a feasible solution for challenging partial DAEs.
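
    The use of Legendre polynomials as activation functions in a single hidden layer can be sketched as follows. This is only an assumed minimal form, in which hidden unit n outputs the Legendre polynomial P_n of the scalar input and the output is a weighted sum; the names `legendre_basis` and `legendre_network` are hypothetical, and the Lobatto IIIA time stepping and training loop are not shown.

```python
def legendre_basis(x, degree):
    """Return [P_0(x), ..., P_degree(x)] via Bonnet's recurrence.

    (n + 1) * P_{n+1}(x) = (2n + 1) * x * P_n(x) - n * P_{n-1}(x)
    """
    values = [1.0, x]
    for n in range(1, degree):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: degree + 1]


def legendre_network(x, weights):
    """Single hidden layer: unit n applies P_n to the input; output is a weighted sum."""
    basis = legendre_basis(x, len(weights) - 1)
    return sum(w * p for w, p in zip(weights, basis))


# Inputs are assumed to be rescaled to [-1, 1], where Legendre polynomials are orthogonal.
p2_at_half = legendre_network(0.5, [0.0, 0.0, 1.0])
p3_at_half = legendre_network(0.5, [0.0, 0.0, 0.0, 1.0])
```

    With fixed polynomial activations, only the output weights and the number of hidden units need tuning, which is one plausible reading of how the design simplifies network adjustment.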

    Adaptive extended RRT* path planning algorithm based on node-to-obstacle distance
    Caiqi WANG, Xining CUI, Yi XIONG, Shiqian WU
    2025, 45(3):  920-927.  DOI: 10.11772/j.issn.1001-9081.2024030400

    Rapidly-exploring Random Tree star (RRT*) is widely used in the robot path planning field owing to its asymptotic optimality and probabilistic completeness. However, RRT* and its improved algorithms still suffer from several limitations such as poor initial path quality, slow path convergence, and low search efficiency. In response to these challenges, an adaptive extended RRT* algorithm based on node-to-obstacle distance, namely AE-RRT*, was proposed. To improve the search efficiency, a dynamic goal-biased sampling strategy and a dynamic step size strategy based on the node-to-obstacle distance were adopted. Furthermore, to improve the path quality, a more accurate parent node choice method MA-ChooseParent was proposed, which broadened the set of potential parent nodes. In addition, to speed up path convergence, an adaptive Gaussian sampling method and a global Gaussian sampling method AG-Gaussian Sample based on the node-to-obstacle distance were adopted. Through simulation in Matlab, AE-RRT* was compared with RRT*, Quick-RRT*, Bi-RRT*, Informed-RRT*, and Smart-RRT*. Experimental results demonstrate that compared to RRT*, AE-RRT* achieves reductions of 63.78%, 6.55%, and 71.93%, respectively, in the time taken to find the initial path, the length of the initial path, and the time to converge to a global sub-optimal path in 2D environments. In 3D environments, AE-RRT* achieves reductions of 59.44%, 18.26%, and 79.58%, respectively, in the three indicators.
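
    A distance-driven sampling and step size strategy of this kind might look like the sketch below. The abstract does not give the actual formulas, so the linear scaling, the clamping limits, the circular obstacle model, and every function name here are assumptions chosen only to show the general idea: larger steps and a stronger goal bias when a node is far from obstacles.

```python
import math
import random


def nearest_obstacle_distance(point, obstacles):
    """Distance from a 2D point to the closest circular obstacle (cx, cy, radius)."""
    return max(0.0, min(math.hypot(point[0] - cx, point[1] - cy) - r
                        for cx, cy, r in obstacles))


def dynamic_step(d_obs, step_min=0.5, step_max=3.0, gain=0.5):
    """Step size grows with clearance and is clamped to [step_min, step_max]."""
    return max(step_min, min(step_max, gain * d_obs))


def goal_bias(d_obs, p_min=0.05, p_max=0.5, d_ref=10.0):
    """Probability of sampling the goal directly, higher in open space."""
    return p_min + (p_max - p_min) * min(1.0, d_obs / d_ref)


def sample(node, goal, obstacles, bounds, rng):
    """Return the goal with a clearance-dependent probability, else a uniform sample."""
    d_obs = nearest_obstacle_distance(node, obstacles)
    if rng.random() < goal_bias(d_obs):
        return goal
    (x_min, x_max), (y_min, y_max) = bounds
    return (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))


obstacles = [(3.0, 4.0, 1.0)]
clearance = nearest_obstacle_distance((0.0, 0.0), obstacles)
step = dynamic_step(clearance)
target = sample((0.0, 0.0), (9.0, 9.0), obstacles, ((0.0, 10.0), (0.0, 10.0)), random.Random(1))
```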

    Dynamic UAV path planning based on modified whale optimization algorithm
    Xingwang WANG, Qingyang ZHANG, Shouyong JIANG, Yongquan DONG
    2025, 45(3):  928-936.  DOI: 10.11772/j.issn.1001-9081.2024030370

    A dynamic Unmanned Aerial Vehicle (UAV) path planning method based on Modified Whale Optimization Algorithm (MWOA) was proposed for the problem of UAV path planning in environments with complex terrains. Firstly, by analyzing the mountain terrain, dynamic targets, and threat zones, a three-dimensional dynamic environment and a UAV route model were established. Secondly, an adaptive step size Gaussian walk strategy was proposed to balance the algorithm’s abilities of global exploration and local exploitation. Finally, a supplementary correction strategy was proposed to correct the optimal individual in the population and, combined with a differential evolution strategy, prevented the population from falling into local optima while improving the convergence accuracy of the algorithm. To verify the effectiveness of MWOA, MWOA and intelligent algorithms such as Whale Optimization Algorithm (WOA) and Artificial Hummingbird Algorithm (AHA) were used to solve the CEC2022 test functions and were validated in the designed UAV dynamic environment model. The comparative analysis of simulation results shows that compared with the traditional WOA, MWOA improves the convergence accuracy by 6.1% and reduces the standard deviation by 44.7%. The above proves that the proposed MWOA has faster convergence and higher accuracy, and can handle UAV path planning problems effectively.

    Multi-strategy improved Aquila optimizer and its application in path planning
    Suqian WU, Jianguo YAN, Bin YANG, Tao QIN, Ying LIU, Jing YANG
    2025, 45(3):  937-945.  DOI: 10.11772/j.issn.1001-9081.2024020242

    Aiming at the shortcomings of the original Aquila Optimizer (AO), such as insufficient local exploitation ability, low optimization accuracy, and slow convergence speed, a Multi-Strategy Improved AO (MSIAO) for robot path planning was proposed. Firstly, the Sobol sequence was introduced to initialize the Aquila population, which increased the diversity of the initial population and improved the convergence speed. Secondly, the local search method was improved by using the golden sine operator and the ideas of self-learning and social learning from particle swarm optimization, which enhanced the exploitation ability of the algorithm and reduced the possibility of falling into local optima. Meanwhile, a non-linear balance factor was used as the switching condition between the two stages, which improved communication among the populations and balanced global exploration and local exploitation more effectively. Finally, multiple experiments were carried out. Simulations on 12 benchmark functions and 10 CEC2017 complex functions show that the proposed improvement strategies enhance the global optimization ability of MSIAO greatly. Results of applying MSIAO to robot path planning show that MSIAO can obtain shorter and more reliable moving paths. In a 20×20 grid map, the average path of MSIAO is shortened by 2.53%, 3.83%, and 6.70% compared to those of Particle Swarm Optimization (PSO) algorithm, the original AO, and Butterfly Optimization Algorithm (BOA), respectively; and in a 40×40 grid map, the average path of MSIAO is shortened by 10.65%, 5.27%, and 14.88% compared to those of the above three algorithms, respectively, verifying that the path-finding of MSIAO is more efficient.

    Multimedia computing and computer simulation
    LiDAR-camera 3D object detection based on multi-modal information mutual guidance and supplementation
    Chuanhao ZHANG, Xiaohan TU, Xuehui GU, Bo XUAN
    2025, 45(3):  946-952.  DOI: 10.11772/j.issn.1001-9081.2024030290

    Multi-modal 3D object detection is an important task in computer vision, and how to better fuse information among different modalities is always a research focus of this task. Previous methods lack information filtering when fusing the information of different modalities, and excessive irrelevant and interfering information may lead to a decline in model performance. To address the above issues, a LiDAR-camera 3D object detection model based on multi-modal information mutual guidance and supplementation was proposed, which selected information from the other modality adaptively when fusing features. Adaptive information fusion includes data-level and feature-level mutual guidance and supplementation. In data-level fusion, depth maps generated by point clouds and segmentation masks generated by images were used as input to construct instance-level depth maps and instance-level 3D virtual points, respectively, for supplementing images and point clouds. In feature-level fusion, voxel features generated by point clouds and feature maps generated by images were used as input, key regions were selected from the other modality for the features to be fused, and feature fusion was conducted through an attention mechanism. Experimental results show that the proposed model achieves good results on nuScenes test set. Compared to traditional unguided fusion models such as BEVFusion and TransFusion, the proposed model has the two mainstream evaluation indexes, mean Average Precision (mAP) and nuScenes Detection Score (NDS), improved by 0.9-28.9 percentage points and 0.6-26.1 percentage points, respectively. The above verifies that the proposed model can improve the accuracy of multi-modal 3D object detection effectively.

    Scene graph generation method based on association information enhancement and relationship balance
    Linhao LI, Dong HAN, Yongfeng DONG, Yingshuang LI, Zhen WANG
    2025, 45(3):  953-962.  DOI: 10.11772/j.issn.1001-9081.2024010135

    Utilizing contextual information of scene graphs can help models understand the correlation effect among targets. However, a large number of unrelated targets may introduce additional noise, affecting information interaction and causing prediction biases. In noisy and diverse scenes, even a few simple associated targets are sufficient to infer environmental information of the target and eliminate ambiguity information of other targets. In addition, Scene Graph Generation (SGG) faces challenges when dealing with long-tailed biased data in real-world scenarios. To address the problems of contextual information optimization and prediction biases, an association Information Enhancement and Relationship Balance based SGG (IERB) method was proposed. In IERB method, a secondary reasoning structure was employed according to biased scene graph prediction results, to reconstruct association information under different prediction angles of view and balance the prediction biases. Firstly, strongly correlated targets from different angles of view were focused on to construct the contextual association information. Secondly, the prediction capability for tail relationships was enhanced using a balancing strategy of tree structure. Finally, a prediction-guided approach was used to optimize predictions based on the existing scene graph. Experimental results on Visual Genome dataset show that compared with three baseline models Visual Translation Embedding network (VTransE), Motif, and Visual Context Tree (VCTree), the proposed method improves the mean Recall mR@100 in the Predicate Classification (PredCls) task by 11.66, 13.77 and 13.62 percentage points, respectively, demonstrating the effectiveness of the proposed method.

    Weakly supervised action localization based on temporal and global contextual feature enhancement
    Weichao DANG, Yinghao FAN, Gaimei GAO, Chunxia LIU
    2025, 45(3):  963-971.  DOI: 10.11772/j.issn.1001-9081.2024040443

    In view of the inaccuracy of action classification and localization caused by the independent processing of video clips as single action instances in the existing weakly supervised action localization studies, a weakly supervised action localization method that integrates temporal and global contextual feature enhancement was proposed. Firstly, the temporal feature enhancement branch was constructed to enlarge the receptive field by using dilated convolution, and the attention mechanism was introduced to capture the temporal dependency between video clips. Secondly, an EM (Expectation-Maximization) algorithm based on Gaussian Mixture Model (GMM) was designed to capture video context information. At the same time, global contextual feature enhancement was performed by using binary walk propagation. As the result, high-quality Temporal Class Activation Maps (TCAMs) were generated as pseudo labels to supervise the temporal enhancement branch online. Thirdly, the momentum update network was used to obtain a cross-video dictionary that reflects the action features between videos. Finally, cross-video contrastive learning was used to improve the accuracy of action classification. Experimental results show that the proposed method has the mean Average Precision (mAP) of 42.0% and 42.2% on THUMOS’14 and ActivityNet v1.3 datasets when the Intersection-over-Union (IoU) is 0.5, and compared with CCKEE (Cross-video Contextual Knowledge Exploration and Exploitation), the proposed method has the mAP improved by 2.6 and 0.6 percentage points, respectively, proving the effectiveness of the proposed method.

    Lightweight pose estimation network based on non-globally dependent integral regression
    Benjie SHE, Shuzhi SU, Yanmin ZHU, Jian HUA, Chao WANG
    2025, 45(3):  972-977.  DOI: 10.11772/j.issn.1001-9081.2024030369

    Human pose estimation networks based on heatmap detection have achieved significant success. However, methods based on heatmap detection have a large number of parameters due to redundant computations, quantization errors, and the requirement of heatmap decoding. To address these issues, a Lightweight pose estimation Network based on Non-globally dependent Integral Regression (Lite-NIRNet) was designed. Partial Convolution (PConv) was employed to reduce redundant computations, which made the network more lightweight. To compensate for the information loss caused by PConv, a Coordinate Attention (CA) mechanism was introduced to fuse inter-channel features, thereby enhancing network performance. Additionally, a Non-globally dependent Integral Regression (NIR) module was designed to incorporate coordinate supervision into the network, which reduced the influence of quantization errors on network performance. The proposed NIR effectively reduced the bias produced by traditional integral regression during expectation calculation, balancing better learning gradients with lower bias. Experimental results show that, compared with the advanced High-Resolution Network (HRNet), Lite-NIRNet reduces the number of parameters and the computational complexity by 73.0% and 63.4%, respectively, on the COCO validation set, and achieves a mean Average Precision (mAP) of 72.8% without additional heatmap decoding. Furthermore, on the MPII validation set, Lite-NIRNet also achieves a good balance between network performance and complexity.
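
    For readers unfamiliar with integral regression, the sketch below shows the traditional form (often called soft-argmax) in one dimension: the coordinate is the expectation of the index under a softmax over heatmap scores. It is not the NIR module from the paper, whose details the abstract does not give; the function name and the toy scores are hypothetical.

```python
import math


def soft_argmax_1d(scores, beta=1.0):
    """Expected index under softmax(beta * scores); differentiable, sub-pixel."""
    peak = max(scores)
    weights = [math.exp(beta * (s - peak)) for s in scores]  # stable softmax
    total = sum(weights)
    return sum(i * w / total for i, w in enumerate(weights))


# Two equal peaks at indices 1 and 2 give a coordinate between grid cells,
# which hard argmax cannot represent (no quantization to the grid).
sub_pixel = soft_argmax_1d([0.0, 5.0, 5.0, 0.0])

# A single peak at index 7 on a 9-cell map: the small probability mass on all
# other cells pulls the expectation toward the map center (index 4).
biased = soft_argmax_1d([0, 0, 0, 0, 0, 0, 0, 10, 0])
weak_peak = soft_argmax_1d([0, 0, 0, 0, 0, 0, 0, 3, 0])
```

    The expectation is taken over every cell of the map, so background cells always contribute. That global dependence is the source of the bias in the last two examples, and the weaker the peak, the larger the pull toward the center.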

    Low-dose CT image reconstruction based on low-rank and total variation joint regularization
    Yu LIU, Pengcheng ZHANG, Liyuan ZHANG, Yi LIU, Zhiguo GUI, Xueyi ZHANG, Chenyifei ZHU, Haowei TANG
    2025, 45(3):  978-987.  DOI: 10.11772/j.issn.1001-9081.2024040478

    To address the image over-smoothing and block effects that the Total Variation (TV) minimization method easily causes in Low-Dose Computed Tomography (LDCT) image reconstruction, an LDCT image reconstruction method based on low-rank and TV joint regularization was proposed to improve the visual quality of LDCT reconstructed images. Firstly, an image reconstruction model based on low-rank and TV joint regularization was established, so that, in theory, more accurate and natural reconstruction results could be obtained. Secondly, a low-rank prior with the non-local self-similarity property was introduced to overcome the limitations of using the TV minimization method alone. Finally, the Chambolle-Pock (CP) algorithm was used to solve the model, which improved solution efficiency and ensured that the model was solved effectively. The effectiveness of the proposed method was verified under three different LDCT scanning conditions. Experimental results on the Mayo dataset show that, compared with the PWLS-LDMM (Penalized Weighted Least-Squares based on Low-Dimensional Manifold) method, the NOWNUNM (NOnlocal Weighted NUclear Norm Minimization) method and the CP method, the proposed method increases the Visual Information Fidelity (VIF) by 28.39%, 8.30% and 2.93%, respectively, at 25% dose; by 29.96%, 13.83% and 4.53%, respectively, at 15% dose; and by 30.22%, 17.10% and 7.66%, respectively, at 10% dose. It can be seen that the proposed method retains more detailed texture information while removing noise and stripe artifacts, which verifies that it has better noise and artifact suppression capability.
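
    As background for the TV term, the following sketch computes the anisotropic total variation of a small image, that is, the sum of absolute differences between horizontally and vertically adjacent pixels. It covers only the TV penalty, not the low-rank term or the CP solver, and the tiny images are made-up examples rather than CT data.

```python
def total_variation(image):
    """Anisotropic TV: sum of absolute differences between adjacent pixels."""
    rows, cols = len(image), len(image[0])
    tv = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal neighbor
                tv += abs(image[r][c + 1] - image[r][c])
            if r + 1 < rows:  # vertical neighbor
                tv += abs(image[r + 1][c] - image[r][c])
    return tv


flat = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
edge = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]       # clean vertical edge
noisy = [[0, 0, 1], [0, 1, 1], [0, 0, 1]]      # same edge plus one outlier pixel

tv_flat = total_variation(flat)
tv_edge = total_variation(edge)
tv_noisy = total_variation(noisy)
```

    Minimizing TV removes the outlier because it raises the penalty from 3 to 5, but the same pressure also favors piecewise-constant regions, which is the over-smoothing and block effect that the joint low-rank regularization is meant to counteract.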

    Medical image segmentation network integrating multi-scale semantics and parallel double-branch
    Baohua YUAN, Jialu CHEN, Huan WANG
    2025, 45(3):  988-995.  DOI: 10.11772/j.issn.1001-9081.2024030358

    In medical image segmentation networks, a Convolutional Neural Network (CNN) can extract rich local feature details but captures long-range information insufficiently, whereas a Transformer can capture long-range global feature dependencies but destroys local feature details. To make full use of the complementary characteristics of the two networks, a parallel fusion network of CNN and Transformer for medical image segmentation, named PFNet, was proposed. In the parallel fusion module of this network, a pair of interdependent parallel branches based on CNN and Transformer was used to learn both local and global discriminative features efficiently and to fuse local features and long-distance feature dependencies interactively. At the same time, to recover the spatial information lost during downsampling and enhance detail retention, a Multi-Scale Interaction (MSI) module was proposed to extract the local context of the multi-scale features generated by the hierarchical CNN branches for long-range dependency modeling. Experimental results show that PFNet outperforms other advanced methods such as MISSFormer (Medical Image Segmentation tranSFormer) and UCTransNet (U-Net with Channel Transformer module). On the Synapse and ACDC (Automated Cardiac Diagnosis Challenge) datasets, compared with the optimal baseline method MISSFormer, PFNet increases the average Dice Similarity Coefficient (DSC) by 1.27% and 0.81%, respectively. It can be seen that PFNet achieves more accurate medical image segmentation.
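
    The Dice Similarity Coefficient used in the evaluation measures the overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal sketch for binary masks follows; the function name and the toy masks are hypothetical, and returning 1.0 for two empty masks is a convention assumed here.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for binary 2D masks."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    intersection = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    return 1.0 if total == 0 else 2.0 * intersection / total


pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

dsc = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

    For multi-organ datasets the coefficient is usually computed per class and then averaged, which is what an "average DSC" refers to.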

    Coordinate enhancement and multi-source sampling for brain tumor image segmentation
    Zhanjun JIANG, Yang LI, Jing LIAN, Xinfa MIAO
    2025, 45(3):  996-1002.  DOI: 10.11772/j.issn.1001-9081.2024030359

    To address the insufficient focus on tumor regions and the loss of spatial contextual information in brain tumor image segmentation models, both of which reduce segmentation accuracy, a TransUNet-based brain tumor segmentation network integrating a Coordinate Enhanced Learning (CEL) mechanism and multi-source sampling was proposed. Firstly, the CEL mechanism was proposed and combined with ResNetv2 as the shallow feature extraction network of the model, so as to enhance attention to brain tumor regions. Secondly, a deep blended sampling feature extractor was designed, in which deformable attention and self-attention mechanisms were used to perform multi-source sampling on both global and local information of brain tumors. Finally, an Interactive Level Fusion (ILF) module was designed between the encoder and the decoder, thereby realizing interaction between deep and shallow feature information while minimizing parameter computational cost. Experimental results on the BraTS2018 and BraTS2019 datasets indicate that, compared with the benchmark TransUNet, the proposed model improves the mean Dice coefficient (mDice), the mean Intersection over Union (mIoU), the mean Average Precision (mAP) and the mean Recall (mRecall) by 4.84, 7.21, 3.83 and 3.15 percentage points, respectively, and reduces the model size by 16.9 MB.

    Frontier and comprehensive applications
    Survey of research status and development of runtime assurance technology
    Lei DONG, Qi WANG, Xi CHEN, Jiachen LIU
    2025, 45(3):  1003-1015.  DOI: 10.11772/j.issn.1001-9081.2024030318

    Advanced technologies such as Artificial Intelligence (AI), big data, and cloud computing are developing rapidly, but they are difficult to explain and certify, and these and other issues limit their practical application in various industries. RunTime Assurance (RTA) technology monitors the system state and switches between functions accordingly, turning the “complex” into the “simple”. It thereby provides a preliminary solution for complex system behaviors that are difficult to predict and explain, unsafe, or produce unexplained results, and it has broad prospects for future development. Therefore, a review was conducted on the current research status and development of RTA to offer researchers insights into the latest research trends and development directions of RTA technology. Firstly, the development history of RTA technology was reviewed. Then, on the basis of describing the basic architecture and the switching logic of RTA, the current application research status of RTA in the fields of intelligent aviation, Unmanned Aerial Vehicle (UAV), intelligent aerospace, and automated vehicle driving, as well as in Cyber-Physical System (CPS) and safe reinforcement learning, was sorted out systematically. Finally, the development prospects of RTA technology were discussed.
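
    The switching logic mentioned above can be pictured with a toy altitude-hold example: a monitor checks the state at every step and hands control from a complex, unverified function to a simple, verifiable backup whenever the next step could cross a safety boundary. Everything in this sketch (the constants, the two controllers, and the monitor rule) is a hypothetical illustration and is not taken from the surveyed systems.

```python
FLOOR = 10.0        # hypothetical safety boundary: minimum allowed altitude
MAX_DESCENT = 5.0   # assumed worst-case altitude loss per step under the advanced function


def advanced_controller(altitude):
    """Stand-in for a complex, unverified function (here it always descends)."""
    return -MAX_DESCENT


def backup_controller(altitude):
    """Simple, verifiable recovery behavior: always climb."""
    return 2.0


def monitor(altitude):
    """True if one more worst-case step of the advanced function stays safe."""
    return altitude - MAX_DESCENT >= FLOOR


def rta_step(altitude):
    """Select the active function from the monitored state, then apply it."""
    if monitor(altitude):
        mode, command = "advanced", advanced_controller(altitude)
    else:
        mode, command = "backup", backup_controller(altitude)
    return mode, altitude + command


def simulate(altitude, steps):
    trace = []
    for _ in range(steps):
        mode, altitude = rta_step(altitude)
        trace.append((mode, altitude))
    return trace


trace = simulate(30.0, 10)
```

    In this run the altitude never drops below the floor even though the advanced function would keep descending on its own. Only the small monitor and the backup need to be verified, which is the sense in which RTA turns the “complex” into the “simple”.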

    Roadside traffic object detection model and deployment for vehicle-road collaboration
    Quan WANG, Xinyu CAO, Qidong CHEN
    2025, 45(3):  1016-1024.  DOI: 10.11772/j.issn.1001-9081.2024040424

    Vehicle-road collaboration aims to achieve intelligent and efficient traffic management through information exchange and collaboration, in which accurate, lightweight, and easily deployable vehicle and pedestrian detection from the roadside perspective is crucial. To this end, a lightweight traffic object detection model based on improved YOLOv8 was proposed. Firstly, the FasterBlock from FasterNet was introduced to replace certain bottleneck components in the original C2f, which effectively reduced the Giga FLOating-Point operations (GFLOPs) and the number of parameters, and thus the overall model complexity. Secondly, GSConv (Group Shuffle Convolution), which balances speed and precision, was adopted in the neck network of the model to replace the original convolution, and the SlimNeck feature fusion module was introduced, enabling each feature layer to consider the semantic information of deep features and the details of shallow features simultaneously. Thirdly, MPDIoU (Minimum Point Distance based Intersection over Union) was used to replace the original loss function, so as to improve the bounding box regression performance of the model. Finally, channel pruning was performed to remove redundant connections in the network, thereby reducing the model size and improving the detection speed. Experimental results show that, compared with the original YOLOv8s, the improved and pruned model increases the precision by 1.0 percentage points and the mean Average Precision (mAP) by 1.2 percentage points, and reduces the computational cost and the number of parameters by 70.1% and 69.4%, respectively. On the edge device Atlas 200I DK A2 (computing power 4 TOPS, power consumption 9 W), the proposed model reaches a detection speed of 58.03 frame/s.
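
    To make the MPDIoU term concrete, the sketch below follows the commonly cited formulation, which is an assumption here because the abstract does not restate it: the IoU is reduced by the squared distances between the two boxes' top-left corners and between their bottom-right corners, each normalized by the squared diagonal of the input image. Boxes are (x1, y1, x2, y2) tuples and all function names are hypothetical.

```python
def iou(box_a, box_b):
    """Plain intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mpdiou(pred, gt, img_w, img_h):
    """IoU minus normalized squared corner distances (assumed formulation)."""
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corners
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corners
    norm = img_w ** 2 + img_h ** 2
    return iou(pred, gt) - d1_sq / norm - d2_sq / norm


def mpdiou_loss(pred, gt, img_w, img_h):
    return 1.0 - mpdiou(pred, gt, img_w, img_h)


pred = (0.0, 0.0, 10.0, 10.0)
loss_same = mpdiou_loss(pred, pred, 100, 100)
loss_near = mpdiou_loss(pred, (20.0, 20.0, 30.0, 30.0), 100, 100)
loss_far = mpdiou_loss(pred, (50.0, 50.0, 60.0, 60.0), 100, 100)
```

    Both non-overlapping targets have an IoU of zero, yet the farther one yields the larger loss, so the corner-distance terms still give the regressor a useful signal where a plain IoU loss would be flat.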

2025 Vol.45 No.3

Honorary Editor-in-Chief: ZHANG Jingzhong
Editor-in-Chief: XU Zongben
Associate Editors: SHEN Hengtao, XIA Zhaohui
Domestic Post Distribution Code: 62-110
Foreign Distribution Code: M4616
Address: No. 9, 4th Section of South Renmin Road, Chengdu 610041, China
Tel: 028-85224283-803, 028-85222239-803
Website: www.joca.cn
E-mail: bjb@joca.cn