Journal of Computer Applications

Review of optimization methods for end-to-end speech-to-speech translation

Wei ZONG, Yue ZHAO, Yin LI, Xiaona XU

2025, 45(5): 1363-1371. DOI: 10.11772/j.issn.1001-9081.2024050666

Speech-to-Speech Translation (S2ST) is an emerging research direction in the intelligent speech field, aiming to translate spoken language from one language into another seamlessly. With increasing demand for cross-linguistic communication, S2ST has garnered significant attention, driving continuous research. Traditional cascaded models face numerous challenges in S2ST, including error propagation, inference latency, and the inability to translate languages without a writing system. To address these issues, achieving direct S2ST with end-to-end models has become a key research focus. Based on a comprehensive survey of end-to-end S2ST models, a detailed analysis and summary of these models was provided, the related technologies were reviewed, and the challenges were grouped into three categories: modeling burden, data scarcity, and real-world application, with a focus on how existing work has addressed each category. The extensive comprehension and generative capabilities of Large Language Models (LLMs) offer new possibilities for S2ST while also presenting additional challenges. Effective applications of LLMs in S2ST were also discussed, and potential future development directions were outlined.

Multi-source data representation learning model based on tensorized graph convolutional network and contrastive learning

Yufei LONG, Yuchen MOU, Ye LIU

2025, 45(5): 1372-1378. DOI: 10.11772/j.issn.1001-9081.2024071001

To address the issues of existing multi-source data representation learning models in processing large-scale, complex, and high-dimensional data, specifically their tendency to overlook high-order associations among different sources and their susceptibility to noise, a Multi-Source data representation learning model based on Tensorized Graph convolutional network and Contrastive learning, namely MS-TGC, was proposed. Firstly, the K-Nearest Neighbors (KNN) algorithm and Graph Convolutional Network (GCN) were used to unify multi-source data dimensions, forming tensorized multi-source data. Then, a defined tensor graph convolution operator was applied to perform high-dimensional graph convolution operations, enabling simultaneous learning of intra-source and inter-source information. Finally, a multi-source contrastive learning paradigm was constructed to enhance the accuracy of representation learning on noisy data and improve robustness against noise by incorporating contrastive constraints based on semantic consistency and label consistency. Experimental results show that when the labeled sample ratio is 0.3, MS-TGC achieves semi-supervised classification accuracy 1.36 and 5.53 percentage points higher than that of CONMF (Co-consensus Orthogonal Non-negative Matrix Factorization) on the BDGP and 20newsgroup datasets, respectively. These results indicate that MS-TGC effectively captures inter-source correlations, reduces noise interference, and achieves high-quality multi-source data representations.

Semi-supervised video object segmentation method based on spatio-temporal decoupling and regional robustness enhancement

Pengyu CHEN, Xiushan NIE, Nanjun LI, Tuo LI

2025, 45(5): 1379-1386. DOI: 10.11772/j.issn.1001-9081.2024060802

In response to issues faced by memory-based methods in semi-supervised Video Object Segmentation (VOS), such as object occlusion caused by inter-object interactions and interference from similar objects or background noise, a semi-supervised VOS method based on spatio-temporal decoupling and regional robustness enhancement was proposed. Firstly, a structural Transformer architecture was employed to eliminate feature information shared across all pixels, emphasizing the differences among pixels and thoroughly exploring the key features of objects in video frames. Secondly, the similarity between the current frame and the long-term memory frames was decoupled into two critical dimensions: spatio-temporal correlation and object importance. This decoupling allowed for a more precise analysis of pixel-level spatio-temporal and object features, thereby addressing the issue of object occlusion caused by inter-object interactions. Finally, a Regional Strip Attention (RSA) module was designed to enhance focus on the foreground region and suppress background noise by utilizing the object location information from long-term memory. Experimental results indicate that the proposed method outperforms the retrained AOT (Associating Objects with Transformers) model by 1.7 percentage points in J&F on the DAVIS 2017 validation set and by 1.6 percentage points in overall score on the YouTube-VOS 2019 validation set, indicating that the proposed method effectively addresses existing challenges in semi-supervised VOS.

Improvement method of heuristic vehicle routing algorithm based on constrained spectral clustering

Meng LUO, Chao GAO, Zhen WANG

2025, 45(5): 1387-1394. DOI: 10.11772/j.issn.1001-9081.2024060882

To address the poor initial solution quality of existing heuristic algorithms in solving large-scale Multi-Depot Vehicle Routing Problems (MDVRPs), an improvement method for heuristic vehicle routing algorithms based on Constrained Spectral Clustering (CSC) was proposed. Firstly, the geographical and demand feature matrices of delivery points were generated from the geographical locations and demand quantities of the customers to be served. Secondly, the CSC constraint matrix was generated from these feature matrices to perform clustering operations. Finally, the spectral clustering results were used to generate the initial solutions of the heuristic algorithms, and appropriate heuristic algorithms were selected to solve the Vehicle Routing Problems (VRPs). Experimental results on 21 benchmark instances demonstrate that compared with Self-Constrained Spectral Clustering (SCSC), CSC achieves 18.75% and 31.18% improvements in Normalized Mutual Information (NMI) and Fowlkes-Mallows Index (FMI), respectively. In vehicle routing tasks, the heuristic algorithm initialized with CSC obtains the shortest path on 16 of the 21 instances of different sizes while reducing runtime by 13.05% compared with SCSC-based initialization. Experimental results indicate that CSC can effectively improve the clustering accuracy of customer points, thereby improving both solving speed and solution quality for VRPs.
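
As a reading aid for the clustering metrics cited in this abstract, the following is a minimal sketch of the Fowlkes-Mallows Index in its standard pair-counting form. It is not the authors' evaluation code; the function name `fowlkes_mallows` and the toy label lists are hypothetical.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """Pair-counting Fowlkes-Mallows Index: TP / sqrt((TP + FP) * (TP + FN))."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    tp = fp = fn = 0
    # Examine every unordered pair of points once.
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both partitions
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the reference
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Hypothetical toy labels: four customer points, two reference clusters.
reference = [0, 0, 1, 1]
print(fowlkes_mallows(reference, [1, 1, 0, 0]))  # same grouping, renamed clusters
print(fowlkes_mallows(reference, [0, 0, 0, 1]))  # one point misassigned
```

The index depends only on which points are grouped together, so renaming the clusters does not change the score.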

Psychological counseling human-machine dialogue dataset construction for dialogue generation and mental disorder detection

Bo XU, Dezhi HAO, Erchen YU, Hongfei LIN, Linlin ZONG

2025, 45(5): 1395-1402. DOI: 10.11772/j.issn.1001-9081.2024050705

To address the lack of publicly available data for building effective dialogue models in psychological counseling human-machine dialogues, a psychological counseling dialogue dataset was constructed for dialogue generation and mental disorder detection. Firstly, a multi-round dialogue dataset containing 3,268 doctor-patient conversations was collected from an online medical consultation platform and enriched with comprehensive metadata including hospital affiliations, medical departments, disease categories, and patient self-descriptions. Secondly, a knowledge-enhanced dialogue model named Empathy Bidirectional and Auto-Regressive Transformers (EmBART) was proposed to enhance the empathic capabilities of the dialogue model. Finally, the usability of the dataset was evaluated experimentally through psychological response generation and mental disorder detection tasks. In psychological response generation, EmBART trained on this dataset performed excellently on all metrics in both automatic and human evaluations, with the perplexity reduced by 2.31 compared to the baseline model CDial-GPT (Chinese Dialogue Generative Pre-trained Transformer). In mental disorder detection, CPT (Chinese Pre-trained unbalanced Transformer) and RoBERTa (Robustly optimized Bidirectional Encoder Representations from Transformers approach) trained on this dataset demonstrated outstanding mental disorder prediction capabilities. Experimental results confirm the strong utility of this dataset in generating empathic dialogues and detecting mental disorders, providing a data foundation for future research on psychological counseling human-machine dialogues.

Estimation and classification of brain functional networks based on temporal correlation information fusion

Jun YANG, Mengxue PANG, Lishan QIAO

2025, 45(5): 1403-1409. DOI: 10.11772/j.issn.1001-9081.2024050684

Brain functional networks play a crucial role in the early diagnosis of neurological or encephaloid diseases, and the estimation of a high-quality brain functional network is one of the most critical challenges. Although numerous brain functional network estimation methods have been proposed, most of them focus only on the correlations among brain regions while ignoring potential dependencies among time points. Recent studies have found that encoding dependencies among time points can effectively improve the discriminative properties of brain functional networks; however, this approach relies only on the dependencies between adjacent time points and fails to utilize information from non-adjacent time points, thereby inadequately capturing the temporal characteristics of the brain functional network. To address this limitation, a new brain functional network estimation method was proposed, which introduced a similarity matrix to encode the dependencies among non-adjacent time points, aiming to improve the quality of the estimation. Additionally, an alternating optimization learning algorithm was designed to solve the model quickly. To evaluate the effectiveness of the proposed method, experiments were conducted on three public datasets, ADNI (Alzheimer's Disease Neuroimaging Initiative), ABIDE (Autism Brain Imaging Data Exchange), and REST-MDD (REST-meta-MDD Consortium), for mild cognitive impairment, autism, and depression, respectively. Experimental results demonstrate that the brain functional network estimated by the proposed method achieves superior classification performance.

Auxiliary diagnostic method for retinopathy based on dual-branch structure with knowledge distillation

Sijie NIU, Yuliang LIU

2025, 45(5): 1410-1414. DOI: 10.11772/j.issn.1001-9081.2024060856

When traditional models are used for the early diagnosis of retinopathy in high-risk patients with Diabetic Nephropathy (DN), the diagnostic accuracy is often compromised by the limited number and category imbalance of retinal images of diabetic patients. To address this issue, an auxiliary diagnostic method for retinopathy based on a dual-branch structure with knowledge distillation was proposed to improve the recognition capability for minority categories. Firstly, a teacher network pre-trained on large medical datasets was employed to guide the student network's learning process, transferring acquired knowledge to improve the student network's generalization ability and mitigate data scarcity. Secondly, a dual-branch structure was proposed in the student network. Branch 1 utilized a rebalancing strategy with the Focal Loss function to emphasize challenging samples by adjusting loss function weights, while Branch 2 employed a Category Attention Module (CAM) to learn discriminative features for each category, preventing model bias towards majority categories. These two branches promoted classifier learning and feature learning, respectively, to alleviate category imbalance. Experimental results on clinically collected retinal image data demonstrate that the proposed method improves accuracy and specificity by 1.05 and 1.53 percentage points, respectively, compared with the Lesion-aware Attention Model (LAM) in screening tasks involving 66 cases (89 eyes) of high-risk patients with DN. The proposed method improves the recognition accuracy of DN and realizes the auxiliary diagnosis of retinal diseases.
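
The rebalancing idea in Branch 1 can be illustrated with the standard focal loss formula, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). The sketch below shows only that general formula, not the authors' network or training code; the function name `focal_loss` and the values gamma = 2.0 and alpha = 1.0 are assumptions, since the abstract does not state them.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0, eps=1e-12):
    """Focal loss for one sample, given predicted class probabilities.

    probs  : list of class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter (assumed value; gamma = 0 gives cross-entropy)
    alpha  : class weighting factor (assumed value)
    """
    p_t = min(max(probs[target], eps), 1.0)  # clamp to avoid log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) samples.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical predictions: an easy sample (p_t = 0.9) and a hard one (p_t = 0.1).
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.1, 0.9], target=0)
print(easy, hard)
```

With gamma greater than zero, the easy sample contributes far less to the total loss than the hard one, which is how the loss shifts emphasis toward challenging samples.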

Concept set construction of reduced formal context and its recommendation application

Xin CHEN, Zhonghui LIU, Fan MIN

2025, 45(5): 1415-1423. DOI: 10.11772/j.issn.1001-9081.2024050743

In the field of Formal Concept Analysis (FCA), the proposal of concept sets meets the recommendation needs of real environments. However, current concept set generation methods lack effective means to avoid the inclusion of redundant attributes, which affects the quality and efficiency of concept generation to some extent, and ultimately the effectiveness of recommendations. To solve this problem, a Formal Context Attribute Reduction algorithm (FCAR), a Concept Set Construction Algorithm (CSCA), and a Recommendation Algorithm based on Concept Set (RACS) were proposed. Firstly, the attribute interest degree was designed based on the formal context and rating matrix, and formal context reduction was realized according to a threshold on the attribute interest degree. Secondly, by combining extent similarity and intent interest degree, the concept criticality was designed as heuristic information to generate the concept set. Finally, the recommendation matrix of the concept set was obtained using the recommendation confidence and recommendation threshold, enabling personalized recommendation for the target user. RACS was compared with algorithms including k-Nearest Neighbor (kNN), Item-Based Collaborative Filtering (IBCF), Group Recommendation based on Heuristic Concept set (GRHC), Concept Set based-Personalized Recommendation (CSPR), and GreConD-kNN on 11 datasets. In experiments on six standard datasets, RACS achieves the highest accuracy and the second highest recall on three datasets, and achieves the best F1 score on four datasets. In particular, on three larger-scale datasets, RACS improves recommendation time efficiency by at least eight times compared to formal concept recommendation algorithms. Experimental results validate the significant advantages of RACS in recommendation effects and efficiency.

Expert counter-evaluation model with three-way decision and entropy weight TOPSIS

Ying YU, Feng ZHU, Hongjian FU, Yiwen LUO, Jin QIAN, Yuchao ZHENG

2025, 45(5): 1424-1431. DOI: 10.11772/j.issn.1001-9081.2024060819

During the evaluation and review of scientific and technological projects, the quality of the evaluation conducted by experts significantly influences the accuracy and credibility of the final results. To ensure the fairness and objectivity of the evaluation results, it is necessary to conduct a counter-evaluation of the evaluation experts. By analyzing three aspects of evaluation experts, namely basic personal situation, professional level, and evaluation performance, an evaluation index system for counter-evaluation of experts was constructed. Based on this, an expert counter-evaluation model that combined three-way decisions and the entropy weight based Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) was proposed, which used three-way decision theory to solve the problem of weight distortion caused by the over-reliance of the entropy weight method on data. When the index weights were abnormal, the group of experts to be evaluated was divided into positive, negative, and boundary domains according to the threshold, and the expert-in, expert-out, and delayed evaluation strategies were adopted respectively. Once the index weights were normal or corrected, TOPSIS was used to rank the evaluation experts. Empirical analysis of the historical evaluation and review data of scientific and technological projects from an enterprise shows that the proposed model can integrate the empirical judgment of decision makers with the inherent information of the experts to be evaluated to realize the unity of subjectivity and objectivity, ensuring a scientific and fair evaluation of the evaluation experts and providing decision-making references for the construction of a high-quality expert database.
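
The ranking stage described in this abstract can be sketched with the textbook entropy weight method followed by TOPSIS. This is a minimal sketch under stated assumptions, not the authors' model: it omits the three-way decision step entirely, treats every index as a benefit criterion with positive values, and uses hypothetical function names and a hypothetical score matrix.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method: indices whose values vary more get larger weights."""
    n, m = len(matrix), len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for x in column:
            p = x / total
            if p > 0:  # 0 * log(0) is treated as 0
                entropy -= p * math.log(p)
        entropy /= math.log(n)  # scale entropy into [0, 1]
        divergences.append(max(0.0, 1.0 - entropy))
    spread = sum(divergences)
    if spread <= 1e-12:  # all indices constant: fall back to equal weights
        return [1.0 / m] * m
    return [d / spread for d in divergences]


def topsis_scores(matrix, weights):
    """Relative closeness to the ideal solution, assuming benefit criteria only."""
    m = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(m)]
    worst = [min(row[j] for row in weighted) for j in range(m)]
    scores = []
    for row in weighted:
        d_best = math.sqrt(sum((row[j] - best[j]) ** 2 for j in range(m)))
        d_worst = math.sqrt(sum((row[j] - worst[j]) ** 2 for j in range(m)))
        total = d_best + d_worst
        scores.append(d_worst / total if total > 0 else 0.5)
    return scores


# Hypothetical index scores for three experts on three indices.
experts = [[9.0, 8.0, 7.0], [5.0, 5.0, 5.0], [1.0, 2.0, 3.0]]
weights = entropy_weights(experts)
print(weights)
print(topsis_scores(experts, weights))
```

Because the weights come entirely from the spread of the data, an index with unusual variation can dominate the ranking, which is the kind of weight distortion the abstract says the three-way decision step is meant to handle.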

Multimodal sarcasm detection model integrating contrastive learning with sentiment analysis

Wenbin HU, Tianxiang CAI, Tianle HAN, Zhaoman ZHONG, Changxia MA

2025, 45(5): 1432-1438. DOI: 10.11772/j.issn.1001-9081.2024050731

Comments on social media platforms sometimes express attitudes towards events through sarcasm, and sarcasm detection enables more accurate analysis of user sentiments and opinions. However, traditional models based on vocabulary and syntactic structure ignore the role of text sentiment information in sarcasm detection and suffer from performance degradation due to data noise. To address these limitations, a Multimodal Sarcasm Detection model integrating Contrastive learning with Sentiment analysis (MSDCS) was proposed. Firstly, BERT (Bidirectional Encoder Representation from Transformers) was used to extract text features, and ViT (Vision Transformer) was used to extract image features. Then, the contrastive loss in contrastive learning was employed to train a shallow model, and the image and text features were aligned before fusion. Finally, the cross-modal features were combined with the sentiment features to make classification judgments, and the use of information between different modalities was maximized to achieve sarcasm detection. Experimental results on the open dataset of multimodal sarcasm detection show that the accuracy and F1 value of MSDCS are at least 1.85% and 1.99% higher than those of the baseline model Decomposition and Relation Network (D&R Net), verifying the effectiveness of using sentiment information and contrastive learning in multimodal sarcasm detection.
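
The abstract does not specify which contrastive loss is used to align image and text features. As one common possibility, the sketch below implements a symmetric InfoNCE-style loss over a batch of paired features; the choice of InfoNCE, the temperature of 0.1, the function names, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_feats, text_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss; pair i of images and texts is the positive pair."""
    images = [_normalize(v) for v in image_feats]
    texts = [_normalize(v) for v in text_feats]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


# Hypothetical 2-D features for a batch of two image-text pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)
```

The loss is small when each image feature is closest to its own text feature and large when the pairs are swapped, which is the sense in which the two modalities are aligned before fusion.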

Generative adversarial network underwater image enhancement model based on Swin Transformer

Hui LI, Bingzhi JIA, Chenxi WANG, Ziyu DONG, Jilong LI, Zhaoman ZHONG, Yanyan CHEN

2025, 45(5): 1439-1446. DOI: 10.11772/j.issn.1001-9081.2024050730

To address the problems of low contrast, heavy noise, and color deviation in underwater images, a new underwater image enhancement model with a Generative Adversarial Network (GAN) as its core framework was proposed, namely SwinGAN (GAN based on Swin Transformer). Firstly, the generative network was designed according to the encoder-bottleneck-decoder structure, where the input feature maps were divided into multiple non-overlapping local windows at the bottleneck layer. Secondly, a Dual-path Window Multi-head Self-Attention mechanism (DWMSA) was introduced to enhance local attention while simultaneously capturing global information and long-range dependencies. Finally, the decoder recombined the multiple windows into feature maps of the original size, and the discriminator network employed a Markov discriminator. Compared to the URSCT-SESR model, the SwinGAN model shows an improvement of 0.8372 dB in Peak Signal-to-Noise Ratio (PSNR) and 0.0036 in Structural SIMilarity index (SSIM) on the UFO-120 dataset. On the EUVP-515 dataset, the SwinGAN model achieves a more significant improvement, with a 0.8439 dB boost in PSNR, an increase of 0.0051 in SSIM, an enhancement of 0.1124 in Underwater Image Quality Measure (UIQM), and a slight increase of 0.0010 in Underwater Color Image Quality Evaluation (UCIQE). Experimental results demonstrate that the SwinGAN model excels in both subjective and objective evaluation metrics, achieving notable improvements in correcting color deviation in underwater images.
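
For readers unfamiliar with the PSNR figures reported here, the following minimal sketch computes PSNR in its standard form, 10 * log10(MAX^2 / MSE), on small grayscale images stored as nested lists. The function name `psnr`, the 8-bit peak value of 255, and the toy images are assumptions; this is not the authors' evaluation code.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized grayscale images."""
    ref = [pixel for row in reference for pixel in row]
    enh = [pixel for row in enhanced for pixel in row]
    if len(ref) != len(enh) or not ref:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, enh)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 1x2 images whose pixels differ by 10 gray levels (MSE = 100).
print(psnr([[100, 100]], [[110, 110]]))
```

Because PSNR is logarithmic, a gain of under 1 dB such as those reported corresponds to a modest reduction in mean squared error.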

Federated learning optimization algorithm based on local drift and diversity computing power

Yiming ZHANG, Tengfei CAO

2025, 45(5): 1447-1454. DOI: 10.11772/j.issn.1001-9081.2024070928

In view of the challenges of non-Independent and Identically Distributed (non-IID) data and heterogeneous computing power faced in Federated Learning (FL) for edge computing applications, the concept of a local drift variable was introduced to avoid the significant deviation in client model updates caused by non-IID data, thereby preventing unstable model convergence. By correcting the local model parameters, the local training process was separated from the global aggregation process, optimizing FL performance when training on non-IID data. Furthermore, considering the diversity of edge server computing power, a new strategy was proposed: a simplified neural network sub-model was divided from the global model for deployment on resource-constrained edge servers, while high-capacity servers utilized the complete global model. Parameters trained by the low-capacity servers were uploaded to the cloud server, with partial parameter freezing to accelerate model convergence. Integrating these two methods, a Federated learning optimization algorithm based on Local drift and Diversity computing power (FedLD) was proposed to solve the heterogeneity challenges caused by non-IID data and diverse computing power in FL for edge computing. Experimental results show that FedLD has faster convergence speed and higher accuracy than the FedAvg, SCAFFOLD, and FedProx algorithms. Compared to FedProx, when 50 clients are involved in training, FedLD improves the model accuracy by 0.39%, 3.68%, and 15.24% on the MNIST, CIFAR-10, and CIFAR-100 datasets, respectively. Comparative analysis with the latest FedProc algorithm reveals that FedLD has lower communication overhead. Additional experiments incorporating the K-Nearest Neighbors (KNN) algorithm, the Long Short-Term Memory (LSTM) model, and the bidirectional Gated Recurrent Unit (GRU) model demonstrate approximately 1% accuracy improvements across all three models when integrated with FedLD.

Deep symbolic regression method based on Transformer

Pengcheng XU, Lei HE, Chuan LI, Weiqi QIAN, Tun ZHAO

2025, 45(5): 1455-1463. DOI: 10.11772/j.issn.1001-9081.2024050609

To address the challenges of reduced population diversity and sensitivity to hyperparameters when solving Symbolic Regression (SR) problems with genetic evolutionary algorithms, a Deep Symbolic Regression Technique (DSRT) method based on Transformer was proposed. This method employed the autoregressive capability of Transformer to generate expression symbol sequences. Subsequently, a transformation of the fitness value between the data and the expression symbol sequence served as the reward value, and the model parameters were updated through deep reinforcement learning, so that the model was able to output expression sequences that fitted the data better; as the model continued to converge, the optimal expression was identified. The effectiveness of the DSRT method was validated on the SR benchmark dataset Nguyen, where it was compared with the DSR (Deep Symbolic Regression) and GP (Genetic Programming) algorithms within 200 iterations. Experimental results confirm the validity of the DSRT method. Additionally, the influence of various parameters on the DSRT method was discussed, and an experiment to predict the formula for the surface pressure coefficient of an aircraft airfoil using the NACA4421 dataset was performed. The obtained formula was compared with the Kármán-Tsien formula, yielding a mathematical formula with a lower Root Mean Square Error (RMSE).

Graph regularized elastic net subspace clustering

Shujian GUO, Jieyue YU, Xuesong YIN

2025, 45(5): 1464-1471. DOI: 10.11772/j.issn.1001-9081.2024050651

Graph-based Subspace Clustering (SC) has become a popular technique for processing high-dimensional data efficiently. However, existing methods suffer from the following problems: the constructed graph is not associated with the clustering task and fails to capture the intrinsic correlated structure of the data. To address these issues, a new SC method was proposed, called Graph regularized Elastic Net Subspace Clustering (GENSC). GENSC employed L₂ norm regularization to enhance the connectivity among samples with correlated structure, and utilized L₁ norm regularization to discard the connectivity among samples from different subspaces. Simultaneously, a nearest neighbor graph of the representation was constructed to capture the intrinsic local structure among samples, and a rank constraint was incorporated to encourage the learned graph to have a clear clustering structure. GENSC integrated the L₂ norm, the L₁ norm, and the rank constraint into a general framework, which was solved by an iterative optimization algorithm. Experimental results on nine real-world datasets demonstrate that on ChinaCXRSet, the accuracy and Normalized Mutual Information (NMI) of GENSC exceed those of the second-best method by 9.03 and 7.61 percentage points, respectively, and its clustering Purity is the best; on UMIST, the accuracy, NMI, and Purity of GENSC exceed those of the second-best method by 4.15, 3.17, and 5.21 percentage points, respectively, validating the effectiveness of GENSC.

Multi-graph diffusion attention network for traffic flow prediction

Quan WANG, Qixiang LU, Pei SHI

2025, 45(5): 1472-1479. DOI: 10.11772/j.issn.1001-9081.2024050636

Current traffic flow prediction methods based on spatio-temporal feature extraction capture global spatial correlation and dynamic long-term temporal dependency insufficiently, and their spatial correlation mining relies heavily on the quality of the graph structure. Therefore, a Multi-Graph Diffusion Attention Network (MGDAN) was proposed, consisting of a Multi-Graph Diffusion Attention Module (MGDAM) and a temporal attention module. Firstly, an adaptive spatio-temporal embedding generator was used to construct dynamic spatio-temporal information. Secondly, a Maximal Information Coefficient (MIC) matrix and an adaptive matrix were utilized to explore fine-grained spatial information, and a global spatial attention mechanism was employed to capture dynamic spatial correlation. Finally, the temporal attention module was used to extract nonlinear temporal correlation, and the integration of the three modules was carried out to realize effective extraction of spatio-temporal correlation. Experimental results demonstrate that, on the PEMS08 dataset, the Mean Absolute Error (MAE) of the MGDAN model within one hour is 19.34% and 5.74% lower than those of the Spatio-Temporal AutoEncoder (ST_AE) and Spatial-Temporal IDentity (STID) models, respectively. At the same time, the MGDAN model outperforms 9 baseline models in overall prediction performance and can conduct medium- and long-term traffic flow prediction accurately, providing a theoretical basis for urban traffic dispersion.

Pedestrian trajectory prediction based on graph convolutional network and endpoint induction

Man CHEN, Xiaojun YANG, Huimin YANG

2025, 45(5): 1480-1487. DOI: 10.11772/j.issn.1001-9081.2024050650

To solve the problem that pedestrian trajectory prediction research focuses only on the interactive information of historical trajectories and ignores the interactive information of endpoints, a pedestrian trajectory prediction model based on Graph Convolutional Network (GCN) and Endpoint Induction was proposed, named GCN-EI. Firstly, a classification method was employed on the training set to learn the weighted distribution of potential future endpoints for pedestrians. Subsequently, the possible endpoints were connected with their corresponding historical trajectories, and the interactive features of pedestrians were extracted over a longer time span by using the GCN with attention mechanism and endpoint conditions. Meanwhile, an individual feature module was used to extract the internal motion characteristics of pedestrians. Finally, the future trajectory of each pedestrian was predicted by temporal inference convolution. Test results on the ETH and UCY datasets show that compared to the STITD-GCN (Spatio-Temporal Interaction and Trajectory Distribution GCN) model, the proposed model decreases the Average Displacement Error (ADE) and Final Displacement Error (FDE) by 4.5% and 5.0%, respectively; moreover, compared to the PCCSNet (Prediction via modality Clustering, Classification and Synthesis Network) model, which uses a classification method, it decreases the FDE by 9.5%.
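
The two metrics reported in this abstract have simple standard definitions: ADE is the mean Euclidean distance between predicted and true positions over all predicted time steps, and FDE is that distance at the final step. The sketch below computes both for a single trajectory; the function name `displacement_errors` and the toy coordinates are hypothetical, and benchmark results additionally average over pedestrians and scenes.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for one trajectory given as a sequence of (x, y) points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(distances) / len(distances)  # average over all time steps
    fde = distances[-1]                    # error at the final time step
    return ade, fde


# Hypothetical three-step prediction that drifts away from the true path.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
true = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(displacement_errors(pred, true))
```

Since the endpoint carries the accumulated drift, FDE is the metric most directly affected by conditioning the prediction on likely endpoints.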

Visually guided word segmentation and part of speech tagging

Haiyan TIAN, Saihao HUANG, Dong ZHANG, Shoushan LI

2025, 45(5): 1488-1495. DOI: 10.11772/j.issn.1001-9081.2024050627

Chinese Word Segmentation (WS) and Part-Of-Speech (POS) tagging can effectively assist downstream tasks such as knowledge graph construction and sentiment analysis. Existing work typically uses only pure-text information for WS and POS tagging. However, the Web also contains much associated image and video information. Therefore, efforts were made to mine associated clues from this visual information to aid Chinese WS and POS tagging. Firstly, a series of detailed annotation standards were established, and a multimodal dataset, VG-Weibo, was annotated with WS and POS tags using the text and image content of Weibo posts. Then, two multimodal information fusion methods with different decoding mechanisms, VGTD (Visually Guided Two-stage Decoding model) and VGCD (Visually Guided Collapsed Decoding model), were proposed to accomplish the joint task of WS and POS tagging. In the VGTD method, a cross-attention mechanism was adopted to fuse textual and visual information, and a two-stage decoding strategy was employed to first predict possible word spans and then predict the corresponding tags; in the VGCD method, a cross-attention mechanism was also utilized to fuse textual and visual information, and a more appropriate Chinese representation and a collapsed decoding strategy were used. Experimental results on the VG-Weibo test set demonstrate that on the WS and POS tagging tasks, the F1 scores of the VGTD method are improved by 0.18 and 0.22 percentage points, respectively, compared to those of the traditional pure-text Two-stage Decoding model (TD); the F1 scores of the VGCD method are improved by 0.25 and 0.55 percentage points, respectively, compared to those of the traditional pure-text Collapsed Decoding model (CD). It can be seen that both the VGTD and VGCD methods can utilize visual information effectively to enhance the performance of WS and POS tagging.

Document-level relation extraction model based on anaphora and logical reasoning

Jie HU, Cui WU, Jun SUN, Yan ZHANG

2025, 45(5): 1496-1503. DOI: 10.11772/j.issn.1001-9081.2024050676

In Document-level Relation Extraction （DocRE） task， the existing models mainly focus on learning interaction among entities in the document， neglecting the learning of internal structures of entities， and pay little attention to recognition of pronoun references and application of logical rules in the document. As a result， these models are not accurate enough in modeling relationships among entities in the document. Therefore， an anaphor-aware relation graph was integrated on the basis of the Transformer architecture to model interaction among entities and internal structures of entities. In this way， anaphora was used to aggregate more contextual information to the corresponding entities， thereby enhancing relation extraction accuracy. Moreover， a data-driven approach was used to mine logical rules from relation annotations to enhance understanding and reasoning capabilities for implicit logical relationships in the text. To solve the problem of sample imbalance， a weighted long-tail loss function was introduced to improve the accuracy of identifying rare relations. Experiments were conducted on two public datasets DocRED （Document-level Relation Extraction Dataset） and Re-DocRED （Revisiting Document-level Relation Extraction Dataset）. The results show that the proposed model has the best performance： when using BERT as encoder， its IgnF1 and F1 values on the test set of DocRED are increased by 1.79 and 2.09 percentage points， respectively， compared to those of the baseline model ATLOP （Adaptive Thresholding and Localized cOntext Pooling）， validating the high comprehensive performance of the proposed model.

Few-shot named entity recognition based on decomposed fuzzy span

Biqing ZENG, Guangbin ZHONG, James Zhiqing WEN

2025, 45(5): 1504-1510. DOI: 10.11772/j.issn.1001-9081.2024050567

Few-shot Named Entity Recognition （few-shot NER） aims to identify entity spans and their types in text based on limited labeled data. Although span-based metric learning has achieved promising results in recent years， two challenges remain： first， prototypes may be pulled away from cluster centers due to sparse candidate spans； second， span detectors may produce non-entity spans that are irrelevant to the categories. To address these issues， a decomposed model integrating fuzzy span， namely DFSM （Decomposed Fuzzy Span Model）， was proposed for few-shot NER. In the span detection stage， a global boundary matrix was used to detect candidate spans， enabling the learning of explicit entity boundary information without dependency on token-level labels. In the span classification stage， a fuzzy span strategy was proposed to adjust the boundary ranges of candidate spans， thereby increasing the number of trainable candidate spans for each entity type. Meanwhile， prototypical contrastive learning was designed to optimize the span-based semantic representation space. Besides， prototypical boundary learning was introduced to enlarge the distance between non-entity spans and prototypes， eliminating interference from non-entity noisy data. Experimental results on Few-NERD and CrossNER datasets show that compared to the baseline model TadNER， DFSM achieves an average F1-score gain of 8.52 percentage points under the Few-NERD Inter setting， with a notable 10.39 percentage points improvement in the Inter 10-way 5~10-shot scenario， highlighting its enhanced capability for fine-grained entity recognition； compared to the baseline model DecomMeta， DFSM achieves F1-score improvements of 3.32 and 1.09 percentage points in the CrossNER 1-shot and 5-shot settings， respectively， demonstrating the good generalization ability of DFSM in cross-domain low-resource scenarios.

Named entity recognition model based on global information fusion and multi-dimensional relation perception

Jie HU, Shuaixing WU, Zhilan CAO, Yan ZHANG

2025, 45(5): 1511-1519. DOI: 10.11772/j.issn.1001-9081.2024050675

The existing Named Entity Recognition （NER） models based on Bidirectional Long Short-Term Memory （BiLSTM） network struggle to fully understand the global semantics of text and capture the complex relationships between entities. Therefore， an NER model based on global information fusion and multi-dimensional relation perception was proposed. Firstly， BERT （Bidirectional Encoder Representations from Transformers） was used to obtain vector representation of the input sequence， and BiLSTM was combined to further learn context information of the input sequence. Secondly， a global information fusion mechanism composed of a gradient stabilization layer and a feature fusion module was proposed. With the former， the model was able to maintain stable gradient propagation and to update and optimize the representation of the input sequence. In the latter， the forward and backward representations of BiLSTM were integrated to obtain a more comprehensive feature representation. Thirdly， a multi-dimensional relation perception structure was constructed to learn correlations between words in different subspaces in order to capture complex entity relationships in documents. In addition， an adaptive focus loss function was used to adjust the weights of different entity types dynamically to improve the recognition performance of the model for minority entities. Finally， experiments were conducted on 7 public datasets for the proposed model and 11 baseline models. The results show that all of the F1 values of the proposed model are higher than those of the comparison models， validating the comprehensive performance of the proposed model.

Chinese image captioning method based on multi-level visual and dynamic text-image interaction

Junyan ZHANG, Yiming ZHAO, Bing LIN, Yunping WU

2025, 45(5): 1520-1527. DOI: 10.11772/j.issn.1001-9081.2024050616

Image captioning technology can help computers understand image content better， and achieve cross-modal interaction. To address the issues of incomplete extraction of multi-granularity features from images and insufficient understanding of image-text correlation in Chinese image captioning tasks， a method for extracting multi-level visual and semantic features of images and dynamically integrating them in decoding process was proposed. Firstly， multi-level visual features were extracted on the encoder， and multi-granularity features were obtained through an auxiliary guidance module of the image local feature extractor. Then， a text-image interaction module was designed to dynamically focus on semantic associations between visual and textual information. Concurrently， a dynamic feature fusion decoder was designed to perform closed-loop dynamic fusion and decoding of features with adaptive cross-modal weights， ensuring enhanced information integrity while maintaining semantic relevance. Finally， coherent Chinese descriptive sentences were generated. The method's effectiveness was evaluated using BLEU-n， Rouge， Meteor， and CIDEr metrics， with comparisons against eight existing approaches. Experimental results demonstrate consistent improvements across all semantic relevance evaluation metrics. Specifically， compared with the baseline model NIC （Neural Image Caption）， the proposed method improves the BLEU-1， BLEU-2， BLEU-3， BLEU-4， Rouge_L， Meteor， and CIDEr by 5.62%， 7.25%， 8.78%， 10.85%， 14.06%， 5.14%， and 15.16%， respectively， confirming its superior accuracy.

Chinese spelling correction algorithm based on multi-modal information fusion

Qing ZHANG, Fan YANG, Yuhan FANG

2025, 45(5): 1528-1534. DOI: 10.11772/j.issn.1001-9081.2024050628

The goal of Chinese Spelling Correction （CSC） is to detect and correct character or word-level errors in user-input Chinese text， which commonly arise from semantic， phonetic， or glyphic similarities among Chinese characters. However， existing models often neglect local information， and fail to fully capture phonetic and glyphic similarities among different Chinese characters or to integrate these similarities effectively with semantic information. To address these issues， a new CSC algorithm based on multimodal information fusion， namely PWSpell， was proposed. This algorithm utilized a convolutional attention mechanism to focus on local semantic information， employed Pinyin encoding to capture phonetic similarities among characters， and， for the first time， introduced Wubi encoding into the CSC domain for capturing glyphic similarities among Chinese characters. Additionally， it selectively integrated these two types of similarity information with semantic information processed by BERT （Bidirectional Encoder Representations from Transformers）. Experimental results demonstrate that PWSpell improves error detection accuracy， precision， F1-score， as well as correction precision and F1-score on SIGHAN 2015 test set， with at least one percentage point increase in correction precision. Ablation experimental results also validate that the design of each module in PWSpell effectively improves its performance.

User data management and control in internet of behaviors： a review

Yi HE, Yinan XIAO, Yunkai WEI, Supeng LENG

2025, 45(5): 1535-1547. DOI: 10.11772/j.issn.1001-9081.2024050599

In recent years， the rapid development of Internet of Things （IoT） has spurred the emergence of the Internet of Behaviors （IoB）， which leverages IoT-derived data and information to achieve higher levels of knowledge and wisdom， and is rapidly evolving into a promising technology with broad application potential. IoB involves extensive collection， processing， and utilization of user behavioral data， thereby exposing user data security and privacy to significant risks. Therefore， it is vital to protect the IoB user data with effective data management and control. After introducing the fundamental concepts and characteristics of IoB， its development trends and the security and privacy risks associated with user data were analyzed. Furthermore， the current situation of management and control of behavioral data was elaborated， the main problems and challenges existing in IoB were discussed， and the potential research directions to achieve user data management and control in IoB were proposed.

Evaluation of cross-domain attacks in cloud-edge collaborative industrial control systems

Chenwei LIN, Ping CHEN

2025, 45(5): 1548-1555. DOI: 10.11772/j.issn.1001-9081.2024050579

In response to the increasing complexity of Industrial Control System （ICS） structure， especially within the context of cloud-edge collaborative computing， which significantly raises cybersecurity risks， an evaluation framework specifically for assessing cross-domain attacks in cloud-edge collaborative scenarios was proposed to identify， evaluate， and defend against potential security threats systematically. Initially， this framework entailed a thorough collection and categorization of ICS assets， cross-domain attack entrances， methods， and impacts， establishing a foundational database and structure for assessment. Furthermore， based on the characteristics of ICS， a novel set of evaluation indicators for cross-domain attacks was developed， encompassing system modules， attack paths， attack methods， and potential impacts. Additionally， through simulation experiments conducted in a simulated ICS environment， the effectiveness of this evaluation framework was tested， verifying its capacity to identify vulnerabilities within the system and enhance overall security. The results demonstrate that the assessment framework can provide both theoretical and practical guidance for the secure application of cloud-edge technologies in industrial settings， indicating promising applicability.

Conditional privacy-preserving authentication scheme based on certificateless group signature for VANET

Yueduan XU, Jianwei CHEN, Hengliang ZHU

2025, 45(5): 1556-1563. DOI: 10.11772/j.issn.1001-9081.2024050695

The Vehicular Ad hoc NETwork （VANET） improves road traffic efficiency， but the security and privacy issues it faces may lead to serious traffic accidents， making anonymous authentication of messages necessary. However， existing authentication schemes still struggle with the problems of conditional privacy preservation， anonymous authentication and authentication efficiency. To address these problems， a conditional privacy-preserving authentication scheme for VANET based on certificateless group signature was proposed. Firstly， an anonymous authentication scheme based on certificateless group signature was proposed by combining certificateless public key cryptosystem with the ACJT group signature algorithm. In this scheme， when group membership changes， other group members remain unaffected and require no key updates； moreover， the computational overhead of the group signature generation and verification algorithm remains constant， independent of the number of group members. Furthermore， to prevent vehicles from committing malicious acts due to identity anonymity， the scheme realized conditional privacy protection， i.e.， when a malicious act occurs， the identity of the relevant vehicle can be traced and the vehicle held accountable. Security analysis proves that the scheme simultaneously satisfies forward security， unforgeability， and unlinkability requirements. Performance experimental results show that compared with similar schemes， the proposed scheme improves the authentication efficiency by at least 31.63% and reduces the communication overhead by at least 33.12%.

Privacy protection method for consortium blockchain based on SM2 linkable ring signature

Gaimei GAO, Miaolian DU, Chunxia LIU, Yuli YANG, Weichao DANG, Guoxia DI

2025, 45(5): 1564-1572. DOI: 10.11772/j.issn.1001-9081.2024050607

To address the challenges of privacy leakage in identity information and transaction data within consortium blockchain， a Privacy Protection Method for Consortium Blockchain based on SM2 Linkable Ring Signature （PPMCB-SM2LRS） was proposed. Firstly， to overcome the issues of insufficient security and poor traceability in existing Linkable Ring Signature （LRS） schemes， the LRS scheme was redesigned in combination with SM2 digital signature， aiming to enhance the privacy protection of counterparty identities while enabling the traceability of malicious users. Secondly， based on the optimized Paillier homomorphic encryption algorithm， a hierarchical encryption strategy was proposed to realize the “visible unavailability” of private data， so as to improve the privacy and confidentiality of transaction data verification in consortium chain. Security analysis demonstrates that the proposed method is correct， unforgeable， conditionally anonymous and linkable. Experimental results show that compared with similar LRS schemes， PPMCB-SM2LRS has lower computational overhead， and the average time spent in the signature generation and verification stages is significantly reduced； additionally， it adheres to the principle of autonomous controllability in cryptographic technology development.
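The additive homomorphism that makes Paillier encryption suitable for checking encrypted transaction amounts can be illustrated with a toy sketch. The block below is textbook Paillier with tiny hard-coded primes; it is not the optimized variant or the hierarchical encryption strategy of PPMCB-SM2LRS, the function names are illustrative assumptions, and the key size is far too small for any real use.

```python
import math
import random

def keygen(p=61, q=53):
    """Toy key generation from two small primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator g = n + 1 is used
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt 0 <= m < n as c = (n+1)^m * r^n mod n^2 with random r."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n

n, priv = keygen()
c1, c2 = encrypt(n, 15), encrypt(n, 27)
# Multiplying ciphertexts adds the underlying plaintexts.
c_sum = (c1 * c2) % (n * n)
total = decrypt(n, priv, c_sum)
```

Here `total` is the sum of the two plaintexts even though neither was decrypted individually, which is the property a verifier can rely on when checking encrypted amounts.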

Adversarial sample generation method for time-series data based on local augmentation

Xueying LI, Kun YANG, Guoqing TU, Shubo LIU

2025, 45(5): 1573-1581. DOI: 10.11772/j.issn.1001-9081.2024050610

Deep Neural Networks （DNNs） are highly susceptible to adversarial attacks， causing security problems in time-series data classification tasks. Gradient-based attack methods can generate adversarial samples quickly but need continuous access to the model's internal information， while generation-based attack methods do not need this access after training but suffer from poor stealthiness and transferability. To address these problems， a semi-white-box adversarial sample generation method for time-series data based on local augmentation was proposed using the generative attack method AdvGAN. The local augmentation strategy in this method injected information from other data categories into original samples and utilized the augmented data to execute semi-white-box attacks. The attack model leveraged both original sample information and distribution characteristics of other categories， thereby enhancing its attack capability and transferability. Experimental results on UCR datasets demonstrate that the proposed method generates an adversarial example in 0.027 s on average； it outperforms Fast Gradient Sign Method （FGSM）， AdvGAN， and GATN （Gradient Adversarial Transformation Network） methods in attack success rate on 18， 25， and 13 of 27 datasets， respectively. The generated adversarial examples exhibit significantly smaller Mean Squared Error （MSE） compared to AdvGAN and GATN methods on 20 and 27 datasets， respectively. Its transfer success rates surpass AdvGAN and FGSM methods on 18 and 11 datasets， respectively， with transfer attack success rates exceeding 25% on 9 of 21 datasets. The results indicate that the proposed method maintains efficient adversarial example generation while improving stealthiness and preserving competitive attack performance.
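For context on the gradient-based baseline named above, FGSM perturbs each input value by a fixed step in the direction of the sign of the loss gradient. The sketch below applies it to a hypothetical logistic-regression classifier over a short series; the weights, the sample, and the helper names are illustrative assumptions and are unrelated to the models or UCR data used in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model on one sample."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(x, y, w, b, eps):
    """x_adv = x + eps * sign(dLoss/dx) for the logistic model above."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    err = sigmoid(z) - y                 # dLoss/dz for cross-entropy
    grad = [err * wi for wi in w]        # chain rule: dLoss/dx_i = err * w_i
    sign = lambda v: (v > 0) - (v < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical 4-step series with true label 1.
w, b = [0.8, -0.5, 0.3, 1.2], -0.1
x, y = [0.2, 0.1, 0.4, 0.3], 1
x_adv = fgsm(x, y, w, b, eps=0.1)
# Mean squared error between original and adversarial series (stealthiness proxy).
mse = sum((a - c) ** 2 for a, c in zip(x, x_adv)) / len(x)
```

The perturbation needs the gradient at attack time, which is the continuous access to model internals that generation-based methods avoid after training.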

Downsampled image forensic network based on image recovery and spatial channel attention

Aoling LIU, Wuyang SHAN, Junying QIU, Mao TIAN, Jun LI

2025, 45(5): 1582-1588. DOI: 10.11772/j.issn.1001-9081.2024050672

Downsampling causes images to lose high-frequency forensic traces and detail information， increasing the difficulty of image forensics. Existing deep learning-based image forensic networks cannot effectively detect tampered images that have undergone downsampling， making the robustness of downsampled image forensic methods a bottleneck in image forensics. To solve these problems， a downsampled image forensic network named HirrNet （Hierarchical Ringed Residual U-Net） was proposed， which consists of an image recovery module and a tampering detection module. In the image recovery module， the idea of Hierarchical Conditional Flow （HCF） was used to reduce the loss of high-frequency information by recovering forensic traces and details in tampered images， so as to improve the performance of tampering detection. In the tampering detection module， an end-to-end image segmentation network RRU-Net （Ringed Residual U-Net） was employed for tampering detection. Besides， by combining the Spatial and Channel Squeeze & Excitation （SCSE） mechanism， the extraction of tampering-related features in the downsampled image was effectively enhanced. Experimental results show that HirrNet outperforms comparative networks in terms of Area Under the receiver operating characteristic Curve （AUC）， F1-score and Intersection over Union （IoU） on DSO， Columbia， CASIA， and NIST16 datasets. Compared with the comparative methods， HirrNet improves the AUC by 25 and 30 percentage points on average for the tampered images scaled down to 1/2 and 1/4 of their original sizes on CASIA dataset. These findings indicate that HirrNet can effectively resolve the poor robustness of existing downsampled image forensic methods.

Node collaboration mechanism for quality optimization of hierarchical federated learning models under energy consumption constraints

Yazhou FAN, Zhuo LI

2025, 45(5): 1589-1594. DOI: 10.11772/j.issn.1001-9081.2024050704

The massive data generated at the edge can be used to train global models through Federated Learning （FL）， making the combination of edge computing and federated learning a key technology for reducing network energy consumption. In Hierarchical Federated Learning （HFL）， differences in the amount and quality of local data on edge devices directly affect the quality of the global HFL model. To address this issue， a Node Cooperation Algorithm under Transmission Energy Consumption Constraint （NCATTECC） was proposed to solve the global model quality optimization problem. This problem was proved to be Non-deterministic Polynomial-hard （NP-hard）， and the proposed algorithm was proved to have an approximation ratio of $(1-1/e)$. Specifically， node collaboration enabled the participation of more high-quality nodes in training without exceeding energy consumption limits. Simulation experimental results on the widely used CIFAR-10 and FashionMNIST datasets show that the proposed algorithm achieves model accuracy improvements of 4.47% and 6.64% compared to FedAvg （Federated Averaging）， and 3.47% and 4.58% compared to Fed-CBS （Federated Class-balanced Sampling）， respectively， when training with selected nodes.

Improved ring theory-based evolutionary algorithm with new repair optimization operator for solving multi-dimensional knapsack problem

Hansong ZHANG, Yichao HE, Fei SUN, Guoxin CHEN, Ju CHEN

2025, 45(5): 1595-1604. DOI: 10.11772/j.issn.1001-9081.2024050575

To efficiently solve the Multi-dimensional Knapsack Problem （MKP） using Ring Theory-based Evolutionary Algorithm （RTEA）， the inadequacies of two existing repair operators were analyzed： RO1 （based on the pseudo-utility ratio of items’ overall resource consumption） and RO3 （based on the value density across individual resource dimensions）. On this basis， a new weighted repair optimization operator named RO4 was proposed by integrating a complementary strategy. Additionally， an inheritance strategy was introduced to improve the global evolutionary operator of RTEA， and a self-adaptive reverse mutation operator suitable for MKP was proposed on the basis of Logistic model， resulting in a new algorithm IRTEA-RO4 for solving MKP. To validate its efficiency， IRTEA-RO4 was tested on 114 internationally recognized MKP benchmark instances and compared with six state-of-the-art algorithms for solving MKP. Experimental results demonstrate that for small-scale MKP instances， IRTEA-RO4 achieves the highest solution accuracy and fastest computation speed； for large-scale MKP instances， IRTEA-RO4 outperforms the best results of the six existing algorithms by 21% to 125% in solution quality， while also exhibiting superior average performance， enhanced stability， and faster computational speed.
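To make the idea of a pseudo-utility-based repair operator concrete, the following is a minimal sketch of the classic greedy drop-then-add repair for a 0/1 MKP solution. It assumes the pseudo-utility of an item is its value divided by its capacity-normalized total resource consumption and that every item consumes some resource; it is a generic illustration, not the RO1, RO3, or RO4 operators defined in the paper, and the instance data are made up.

```python
def repair(x, values, weights, capacities):
    """Greedy repair: drop low-ratio items until feasible, then add what fits.

    weights[i][j] is the consumption of resource i by item j.
    """
    m, n = len(capacities), len(values)
    # Assumed pseudo-utility: value per unit of normalized total consumption.
    ratio = [
        values[j] / sum(weights[i][j] / capacities[i] for i in range(m))
        for j in range(n)
    ]
    order = sorted(range(n), key=lambda j: ratio[j])  # ascending ratio
    x = list(x)
    load = [sum(weights[i][j] * x[j] for j in range(n)) for i in range(m)]

    # Drop phase: remove the least useful selected items while infeasible.
    for j in order:
        if all(load[i] <= capacities[i] for i in range(m)):
            break
        if x[j]:
            x[j] = 0
            for i in range(m):
                load[i] -= weights[i][j]

    # Add phase: try the most useful unselected items first.
    for j in reversed(order):
        if not x[j] and all(load[i] + weights[i][j] <= capacities[i] for i in range(m)):
            x[j] = 1
            for i in range(m):
                load[i] += weights[i][j]
    return x

values = [10, 6, 8, 3]
weights = [[4, 3, 5, 2],   # resource 0
           [3, 4, 2, 1]]   # resource 1
capacities = [8, 6]
repaired = repair([1, 1, 1, 1], values, weights, capacities)
repaired_value = sum(v * s for v, s in zip(values, repaired))
```

Starting from the infeasible all-ones solution, the drop phase removes items in increasing ratio order and the add phase then reinserts any item that still fits, so the returned solution always satisfies every capacity constraint.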

Two-stage infill sampling-based semi-supervised expensive multi-objective optimization algorithm

Ying TAN, Xinyu REN, Chaoli SUN, Sisi WANG

2025, 45(5): 1605-1612. DOI: 10.11772/j.issn.1001-9081.2024050585

Replacing expensive objective function evaluations with computationally inexpensive surrogate models to assist evolutionary algorithms in solving expensive black-box multi-objective optimization problems has garnered significant attention in recent years. Model accuracy plays a critical role in surrogate-assisted Multi-Objective Evolutionary Algorithms （MOEAs）； particularly when dealing with numerous objective functions， inaccurate models may misguide the search direction. However， due to the high cost of objective function evaluation， obtaining sufficient training samples to build high-quality surrogate models remains challenging. To address this issue， a Two-stage Infill Sampling-based Semi-supervised Expensive Multi-objective Optimization Algorithm （TISS-EMOA） was proposed. Semi-supervised techniques were introduced to augment the training dataset by selecting partial unlabeled data， thereby improving model accuracy. Simultaneously， a two-stage infill sampling criterion was introduced to acquire high-quality solutions for expensive multi-objective optimization problems under limited evaluation budgets. To validate the effectiveness of TISS-EMOA， experiments were conducted on the DTLZ1-DTLZ7 benchmark problems and a real-world vehicle frontal structure optimization design. Compared with five State-Of-The-Art （SOTA） surrogate-assisted multi-objective evolutionary algorithms， TISS-EMOA achieves 25， 28， 28， 24， and 23 optimal or equal Modified Inverted Generational Distance （IGD⁺） results， respectively， in 28 benchmark problems.

Channel estimation of reconfigurable intelligent surface assisted communication system based on deep learning

Dan WANG, Wenhao ZHANG, Lijuan PENG

2025, 45(5): 1613-1618. DOI: 10.11772/j.issn.1001-9081.2024050587

To address the issue of low channel estimation accuracy in Reconfigurable Intelligent Surface （RIS） assisted communication systems， a channel estimation scheme based on Channel Denoising Network （CDN） was proposed， which modeled the channel estimation problem as a channel noise elimination problem. Firstly， traditional algorithms were employed to obtain a preliminary channel estimate from the received pilot signal. Then， the estimated signals were input into the channel estimation network to learn noise features and execute denoising， thereby recovering accurate channel coefficients. Finally， to improve the denoising capability of the network， a Weighted Attention Block （WAB） and a Dilated Convolution Block （DCB） were designed to enhance the network's extraction of dominant noise features， and a multi-scale feature fusion module was designed to prevent the loss of shallow features. Simulation results demonstrate that compared with classical DnCNN （Denoising Convolutional Neural Network） and CDRN （Convolutional neural network-based Deep Residual Network） schemes， the proposed scheme reduces the Normalized Mean Square Error （NMSE） by 2.89 dB and 2.01 dB on average， respectively， at different Signal-to-Noise Ratios （SNRs）.
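Because the gains above are reported in decibels, it can help to see how NMSE is computed and what a dB reduction means on a linear scale. The sketch below uses the usual definition NMSE = ‖Ĥ − H‖² / ‖H‖² and the conversion 10·log10; the channel vector and helper names are illustrative assumptions, not data from the paper.

```python
import math

def nmse_db(h_true, h_est):
    """NMSE between true and estimated channel coefficients, in dB."""
    err = sum(abs(a - b) ** 2 for a, b in zip(h_true, h_est))
    ref = sum(abs(a) ** 2 for a in h_true)
    return 10.0 * math.log10(err / ref)

def db_to_ratio(delta_db):
    """Convert a change in dB to a multiplicative factor on the linear scale."""
    return 10.0 ** (delta_db / 10.0)

# Hypothetical complex channel and an estimate that is 10% too large everywhere.
h_true = [1 + 1j, 2 - 1j, -1 + 0.5j]
h_est = [1.1 * h for h in h_true]
example_db = nmse_db(h_true, h_est)   # error energy is 1% of the channel energy

# Linear-scale factors corresponding to the reported average reductions.
factor_vs_dncnn = db_to_ratio(-2.89)
factor_vs_cdrn = db_to_ratio(-2.01)
```

Under this conversion, a 2.89 dB reduction corresponds to an NMSE of roughly 0.51 times the reference value, and a 2.01 dB reduction to roughly 0.63 times.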

Resource allocation for relay in intelligent reflecting surface assisted wireless powered communication networks

Hongwei FAN, Woping XU

2025, 45(5): 1619-1624. DOI: 10.11772/j.issn.1001-9081.2024050633

To address the problems of limited coverage and vulnerability to obstacles in Wireless Powered Communication Network （WPCN）， the resource allocation for Intelligent Reflecting Surface （IRS） assisted WPCN relay systems under communication blocking conditions was investigated. Specifically， in the downlink， IRS assisted users in harvesting energy from the Hybrid Access Point （HAP）； in the uplink， it facilitated information transmission from users to the HAP. Considering both energy transmission and information transfer， Time Division Multiple Access （TDMA） was used to partition time slots for energy harvesting， data communication， and data relay transmission. Based on the constructed system model and transmission strategy， an energy efficiency optimization problem was formulated with constraints on user quality of service and on the energy consumed by users in sending information， and the total energy efficiency of the system was maximized by jointly optimizing the transmit power， IRS phase shift matrix and time scheduling. Due to the non-convex nature of the proposed problem， the Dinkelbach method was first applied to transform the fractional objective function into a non-fractional form. Subsequently， variable substitution and Semi-Definite Programming （SDP） were employed to convert the non-convex problem into a convex formulation， which was then solved suboptimally using CVX. Simulation results show that the proposed scheme not only extends system coverage， but also significantly improves energy efficiency. Compared with the average time allocation scheme and optimized time scheme with hybrid relay node， the proposed scheme achieves average energy efficiency improvements of 11.0% and 26.9%， respectively.
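The Dinkelbach step mentioned above replaces a ratio objective f(x)/g(x) with a sequence of non-fractional problems max f(x) − λ·g(x), updating λ to the ratio at each new maximizer. The sketch below shows that iteration on a hypothetical single-link energy-efficiency problem solved over a power grid; the rate and power models, constants, and names are illustrative assumptions and do not reproduce the paper's system model, SDP relaxation, or CVX solution.

```python
import math

def dinkelbach(candidates, f, g, tol=1e-9, max_iter=100):
    """Maximize f(x) / g(x) over a finite candidate list, assuming g(x) > 0."""
    lam = 0.0
    for _ in range(max_iter):
        # Parametric subproblem: maximize the non-fractional objective f - lam * g.
        best = max(candidates, key=lambda x: f(x) - lam * g(x))
        if f(best) - lam * g(best) < tol:
            break  # lam already matches the best achievable ratio (up to tol)
        lam = f(best) / g(best)
    return best, f(best) / g(best)

# Hypothetical model: rate in bit/s/Hz over transmit power plus circuit power.
H_GAIN = 4.0      # assumed effective channel gain
P_CIRCUIT = 0.5   # assumed constant circuit power
rate = lambda p: math.log2(1.0 + H_GAIN * p)
power = lambda p: p + P_CIRCUIT

grid = [0.01 * k for k in range(1, 501)]  # candidate transmit powers
p_star, ee_star = dinkelbach(grid, rate, power)
```

On a finite candidate set the iteration stops after a handful of updates and returns the same ratio as an exhaustive search; in a continuous setting, each subproblem would instead be handed to a convex solver.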

Fast beam training on extremely large-scale multiple-input multiple-output system

Huahua WANG, Changjiang XIE, Jiening FANG

2025, 45(5): 1625-1631. DOI: 10.11772/j.issn.1001-9081.2024050583

The eXtremely Large-scale Multiple-Input-Multiple-Output （XL-MIMO） system can significantly improve channel capacity. However， with traditional Uniform Linear Arrays （ULAs）， the near-field region shrinks drastically at large incident and emitted angles， leading to limited signal coverage. The use of Uniform Circular Arrays （UCAs） can effectively expand the near-field region， but renders low-overhead beam training schemes based on ULA impractical. To reduce the overhead of near-field beam training with UCA， a new fast beam training scheme was proposed. In the first stage， UCA was approximated as ULA， and a joint method was used to construct a far-field hierarchical codebook for angle domain user search； in the second stage， based on the angles obtained from the first stage， UCA was used for exhaustive search in both angle and distance domains. Simulation results on a UCA system with 512 antennas indicate that the proposed scheme requires a training overhead of only 28， while maintaining good robustness across different Signal-to-Noise Ratio （SNR） conditions， and its rate performance achieves 99.16% of the benchmark.

Survey on hardware acceleration schemes for ray tracing

Daquan ZHANG, Jiarui DONG, Yang LEI, Shikang LI, Xiangyu SHI, Zonghui LI, Yangdong DENG, Weimin WU

2025, 45(5): 1632-1644. DOI: 10.11772/j.issn.1001-9081.2024030399

Nowadays， real-time 3D graphics rendering is undergoing technological innovation， with a surge in applications of real-time ray tracing technology. However， from a computational perspective， ray tracing remains expensive， as traditional hardware cannot support such computational demands. New Graphics Processing Units （GPUs） must balance performance， power consumption， and higher complexity scenarios， making hardware acceleration technologies central to real-time ray tracing. Firstly， the theoretical foundations of ray tracing were introduced， and based on the two most dominant acceleration data structures， KD-Tree （K-Dimensional Tree） and Bounding Volume Hierarchies based on Tree （BVH-Tree）， primitive segmentation， construction methods， optimization methods， and traversal acceleration were investigated to reveal the potential of these two structures for hardware acceleration. Secondly， the dedicated acceleration hardware developed for each stage was summarized from three perspectives： fixed-function design， hardware architecture design， and scheduling and data management to reduce memory bandwidth. Thirdly， mainstream industry-oriented ray tracing GPU solutions and future industry development trends were surveyed. Finally， the current situation and limitations of hardware acceleration schemes were discussed， along with potential directions for performance optimization.

Review of multi-modal research methods for face recognition

Yali YANG, Ying LI, Yutao ZHANG, Peihua SONG

2025, 45(5): 1645-1657. DOI: 10.11772/j.issn.1001-9081.2024050568

Multi-modal face recognition technology can fully utilize face features and other biometric features to enhance recognition robustness and security， and has broad practical application value. Current research on multi-modal face recognition has problems such as modal disparity and inefficient modal fusion. Therefore， based on multiple information modalities and application purposes， the existing multi-modal face recognition methods were classified and reviewed to sort out the problems in research and explore future development directions. Firstly， multi-modal face recognition research based on multi-source information fusion was divided into sensor-level， feature-level， score-level， and decision-level methods according to the stage of data processing at which fusion occurs， and the advantages， limitations， and applicable scenarios of the existing methods were summarized. Secondly， research on information-enhanced multi-modal face recognition was categorized into 2D-3D and 3D-2D information enhancement methods according to the enhanced modality， and the advantages and disadvantages of the existing methods were summarized. Thirdly， multi-modal face recognition methods based on other biometric features and for anti-spoofing were summarized， and the relevant information on commonly used multi-modal face recognition datasets was introduced briefly. Finally， key challenges were presented and future development directions were outlined.
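To make one of the fusion stages concrete， the following is a minimal sketch of score-level fusion: each modality produces match scores for the same candidates， the scores are min-max normalized， and a weighted sum gives the fused score. The modality names， weights， and score values are illustrative assumptions， and the review covers many other fusion rules.

```python
def min_max_normalize(scores):
    """Rescale a list of scores to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(scores_by_modality, weights):
    """Weighted sum of normalized scores, one fused score per candidate."""
    normalized = {m: min_max_normalize(s) for m, s in scores_by_modality.items()}
    n = len(next(iter(normalized.values())))
    total = sum(weights[m] for m in normalized)
    return [
        sum(weights[m] * normalized[m][i] for m in normalized) / total
        for i in range(n)
    ]


# Hypothetical match scores for three candidates from two modalities.
scores = {"rgb": [2.0, 8.0, 5.0], "depth": [10.0, 30.0, 50.0]}
weights = {"rgb": 0.6, "depth": 0.4}
fused = fuse_scores(scores, weights)
best = max(range(len(fused)), key=fused.__getitem__)
print(fused, best)
```

Normalizing before summing matters because the modalities in this example report scores on different scales.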

Review of unsupervised deep learning methods for industrial defect detection

Wenpeng WANG, Yinchang QIN, Wenxuan SHI

2025, 45(5): 1658-1670. DOI: 10.11772/j.issn.1001-9081.2024050736

Industrial defect detection plays a crucial role in ensuring product quality and enhancing enterprise competitiveness. Traditional defect detection methods rely on manual inspection， which is costly and inefficient， making it difficult to meet large-scale quality inspection requirements. In recent years， vision-based industrial defect detection technologies have made significant progress and become an efficient solution for product appearance quality inspection. However， in many practical industrial scenarios， it is challenging to obtain large amounts of labeled data， and there are requirements for both the labor cost and real-time performance of product detection， which has made unsupervised learning a research hotspot. Related work on task construction， current technologies， evaluation standards， and the commonalities and differences among various methods in this field was reviewed. Firstly， the definition of industrial defect problems was clarified， and the difficulties of the problem were analyzed from the perspectives of data challenges and task difficulties. Secondly， unsupervised deep learning-based methods for industrial defect detection were comprehensively introduced and systematically categorized. Furthermore， commonly used public datasets and evaluation metrics were summarized. Finally， future work in industrial defect detection was discussed.

Classification algorithm for point cloud based on local-global interaction and structural Transformer

Kai CHEN, Hailiang YE, Feilong CAO

2025, 45(5): 1671-1676. DOI: 10.11772/j.issn.1001-9081.2024050572

To address insufficient extraction of local and global features in point cloud classification， a point cloud classification algorithm with local-global interaction and structural Transformer was proposed. Firstly， a dual-branch parallel local-global interaction framework was proposed to extract local and global features respectively： in one branch， maximum pooling and convolution were used to extract local features， and in the other branch， average pooling and Transformer were used to extract global features. Meanwhile， considering the importance of position information in Transformer， a structural Transformer was proposed to further enhance the global structural features by repeatedly applying interaction between position information and current features. Finally， the local-global features were used to complete the point cloud classification task. Experimental results show that the classification Overall Accuracies （OAs） of the proposed algorithm are 93.6% and 87.5% on ModelNet40 and ScanObjectNN benchmark datasets， respectively. These results indicate that the proposed local-global interaction and structural Transformer network achieves good performance in the point cloud classification task.
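The two pooling operators named in the abstract can be sketched on a tiny set of per-point feature vectors. This is only an illustration of maximum and average pooling over points， with made-up feature values and hypothetical function names; it does not reproduce the convolution， Transformer， or interaction modules of the proposed network.

```python
def max_pool(features):
    """Channel-wise maximum over all points (keeps the strongest response)."""
    return [max(channel) for channel in zip(*features)]


def avg_pool(features):
    """Channel-wise mean over all points (summarizes the overall distribution)."""
    return [sum(channel) / len(features) for channel in zip(*features)]


# Three points, two feature channels each (illustrative values).
points = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]
print(max_pool(points), avg_pool(points))

# Both operators are symmetric: reordering the points leaves the result unchanged,
# which suits unordered point sets.
print(max_pool(points[::-1]) == max_pool(points))
```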

Unsupervised point cloud anomaly detection based on multi-representation fusion

Zihe CHEN, Bin CHEN

2025, 45(5): 1677-1685. DOI: 10.11772/j.issn.1001-9081.2024050652

With the growing demand for industrial automation， 3D point cloud anomaly detection has played an increasingly important role in product quality control. However， the existing methods often rely on a single feature， leading to information loss and accuracy reduction. To address these issues， an unsupervised point cloud anomaly detection method based on multi-representation fusion was proposed， called MRF （Multi-Representation Fusion）. MRF used multi-angle rotation and various coloring schemes to render point clouds into multi-modal images， and employed pre-trained 2D convolutional neural networks to extract rich semantic features. Simultaneously， a pre-trained Point Transformer was adopted to extract 3D structural features. By fusing the 2D image semantic features and the 3D structural features， MRF was able to capture point cloud information more comprehensively. In the anomaly detection stage， abnormal point clouds were identified effectively by using a method based on positive sample memory banks and nearest neighbor search. Experimental results on MVTec 3D AD dataset show that MRF achieves a point cloud-level AUROC （Area Under the Receiver Operating Characteristic curve） of 0.972 and a point-level AUPRO （Area Under the Per-Region Overlap） of 0.948， significantly outperforming existing methods. These results indicate that the effectiveness and robustness of MRF make it a highly promising solution for industrial applications.
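The detection stage， a memory bank of features from normal samples queried by nearest neighbor search， can be sketched as follows. Fusing the 2D and 3D features by simple concatenation， using Euclidean distance， and the tiny feature vectors are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def fuse(feat_2d, feat_3d):
    """Assumed fusion: concatenate 2D semantic and 3D structural features."""
    return list(feat_2d) + list(feat_3d)


def anomaly_score(feature, memory_bank):
    """Distance to the nearest stored normal feature; larger means more anomalous."""
    return min(math.dist(feature, normal) for normal in memory_bank)


# Memory bank built only from normal (positive) samples.
bank = [fuse([0.0, 0.0], [0.0]), fuse([1.0, 0.0], [0.0])]

print(anomaly_score(fuse([0.0, 0.0], [0.0]), bank))  # matches a stored normal sample
print(anomaly_score(fuse([4.0, 4.0], [0.0]), bank))  # far from every normal sample
```

A threshold on this score， chosen on held-out normal data， would then separate normal from abnormal inputs.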

Robotic grasp detection with feature fusion of spatial-Fourier domain information under low-light environments

Lu CHEN, Huaiyao WANG, Jingyang LIU, Tao YAN, Bin CHEN

2025, 45(5): 1686-1693. DOI: 10.11772/j.issn.1001-9081.2024111686

To address the inability of the existing grasp detection methods to perceive sparse and weak features effectively， which degrades robotic grasp detection performance in low-light environments， a robotic grasp detection method integrating spatial-Fourier domain information for low-light environments was proposed. Firstly， the proposed model utilized an encoder-decoder architecture as its backbone， and performed spatial-Fourier domain feature extraction during the fusion of deep and shallow features within the network. Specifically， in the spatial domain， global contextual information was captured using strip convolutions applied in horizontal and vertical directions， enabling the extraction of information critical to the grasp detection task. In the Fourier domain， image details and texture features were restored by independently modulating amplitude and phase components. Furthermore， an R-CoA （Row-Column Attention） module was incorporated to effectively balance global and local image information， while encoding the relative positional relationships of image rows and columns to emphasize positional information pertinent to grasp tasks. Finally， validation on low-light Cornell， low-light Jacquard， and the constructed low-light C-Cornell datasets demonstrates that the proposed method achieves highest accuracies of 96.62%， 92.01%， and 95.50%， respectively. Specifically， on the low-light Cornell dataset （Gaussian noise and $\gamma = 1.5$）， the proposed method outperforms GR-ConvNetv2 （Generative Residual Convolutional Neural Network v2） and SE-ResUNet （Squeeze-and-Excitation ResUNet） in accuracy by 2.24 percentage points and 1.12 percentage points， respectively. The proposed method can effectively improve the robustness and accuracy of grasp detection in low-light environments， providing support for robotic grasping tasks under insufficient illumination conditions.
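What it means to modulate amplitude and phase separately in the Fourier domain can be shown on a short 1D signal. The sketch below uses a plain discrete Fourier transform from the standard library， a single global amplitude gain， and a single global phase shift; these are simplifying assumptions for illustration， whereas the proposed network operates on 2D feature maps with learned modulation.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(spectrum)) / n
        for t in range(n)
    ]


def modulate(signal, amp_gain=1.0, phase_shift=0.0):
    """Scale the amplitude and shift the phase of every bin, then invert.

    Only the real part is returned; a non-zero phase_shift generally makes
    the inverse transform complex, so information is discarded in that case.
    """
    out = []
    for c in dft(signal):
        amplitude, phase = cmath.polar(c)
        out.append(cmath.rect(amplitude * amp_gain, phase + phase_shift))
    return [v.real for v in idft(out)]


dark = [0.1, 0.4, 0.2, 0.3]
print(modulate(dark, amp_gain=2.0))  # amplitude doubled, phase untouched
```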

Refined inspection method for power transmission lines based on monocular vision

Wenshuai WANG, Jun HAN, Guangyi HU, Keyu CHEN

2025, 45(5): 1694-1702. DOI: 10.11772/j.issn.1001-9081.2024050632

To address the complexity and low accuracy of generating refined inspection trajectories for Unmanned Aerial Vehicles （UAVs） inspecting aerial artificial targets such as power transmission lines， as well as the inability to capture detailed local features of such targets from optimal angles， a real-time depth perception and line component segmentation and localization algorithm for refined UAV inspection of power transmission lines was proposed， and an optimal inspection point path for monocular vision perception， positioning， and navigation of power transmission lines was constructed. In the method， the UAV position and gimbal camera shooting angle were adjusted quantitatively in real time during the inspection process， so that a safe inspection distance was maintained while the gimbal camera captured clear and accurate images of the targets to be inspected. Experimental simulations were carried out by using real data collected by DJI UAV and data from an Unreal Engine 4 scenario. The results demonstrate that the optimized depth perception algorithm and the line component segmentation and localization algorithm meet real-time requirements. Guided by the output of depth perception and of segmentation and localization， the UAV position and gimbal camera posture can be adjusted optimally， resulting in high-quality UAV inspection images of power transmission lines， and the finally generated refined inspection trajectories can significantly improve the inspection efficiency of operation and maintenance personnel.
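The geometric core of keeping a safe distance while aiming the gimbal at a target can be sketched as follows. The coordinate convention （x and y horizontal， z up， yaw measured from the x axis， pitch positive upward）， the function name， and the choice to move the UAV along the current line of sight are assumptions for illustration and are not taken from the paper.

```python
import math


def inspection_pose(uav, target, safe_distance):
    """Return (new UAV position, gimbal yaw in degrees, gimbal pitch in degrees).

    The new position lies on the line through the UAV and the target,
    exactly safe_distance away from the target.
    """
    dx, dy, dz = (t - u for u, t in zip(uav, target))
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist == 0.0:
        raise ValueError("UAV and target coincide; line of sight is undefined")
    scale = safe_distance / dist
    new_pos = tuple(t - d * scale for t, d in zip(target, (dx, dy, dz)))
    yaw = math.degrees(math.atan2(dy, dx))
    pitch = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return new_pos, yaw, pitch


# Target 10 m ahead at the same height; hold a 4 m stand-off distance.
print(inspection_pose((0.0, 0.0, 10.0), (10.0, 0.0, 10.0), 4.0))
```

In a real system the target position would come from the depth perception and localization outputs， and the stand-off direction would also have to respect obstacles and line clearance.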

Table of Contents