Journal of Computer Applications

Review of vision-language model architecture development

Ziquan LIU, Xuyang SHI, Ke LI, Liang LIU, Zhewei ZHU

2026, 46(6): 1703-1711. DOI: 10.11772/j.issn.1001-9081.2025060695

With the advancement of deep learning technologies， artificial intelligence has been driven to evolve from single-modality intelligence toward multimodal intelligence. Vision-Language Models （VLMs）， which serve as the pivotal means of bridging vision and language， have been established as a core research area. Aiming at the technological evolution of VLMs， the architecture development of VLMs was reviewed systematically， and the core technologies and latest research progress in this field were summarized. Firstly， the progression of VLMs from early explorations to the current flourishing state was traced， key technological nodes and development trends were analyzed， and a technology roadmap with “architecture development” as the core theme was delineated. Secondly， the current foundational techniques of VLMs were analyzed in depth， including core architectures built around vision encoders， language encoders， and cross-modal fusion mechanisms， as well as key pretraining optimization objectives such as Masked Language Modeling （MLM）， Masked Image Modeling （MIM）， and Contrastive Learning （CL）. Concurrently， the mainstream datasets that VLM pretraining relies on， such as COCO and LAION-5B， were listed systematically. Finally， representative VLMs were compared and analyzed to discover the relationships among model performance， data scale， architectural innovations， and training strategies， and the advantages and limitations of the related core technologies were discussed， thereby providing a comprehensive VLM technology map for researchers in related fields， and offering reference and inspiration for future research.
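
The abstract above names Contrastive Learning (CL) among the pretraining objectives of VLMs. As a rough illustration only, the following sketch computes a symmetric image-text contrastive loss in the commonly used CLIP-style form; the function names, the temperature value, and the toy embeddings are assumptions made for this example and are not taken from the reviewed paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.

    Row k of image_embs is assumed to be paired with row k of text_embs.
    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: the same, over columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return (i2t + t2i) / 2


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_style_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))
    print(clip_style_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))
```

Correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized heavily, which is the behavior the objective relies on.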

Large language model-enhanced ant colony optimization for multi-solution traveling salesman problems

Taixin CAI, Fengfeng WEI

2026, 46(6): 1712-1720. DOI: 10.11772/j.issn.1001-9081.2025050646

In Combinatorial Optimization （CO） problems， the Multi-Solution Traveling Salesman Problem （MSTSP） aims to acquire a set of distinct globally optimal paths， and plays a critical role in scenarios such as logistics scheduling and tour route planning. As a traditional approach for solving path optimization problems， Ant Colony Optimization （ACO） suffers from bottlenecks including premature pheromone convergence and imbalance between solution quality and diversity. To address these challenges， a Large Language Model （LLM）-enhanced ACO for MSTSP （L-ACO） was proposed to integrate LLMs into traditional ACO through a multi-layer prompt engineering strategy： during the solution construction stage， the city topological features were parsed， so as to construct high-quality diverse initial paths； in the perturbation optimization stage， new paths were generated on the basis of the paths in the solution pool and their statistical information， so as to escape from local optima. Additionally， a multi-dimensional evaluation system was developed to assess solution quality， diversity， and LLM effectiveness comprehensively. Experimental results on 25 MSTSP benchmark instances demonstrate that compared to traditional ACO， L-ACO improves the Structural Diversity Index （SDI） by 0.08 and the Quality-Quantity Composite Index （QQCI） by a relative 13%， indicating that L-ACO effectively improves convergence in multi-solution scenarios.
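
For readers unfamiliar with the baseline that L-ACO builds on, the following is a minimal sketch of classical ACO (Ant System) for a single-solution TSP. It does not include the LLM prompting, the solution pool, or the multi-solution logic described in the abstract; all parameter values and function names are assumptions chosen for illustration.

```python
import math
import random


def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def build_tour(dist, pher, alpha, beta, rng):
    """Construct one tour; the next city is drawn with weight pher^alpha * (1/d)^beta."""
    n = len(dist)
    tour = [rng.randrange(n)]
    unvisited = set(range(n)) - {tour[0]}
    while unvisited:
        cur = tour[-1]
        cands = sorted(unvisited)
        weights = [pher[cur][j] ** alpha * (1.0 / dist[cur][j]) ** beta for j in cands]
        nxt = rng.choices(cands, weights=weights, k=1)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def ant_colony(dist, n_ants=10, n_iters=50, alpha=1.0, beta=2.0, rho=0.5, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    pher = [[1.0] * n for _ in range(n)]
    best, best_len = None, math.inf
    for _ in range(n_iters):
        tours = [build_tour(dist, pher, alpha, beta, rng) for _ in range(n_ants)]
        # Evaporation: every trail decays by the factor (1 - rho).
        for i in range(n):
            for j in range(n):
                pher[i][j] *= 1.0 - rho
        # Deposit: each ant reinforces the edges of its tour, more for shorter tours.
        for t in tours:
            length = tour_length(t, dist)
            if length < best_len:
                best, best_len = t, length
            for k in range(n):
                a, b = t[k], t[(k + 1) % n]
                pher[a][b] += 1.0 / length
                pher[b][a] += 1.0 / length
    return best, best_len


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
    dist = [[math.dist(p, q) for q in pts] for p in pts]
    print(ant_colony(dist))
```

Because deposits concentrate on the edges of short tours, the colony tends to collapse onto a single path, which is the premature convergence that the abstract identifies as a bottleneck in the multi-solution setting.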

Government affairs hotline question answering system based on knowledge-enhanced large language model architecture

Longyu XIONG, Shengdong DU, Haochen SHI, Jie HU, Yan YANG, Tianrui LI

2026, 46(6): 1721-1727. DOI: 10.11772/j.issn.1001-9081.2025060727

A government affairs hotline Question Answering （QA） system based on a knowledge-enhanced Large Language Model （LLM） architecture， namely ChatGovt， was proposed to address the issues in current systems， such as the low efficiency of manual responses， as well as the inaccurate query screening mechanisms and insufficient recognition of intent differences in traditional Retrieval-Augmented Generation （RAG） systems. Firstly， to improve response efficiency， a system architecture integrating intelligent question routing and structured feedback was designed， thereby using intent recognition to classify and process consultation， complaint/suggestion， and other types of questions. Then， to improve the system’s knowledge retrieval quality， a multi-stage semantic-enhanced retrieval method was proposed， including three stages： historical dialogue summary retrieval， semantic re-ranking， and self-reflective decision-making. Finally， cross-domain knowledge was supplemented through online queries， so as to form a closed service loop for government consultation. Experimental results show that in terms of retrieval quality， compared with the traditional RAG system， ChatGovt improves the query-knowledge relevance， real answer-knowledge relevance， and knowledge support by 15.0%， 7.4%， and 24.6%， respectively； in terms of overall system performance， ChatGovt increases the answer recall by 55.4% compared with the fine-tuned GLM （General Language Model）4-9b-chat model， and improves the manual evaluation by 27.3% compared with the commercial system “Doubao”. It can be seen that this system provides a reference-worthy architecture and methodology for the technical optimization of government affairs hotline QA systems， can improve the response efficiency and service accuracy of government affairs hotlines effectively， and promotes the intelligent transformation of government services.

Masked autoencoder enhanced dynamic heterogeneous graph representation learning model

Haoran YUAN, Huan LIU, Pengfei JIAO, Zhidong ZHAO, Xianfei ZHANG, Zunliang LIU

2026, 46(6): 1728-1737. DOI: 10.11772/j.issn.1001-9081.2025060754

Real-world networks are often composed of multiple types of entities and interaction relationships， with topological structure and attributes evolving with time continuously. The heterogeneity and dynamics inherent in such networks can be fully described by Dynamic Heterogeneous Graph （DHG）. To solve the problems of coarse spatio-temporal information fusion and heavy reliance of the supervised learning paradigm on manual labels in the existing DHG representation learning models， a Masked AutoEncoder （MAE） enhanced DHG representation learning model was proposed. Firstly， heterogeneous spatial information was fused through a multi-level attention structure， and temporal information was fused across snapshots. Then， representation information of nodes was enriched by leveraging the reconstruction loss of the masked autoencoder. Experimental results show that improvements of at least 1.26 to 3.99 percentage points in Area Under the receiver operating Characteristic curve （AUC） are achieved by the proposed model on link prediction tasks compared to baseline models on multiple real-world datasets. It can be seen that the proposed model provides an effective self-supervised framework for DHG representation learning， facilitating more precise capture of heterogeneous information and dynamic evolution laws in real networks.

Low-rank adaptive parameter-efficient fine-tuning algorithm based on YOLOv11

Yi DU, Mingjin XU, Jiayi KONG, Liyao WANG, Chen ZHAO

2026, 46(6): 1738-1745. DOI: 10.11772/j.issn.1001-9081.2025060751

In view of the limitations of deep learning algorithms’ generalization and robustness， as well as the high computational cost of Full Parameter Fine-Tuning （FPFT） in object detection tasks in complex scenarios， a low-rank adaptive Parameter-Efficient Fine-Tuning （PEFT） algorithm based on YOLOv11 （You Only Look Once version 11） was proposed. Firstly， a Low-Rank Adaptation （LoRA） module was embedded into the backbone and neck networks of YOLOv11. Secondly， three low-rank decomposition algorithms， including LoRA， weight-Decomposed low-Rank Adaptation （DoRA） and Principal Singular values and Singular vectors Adaptation （PiSSA） were combined， and efficient parameter updates were achieved through weight decomposition and dynamic adjustment mechanisms. Finally， during the training process， most of the pre-trained weights of the YOLOv11 network were kept frozen， and only the low-rank matrices generated by the three low-rank decomposition algorithms in the LoRA module were trained， thereby reducing the trainable parameter size to 1.56% of the original algorithm. Experimental results on the COCO （Common Objects in COntext） dataset demonstrate that the proposed algorithm improves the precision， recall and mean Average Precision （mAP） at IoU （Intersection over Union） threshold of 0.5 by 4.18， 7.11 and 7.85 percentage points， respectively， compared with the baseline algorithm YOLOv11. It can be seen that the proposed algorithm provides an effective technical path for lightweight and efficient fine-tuning of large-scale detection algorithms in resource-constrained scenarios.
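
To make the low-rank idea concrete, the sketch below shows the standard LoRA update, in which a frozen weight matrix W0 is adjusted by a scaled product of two small trainable matrices, together with the resulting fraction of trainable parameters for one layer. It covers plain LoRA only, not DoRA, PiSSA, or the YOLOv11 integration; the matrix sizes and function names are assumptions for illustration, and the 1.56% figure reported in the abstract is not reproduced here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def lora_merged_weight(w0, b, a, alpha):
    """Return W0 + (alpha / r) * B @ A.

    w0: d_out x d_in frozen weight; b: d_out x r; a: r x d_in.
    Only a and b would be trained; w0 stays fixed.
    """
    r = len(a)  # the rank is the number of rows of A
    delta = matmul(b, a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]


def trainable_fraction(d_out, d_in, r):
    """Share of parameters trained by LoRA relative to full fine-tuning of one layer."""
    return r * (d_in + d_out) / (d_in * d_out)


if __name__ == "__main__":
    w0 = [[1.0, 2.0], [3.0, 4.0]]
    a = [[0.5, -0.5]]
    # With B initialised to zero, the merged weight equals the pretrained weight.
    print(lora_merged_weight(w0, [[0.0], [0.0]], a, alpha=2.0))
    print(lora_merged_weight(w0, [[1.0], [2.0]], a, alpha=2.0))
    print(trainable_fraction(64, 64, 4))
```

Initialising B to zero is the usual LoRA convention, so that fine-tuning starts exactly from the pretrained behavior.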

Federated learning framework integrating dynamic feature alignment and temperature-aware aggregation

Zhijian DONG, Ruichun GU

2026, 46(6): 1746-1755. DOI: 10.11772/j.issn.1001-9081.2025050661

To address the degradation of model performance caused by statistical heterogeneity under Non-Independent and Identically Distributed （Non-IID） data in federated learning， a Federated learning framework integrating Dynamic feature alignment and Temperature-aware Aggregation （FedDTA） was proposed. In the framework， client drift was mitigated through the collaboration of dynamic feature alignment and temperature-aware aggregation. The framework has two core components： a dynamic regularization approach based on Sliced Wasserstein Distance （SWD） was used to achieve local-global feature distribution alignment via low-dimensional Monte Carlo projections， thereby reducing computational complexity and suppressing feature drift； a hierarchical aggregation strategy incorporating a learnable projection network with annealing temperature scheduling was used to allocate client weights dynamically according to parameter differences. Experimental results indicate that under the strong heterogeneity condition （Dirichlet α=0.1）， FedDTA outperforms the second-best methods， FedKTL （Federated Knowledge-Transfer-Loop） and FedCMD （Federated learning with Contrastive cloud-edge Model Decoupling）， in accuracy by 1.698 and 0.714 percentage points on the CIFAR-10 and CIFAR-100 datasets， respectively， demonstrating superior generalization capability in multi-data scenarios. Ablation experimental results confirm that SWD alignment reduces feature drift significantly， while temperature scheduling optimization balances exploration with exploitation. Without exposing raw data， FedDTA provides theoretical and methodological support for privacy-sensitive scenarios such as medical collaboration and the industrial Internet of Things.
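
The Sliced Wasserstein Distance mentioned above can be estimated by projecting both feature sets onto random directions and comparing the sorted one-dimensional projections. The sketch below is a generic Monte Carlo estimator of the sliced 1-Wasserstein distance under the assumption of equal sample counts; it is not FedDTA's regularizer, and the number of projections and the function name are illustrative assumptions.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    xs, ys: lists of equal length containing feature vectors of equal dimension.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # In one dimension, optimal transport simply matches sorted samples.
        px = sorted(sum(a * b for a, b in zip(x, d)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, d)) for y in ys)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections


if __name__ == "__main__":
    local = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    drifted = [[x + 3.0, y] for x, y in local]
    print(sliced_wasserstein(local, local))
    print(sliced_wasserstein(local, drifted))
```

Each projection costs only a sort, which is why slicing is cheaper than solving a full optimal transport problem in the original feature space.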

Social bot detection framework fusing multi-scale wavelet enhancement and self-supervised learning

Yu CHEN, Shuaikang QI, Liwei XU, Haotian ZHU

2026, 46(6): 1756-1766. DOI: 10.11772/j.issn.1001-9081.2025060744

To address the limitations of the existing social bot detection methods in multimodal feature modeling， disguised behavior recognition， and generalization under weakly supervised scenarios， a social bot detection framework fusing multi-scale wavelet enhancement and self-supervised learning， named W2A-BotNet （Wavelet-to-Attention Bot Network）， was proposed. In the framework， a unified three-channel representation of textual semantics， user attributes， and social relations was constructed to alleviate modality conflicts； a Multi-scale Attention Wavelet Neural Operator Block （MAWNOBlock） was designed to perform time-frequency decomposition of behavioral sequences， thereby capturing both periodic patterns and abrupt anomalies； a multi-source collaborative fusion mechanism was introduced to achieve dynamic semantic alignment through cross-modal interactions and gating； a self-supervised pretraining task based on follower count distribution was incorporated to enhance feature representation under limited labeled data. Experimental results show that the accuracy of W2A-BotNet is improved by 0.35， 4.86， and 2.21 percentage points compared to the second-best methods on the Cresci-15， Cresci-17， and TwiBot-20 datasets， respectively. It can be seen that W2A-BotNet enhances the identification of bot accounts on social platforms effectively and provides a generalized detection framework for social network security governance.

Dual-channel multimodal sentiment analysis model based on contrast invariance and reinforcement specificity

Yunping HE, Leichun WANG, Ruirui SONG, Xiangfeng LU, Jinxiang WEI, Xiaomeng LIU

2026, 46(6): 1767-1775. DOI: 10.11772/j.issn.1001-9081.2025060731

In view of the problem that the existing Multimodal Sentiment Analysis （MSA） methods often produce inaccurate results due to modality heterogeneity and insufficient internal interaction， a dual-channel MSA model based on Contrast Invariance and Reinforcement Specificity （CIRS） was proposed. Firstly， the features in text， video and audio data were extracted and dimensionally aligned. Secondly， the invariant features of the modalities were contrasted for consistency， and the mutual learning of invariant features between modalities was enhanced through homogeneous graph distillation， so as to improve the representation consistency of the modalities. Thirdly， the specific features of the modalities were strengthened， and knowledge transfer of specific features between modalities was performed， so as to achieve semantic space alignment between modalities. Finally， the invariant features and specific features were deeply integrated through a self-attention mechanism and a cross-modal attention mechanism to make predictions. Experimental results show that compared with DLF （Disentangled-Language-Focused multimodal sentiment analysis）， CIRS reduces the Mean Absolute Error （MAE） by 4.11% and improves both the 2-Class Accuracy （Acc-2） and F1-score by 1.29% on the CMU-MOSI （Carnegie Mellon University Multimodal Opinion Sentiment Intensity） dataset； on the CMU-MOSEI （Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity） dataset， CIRS reduces the MAE by 1.85% and improves the Acc-2 and F1-score by 0.70% and 0.94%， respectively. The above verifies that CIRS can reduce errors and improve classification accuracy in multimodal sentiment analysis effectively.

Gradient orthogonal projection based continual embedding method for dynamic knowledge graph

Meihua WANG, Jie HUANG, Wen WEN, Ruichu CAI, Peijie HUANG, Yuhong XU, Xinlong LIN

2026, 46(6): 1776-1784. DOI: 10.11772/j.issn.1001-9081.2025060737

To address the issue that the existing Knowledge Graph （KG） embedding models cannot adapt to a continually growing KG， a dynamic KG continual embedding method based on gradient orthogonal projection， named GOPemb （Gradient Orthogonal Projection embedding）， was proposed. Firstly， the Core Gradient Spaces （CGSs） of old entities and old relationships were stored in historical snapshots during the training process. Secondly， when learning new triples， the gradient update directions for old entities and old relationships were constrained to directions orthogonal to their respective CGSs， thereby learning new knowledge efficiently while preserving historical knowledge effectively. Finally， the CGSs of old entities and old relationships were updated to prepare for the next learning iteration. Experimental results show that compared to the best method in the comparison group， IncDE （Incremental Distillation Embedding）， GOPemb achieves average improvements of 9.2%， 14.0%， and 8.0% in MRR （Mean Reciprocal Rank）， H@3 （Top-3 Hit Rate）， and H@10 （Top-10 Hit Rate）， respectively， on the selected datasets ICEWS05-15-CL， ICEWS18-CL， and GDELT-CL. Furthermore， experimental results on learning efficiency confirm the time efficiency of GOPemb， indicating that the method has efficient continual embedding capability.
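
The constraint described above, updating old embeddings only along directions orthogonal to a stored gradient space, can be illustrated with a plain projection. The sketch assumes the stored space is given as an orthonormal basis; how GOPemb actually builds and updates its Core Gradient Spaces is not specified in the abstract, so the function names and the toy basis here are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(grad, basis):
    """Remove from grad its component inside span(basis).

    basis: list of orthonormal vectors (an assumption of this sketch).
    Returns g - sum_i (g . u_i) u_i, which is orthogonal to every u_i.
    """
    out = list(grad)
    for u in basis:
        coeff = dot(grad, u)
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out


if __name__ == "__main__":
    stored_space = [[1.0, 0.0, 0.0]]   # direction deemed important for old knowledge
    new_gradient = [3.0, 4.0, 5.0]     # gradient produced by new triples
    print(project_out(new_gradient, stored_space))  # [0.0, 4.0, 5.0]
```

A step along the projected gradient leaves the embedding unchanged along the stored directions, which is the intuition behind preserving historical knowledge while learning new triples.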

Legal case retrieval method via case information reformulation using large language model

Jintao WANG, Zhilin GAO, Qixiang MENG, Fanliang BU

2026, 46(6): 1785-1792. DOI: 10.11772/j.issn.1001-9081.2025050662

With the advancement of intelligent judiciary construction， legal case retrieval technology has garnered significant attention due to its crucial role in ensuring judicial fairness and efficiency. However， the existing text retrieval methods still face the following challenges： the traditional models are susceptible to interference from semantic structural similarities， making it difficult to capture elements that influence judgments accurately； the pre-trained language models are constrained by input length， leading to insufficient global semantic modeling of lengthy legal texts； and the existing aggregated similarity scoring mechanisms are prone to noise interference and lack strong interpretability. To address these challenges， a legal case retrieval method via case information reformulation using a Large Language Model （LLM） was proposed. Firstly， an LLM was employed to extract information from case texts， so as to combine case elements， descriptions of applicable legal provisions for crimes， and case behavior chains into sub-facts of cases， thereby reducing information redundancy. Secondly， in the encoding part， an SFA-SAILER （Selective Feature Attention & Structure-Aware pre-traIned language model for LEgal case Retrieval） encoding architecture was designed. Thirdly， by deeply encoding case information along two dimensions， word and feature， the dependency between case information and encoding dimensions was enhanced. Finally， the MaxSim operator was used to aggregate similarity scores. Experimental results show that on the LeCaRD （Legal Case Retrieval Dataset）， the proposed model achieves a mean Average Precision （mAP） and Top-3 Precision （P@3） of 67.45% and 60.95%， respectively， and achieves higher Top-K Normalized Discounted Cumulative Gain （NDCG@K） than the comparison models. It can be seen that the proposed model offers a new idea that integrates legal logic with deep semantic understanding for legal case retrieval， and has practical value for intelligent judiciary applications.
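
The MaxSim operator used for score aggregation can be sketched as follows, in the late-interaction form popularized by ColBERT: each query-side vector is matched to its most similar candidate-side vector, and the maxima are summed. Treating the vectors as embeddings of case sub-facts is an assumption of this example, as are the function names and toy values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query vectors of the maximum dot product with any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


if __name__ == "__main__":
    query_case = [[1.0, 0.0], [0.0, 1.0]]        # e.g. two sub-fact embeddings
    candidate_case = [[1.0, 0.0], [0.5, 0.5]]
    print(maxsim_score(query_case, candidate_case))  # 1.5
```

Because every query vector contributes only its best match, unrelated parts of a long candidate text cannot dilute the score, and the matched pairs can be inspected to explain a ranking.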

Complex event extraction method based on event element relation recognition and complete subgraph search

Junchi ZHANG, Naiyun ZHANG, Qun HOU

2026, 46(6): 1793-1800. DOI: 10.11772/j.issn.1001-9081.2025050665

In response to the limitations of the existing complex Event Extraction （EE） methods in event classification， particularly their inability to handle the issue of a single trigger word activating multiple events of the same type， a complex EE method based on event element relation recognition and complete subgraph search was proposed to improve the effects of complex event classification. Firstly， a concise word-pair relation labeling system was designed， incorporating Span relations to identify event element boundaries and Event-Internal （EI） relations to determine whether elements belonged to the same event. Secondly， a single-stage word-pair relation recognition model was constructed， where text representations were obtained through an encoding layer， event type information was injected via an event information fusion layer， and word-pair relations were predicted using a distance-aware scoring function in the prediction layer. Finally， based on the predicted EI relations， an undirected graph was built， and a recursive complete subgraph search algorithm was designed to classify event elements， thereby theoretically enabling the complete extraction of complex events of all patterns. Experimental results show that the proposed method outperforms various baselines such as BERT-CRF-joint， PLMEE （Pre-trained Language Model for EE）， and CasEE （Cascade decoding for EE） in complex EE on the FewFC （Few-shot Financial Corpus） and DuEE （Dataset for Chinese EE） datasets. It can be seen that the method addresses the issue of a single trigger word activating multiple events of the same type effectively， leading to a comprehensive extraction of complex events.
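
The final step described above groups event elements by searching for complete subgraphs in an undirected graph of EI relations. The abstract does not give the paper's algorithm, so the sketch below substitutes the classical Bron-Kerbosch recursion for enumerating maximal cliques; the node names and the toy graph, in which one trigger T participates in two events, are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal complete subgraphs with basic Bron-Kerbosch."""
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: candidates that extend it; x: already explored.
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


if __name__ == "__main__":
    # Predicted "same event" relations: trigger T is shared by two events.
    ei_edges = [("T", "A1"), ("T", "A2"), ("A1", "A2"),
                ("T", "B1"), ("T", "B2"), ("B1", "B2")]
    print(maximal_cliques(build_adjacency(ei_edges)))
    # [['A1', 'A2', 'T'], ['B1', 'B2', 'T']]
```

Each maximal clique is read as one event, so a trigger shared by two cliques yields two separate events of the same type rather than one merged event.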

Sign language generation model based on Kolmogorov-Arnold network and diffusion Transformer

Lili HE, Meng CAO, Lei ZHANG, Hongjun PAN, Yi LIU, Chengxin SUN

2026, 46(6): 1801-1810. DOI: 10.11772/j.issn.1001-9081.2025060730

To address the problems of blurry generation results， detail loss， and uneven feature distribution caused by insufficient local information extraction of the existing models in sign language generation tasks， a sign language generation model based on Kolmogorov-Arnold Network （KAN） and Diffusion Transformer （KDT） was proposed. Firstly， the nonlinear approximation capability of the KAN was utilized to fit complex data distribution， so as to enhance the detail representation and motion fluency between video frames， thereby addressing the blurriness problem of videos generated by traditional Multilayer Perceptron （MLP） models. Then， Contrast Normalization （ContraNorm） was used to replace the original normalization， so as to address the uneven feature distribution problem by calibrating differences in feature scales， thereby ensuring the model’s stability with poor data quality and interference. Finally， diffusion Transformer was employed to achieve refined evolution from random noise to the target sequence through multi-step iterative optimization， thereby addressing the detail loss problem of traditional models. Experimental results on the validation set of RWTH-Phoenix-2014T continuous sign language dataset show that compared to the Sign-IDD （Sign-Iconicity Disentangled Diffusion） model， this model has the BLEU-1 （Bilingual Evaluation Understudy 1-gram） and ROUGE （Recall-Oriented Understudy for Gisting Evaluation） metrics improved by 8.1% and 5.9%， respectively， and the Word Error Rate （WER） metric reduced by 4.5%. The above results verify the effectiveness of this model in enhancing the richness of video details and the fluency of sign language movements.

CORER： collaborative multi-knowledge large language model prompt framework for IT application innovation database migration

Yusheng YI, Zhaohao HUANG, Zihao DENG, Leilei KONG, Haoliang QI

2026, 46(6): 1811-1817. DOI: 10.11772/j.issn.1001-9081.2025060745

The main task of Information Technology (IT) application innovation database migration is to migrate the data structure and data from non-domestic databases to domestic databases smoothly. In view of the challenges of syntax differences and complex business logic adaptation between heterogeneous databases in the current IT application innovation database migration， a collaborative multi-knowledge Large Language Model （LLM） prompt framework for IT application innovation-oriented database migration， CORER （Context-Objective-Rules-Examples-Response）， was proposed. Firstly， an openGauss Structured Query Language （SQL） syntax rule knowledge base covering 199 SQL syntax rule types and containing 4 162 syntax rules was constructed， and a migration sample knowledge base covering 20.6% of the syntax rule types was constructed by integrating official templates and real cases. Then， the syntax rule knowledge and migration sample knowledge were injected into the LLM context based on the prompt elements， thereby matching the syntax， logic and architecture characteristics of heterogeneous databases adaptively， and guiding the LLM to complete the SQL statement refactoring accurately. Experimental results show that the accuracy of CORER in the MySQL to openGauss migration task is 93.44%， which is 1.31 percentage points higher than that of the rule-based method， and is increased by 7.02% in advanced feature scenarios such as stored procedures and triggers， verifying the comprehensive advantages of CORER in IT innovation-oriented database migration scenarios.
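
The five prompt elements in the CORER acronym suggest a simple assembly step in which retrieved rules and migration samples are placed into fixed sections of the LLM context. The sketch below shows one way such a prompt could be assembled; the section layout, the function name, and every placeholder string are hypothetical and do not come from the paper or from openGauss documentation.

```python
def build_corer_prompt(context, objective, rules, examples, response_format):
    """Assemble a prompt with Context, Objective, Rules, Examples, Response sections.

    rules: list of rule strings retrieved from a syntax rule knowledge base.
    examples: list of (source_sql, target_sql) pairs from a sample knowledge base.
    """
    sections = [
        ("Context", context),
        ("Objective", objective),
        ("Rules", "\n".join(f"- {rule}" for rule in rules)),
        ("Examples", "\n\n".join(f"Source:\n{src}\nTarget:\n{dst}" for src, dst in examples)),
        ("Response", response_format),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)


if __name__ == "__main__":
    prompt = build_corer_prompt(
        context="Migrating SQL statements from MySQL to openGauss.",
        objective="Rewrite the given statement so that it runs on the target database.",
        rules=["<retrieved syntax rule 1>", "<retrieved syntax rule 2>"],
        examples=[("<source statement>", "<migrated statement>")],
        response_format="Return only the rewritten SQL statement.",
    )
    print(prompt)
```

Keeping the sections in a fixed order makes it straightforward to swap in different retrieved rules and samples for each statement being migrated.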

Contrastive collaborative filtering method based on graph diffusion generation and adaptive sampling

Hang QI, Tingting DONG, Yongqiang NAI, Xian MO

2026, 46(6): 1818-1828. DOI: 10.11772/j.issn.1001-9081.2025060729

Aiming at the problems of the existing Graph Neural Network （GNN）-based collaborative filtering methods under sparse and noisy data conditions， such as the obscuring of true signals by static noise injection， the inability of fixed semantic prototypes to capture dynamic user interests， and the high computational overhead of complex augmentation， a graph diffusion generation and adaptive sampling-based contrastive collaborative filtering method was proposed. Firstly， a lightweight graph diffusion generation mechanism based on gradual denoising was designed， so as to optimize node representations through forward noise-adding and reverse denoising， thereby generating noise-resistant contrastive views. Then， random masking was integrated with Random Walk with Restart （RWR） to model local neighborhood features and global structural semantics collaboratively， thereby generating high-quality negative samples. Finally， an improved InfoNCE （Information Noise Contrastive Estimation） loss function was introduced to optimize the multi-view contrastive learning objective and enhance the discriminative power of representations. Experimental results on Gowalla， Yelp， and Amazon datasets show that compared to the best-performing baseline method， the proposed method improves the Top-20 Recall （Recall@20） by 0.63%， 1.36%， and 1.88%， respectively， and the Top-40 Normalized Discounted Cumulative Gain （NDCG@40） by 0.95%， 1.47%， and 1.24%， respectively， as well as improves the recommendation performance for long-tail users by 26.7%， increases the training efficiency by 90%， and accelerates the convergence speed by 32%. It can be seen that the proposed method enhances the noise resistance and dynamic adaptability of recommendation systems in open environments significantly.
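
Random Walk with Restart (RWR), mentioned above as a way to capture global structural semantics, assigns each node a proximity score relative to a seed node. The sketch below computes those scores by power iteration on a small undirected graph; how the paper combines RWR with random masking to pick negative samples is not described in the abstract, so the restart probability, the function name, and the toy graph are assumptions.

```python
def random_walk_with_restart(adj, seed_node, restart=0.15, n_iters=100):
    """Power iteration for RWR proximity scores.

    adj: dict mapping each node to the set of its neighbours.
    At every step the walker returns to seed_node with probability `restart`,
    otherwise it moves to a uniformly chosen neighbour.
    """
    nodes = sorted(adj)
    p = {v: 1.0 if v == seed_node else 0.0 for v in nodes}
    for _ in range(n_iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Isolated node: send its walking mass back to the seed.
                nxt[seed_node] += (1.0 - restart) * p[u]
                continue
            share = (1.0 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[seed_node] += restart  # total probability mass is 1
        p = nxt
    return p


if __name__ == "__main__":
    path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(random_walk_with_restart(path, "a"))
```

Nodes with low scores relative to a user or item are structurally distant from it, which is one plausible reason such scores are useful when choosing contrastive samples.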

Quality of service prediction model for data sparsity and cold start problems

Bingqing LI, Binhao HUANG, Yubei TANG, Baili ZHANG

2026, 46(6): 1829-1835. DOI: 10.11772/j.issn.1001-9081.2025060675

Aiming at the problem of data sparsity caused by few connections between users and service nodes in World Wide Web （Web） Quality of Service （QoS） prediction， as well as the cold start problem caused by the lack of historical call data， a QoS prediction model for data sparsity and cold start problems was proposed. Firstly， a random propagation strategy was adopted， where multiple augmented graphs were generated by dropping nodes randomly for propagation， so as to achieve data augmentation. At the same time a consistency regularization method was used to optimize the prediction consistency between multiple augmentations， thereby alleviating the data sparsity. Secondly， a multi-factor similarity calculation method was proposed， so that random node dropping was combined to construct user and service context subgraphs. Thirdly， graph contrastive learning was introduced to train each subgraph， making the context embedding representations of similar nodes closer， thereby alleviating the cold start. Experimental results show that compared with the existing QoS prediction models， this model maintains better performance in all scenarios with data density from 0.5% to 4.0%. It can be seen that this model provides a new paradigm for graph random neural network to process sparse data theoretically， and in application， it can improve the service recommendation accuracy of platforms such as community intelligent management and e-commerce， as well as reduce the trial-and-error cost of service calls.

Multi-level neighborhood contrastive attribute graph clustering based on adaptive learning

Jinghong WANG, Xiao CHEN, Yingmei MA, Bi LI, Jusheng MI, Wei WANG

2026, 46(6): 1836-1843. DOI: 10.11772/j.issn.1001-9081.2025050647

Recently， deep graph clustering methods have shown outstanding performance in graph clustering studies. However， most existing deep graph clustering methods are based on the auto-encoder framework， and are vulnerable to the choice of reconstruction strategies and graph augmentation strategies. Therefore， a deep graph clustering method based on contrastive learning was proposed， namely Multi-level Neighborhood Contrastive attribute Graph Clustering based on adaptive learning （MNCGC）. Firstly， a dual masking strategy was designed to generate an adaptive augmented graph， in which node importance was used to generate edge weights， that is， edge masking probabilities， and a fixed masking probability was set for node feature masking， so as to remove redundant information in the graph and provide rich sample pairs for neighborhood contrastive learning. Then， the edge weights were introduced into the neighborhood contrastive learning， so that the enhanced neighborhood contrastive learning was applied to the original graph and the augmented graph at the coding level and the projection level， thereby emphasizing both local information learning and global high-level semantic information learning. Finally， self-supervised clustering and coding-level representation were used to promote each other， thereby further improving the clustering effect. Experimental results on three benchmark datasets including Cora， CiteSeer and PubMed show that compared with fourteen advanced methods， MNCGC achieves optimal values in most cases across four indicators： accuracy， Normalized Mutual Information （NMI）， Adjusted Rand Index （ARI） and F1-score， fully verifying the effectiveness of the proposed method.

Multi-view consistency-driven robust feature selection method

Xue XU, Hu FAN, Yandan WANG, Xue DING, Xuefeng GAO, Bo ZHANG, Bo LIU, Beihong JIN

2026, 46(6): 1844-1854. DOI: 10.11772/j.issn.1001-9081.2025060685

Identifying important features from high-dimensional complex industrial data is crucial for production process anomaly monitoring. Aiming at the problem that the existing feature selection algorithms have difficulty modeling the complex intrinsic structure of data in the face of noise disturbance， a Multi-view Consistency-driven Robust feature selection method （MCR） was proposed. Firstly， a consistency-guided denoising mechanism with structure preservation was designed， in which multi-view collaborative modeling and inconsistency region detection were used to eliminate local noise disturbance while improving structural fidelity and integrity of the raw data. Then， a joint discriminative and consistency-driven feature fusion module was constructed， where high-quality multi-view embedding representations and a feature weight matrix were learned simultaneously， thereby enhancing the ability to perceive key feature dimensions. Finally， a cooperative sparse regularization-based feature selection strategy was introduced， so as to select the most discriminative and structurally consistent subset of features from the fused embedding space. Without relying on labeled information， this method achieves perception and selection of key feature dimensions through multi-view collaborative modeling and consistency-driven optimization. Extensive experimental results on several public benchmark datasets and a real-world cigarette production dataset demonstrate that MCR outperforms the existing mainstream feature selection methods such as Binary Horse herd Optimization Algorithm （BinHOA） and Improved Binary DJaya Algorithm （IBJA）， achieving classification accuracy improvements of 0.23 to 12.15 percentage points on the public datasets and 2.22 to 5.00 percentage points on the real industrial dataset， validating its robustness and effectiveness in complex scenarios.

Graph neural network node classification model incorporating clustering coefficients

Yasong ZHANG, Bihui CONG, Shuang XU

2026, 46(6): 1855-1862. DOI: 10.11772/j.issn.1001-9081.2025060793

To address the structural unfairness and classification inaccuracy of the Graph ATtention network (GAT) model in node classification tasks, a Graph Neural Network (GNN) node classification model incorporating clustering coefficients, named GATcc (GAT with clustering coefficient), was proposed. Firstly, the clustering coefficients of neighboring nodes were introduced as structural information and combined with trainable weight parameters, which enhanced the representation of topological structure in the attention mechanism. Then, feature scaling was employed to optimize node embeddings, and residual connections were added to mitigate the risk of feature over-smoothing. Experimental results on six real datasets demonstrate that the proposed model outperforms mainstream models such as Graph Isomorphism Network (GIN) and GOAT (Graph Ordering Attention Network) in classification accuracy. For instance, compared with the baseline model GAT on the Cora dataset, the proposed model improves classification accuracy by 4.03 percentage points, reduces the structural bias from 0.31% to 0.11%, and improves the classification accuracy of isolated nodes by 3.69 percentage points. In conclusion, the proposed model not only achieves significant improvements in classification performance, but also shows superiority in structural fairness and stability.
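
For readers unfamiliar with the structural signal used here, the local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected. The sketch below computes it on a toy undirected graph; the adjacency data is made up, and how GATcc feeds this value into attention is not shown because the abstract does not specify it.

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """Local clustering coefficient: linked neighbor pairs / all neighbor pairs."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # usual convention for isolated and degree-1 nodes
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

# Toy undirected graph as an adjacency dict of sets (hypothetical data).
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0},
    4: set(),  # isolated node
}
coeffs = {n: clustering_coefficient(adj, n) for n in adj}
```

Node 0 has three neighbors but only one linked pair, giving 1/3, while node 1 sits in a closed triangle and gets 1.0. Isolated nodes get 0 under this convention, which is one reason they need separate attention in fairness evaluations.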

Time series forecasting model based on dynamic weighted ensemble

Xinru LIU, Songhua LIU, Lusha QI, Yaofei MENG

2026, 46(6): 1863-1871. DOI: 10.11772/j.issn.1001-9081.2025060707

To address the inadequate adaptability of existing time series forecasting methods under rapidly changing data distributions and their difficulty in balancing prediction accuracy with computational overhead, a time series forecasting model based on dynamic weighted ensemble, namely TFEM (Time-Frequency Ensembled Model), was proposed. Firstly, in the time domain module, a Low-Rank Self-Attention (LRSA) mechanism was designed to calculate attention by projecting high-dimensional features into a low-dimensional space, which reduced complexity while maintaining long-range dependency modeling. Simultaneously, in the frequency domain module, the signal was decomposed into dominant frequency components and non-stationary residuals to model global trends and local mutations, respectively, thereby enhancing the modeling capability for complex time series. Finally, at the ensemble level, a long- and short-term harmonic balanced weighting mechanism was proposed, in which long-term weights were used to capture global trends robustly through recursive updates, short-term weights were used to respond promptly to data distribution mutations via a Multi-Layer Perceptron (MLP), and a smoothing factor was incorporated to suppress violent fluctuations in the weights. Experimental results demonstrate that, compared with the online ensemble model OneNet (Online Network), TFEM reduces the Mean Squared Error (MSE) by 6.4% to 44.8% and the Mean Absolute Error (MAE) by 2.8% to 17.6% on seven benchmark datasets, while reducing the number of parameters by 69.4% and the inference time by 50.5% on the ETTh1 dataset. TFEM thus enhances prediction accuracy while reducing computational overhead, providing a feasible solution for time series forecasting in resource-constrained scenarios.
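
The role of a smoothing factor in an ensemble can be illustrated with a generic exponential blend of the previous weights and newly proposed ones. This is only a sketch of the general idea: the update rule, the function names, and the value 0.9 are assumptions, not TFEM's actual weighting mechanism.

```python
def smooth_weights(previous, proposed, smoothing=0.9):
    """Blend old and new ensemble weights, then renormalize to sum to 1."""
    blended = [smoothing * p + (1.0 - smoothing) * q
               for p, q in zip(previous, proposed)]
    total = sum(blended)
    return [b / total for b in blended]

def ensemble_forecast(predictions, weights):
    """Weighted combination of the branch predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Two hypothetical branches (time domain, frequency domain).
previous = [0.5, 0.5]
proposed = [1.0, 0.0]          # an abrupt swing suggested by a short-term signal
weights = smooth_weights(previous, proposed)
forecast = ensemble_forecast([2.0, 4.0], weights)
```

Even though the proposed weights jump entirely to the first branch, the smoothed weights move only from 0.5 to about 0.55, which is the damping effect the abstract attributes to the smoothing factor.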

Time series prediction of environmental electric field intensity with generalized correlation entropy loss function-based Transformer model

Wenjun FENG, Xinwei SONG, Yuntao YUE

2026, 46(6): 1872-1880. DOI: 10.11772/j.issn.1001-9081.2025050560

Predicting the time series of electromagnetic radiation in the environment is of great significance for public health protection and for the adaptability of electronic devices to the electromagnetic environment. Because the high volatility of environmental electric field intensity time series produces many outliers that interfere with model training, a Generalized Correlation entropy Loss function-based Transformer (GCL-Transformer) model was proposed. By applying nonlinear weighting to errors through kernel mapping, the model combined the gradient smoothness of Mean Square Error (MSE) with the outlier robustness of Mean Absolute Error (MAE), effectively weakening the interference of outliers with model training. Data were collected at three typical electromagnetic exposure monitoring sites in Beijing, and validation was carried out through multiple sets of prediction experiments across time scales. Comparisons were made with the traditional Transformer model, the variant model TOEformer (Temporal-Optimized Enhanced Transformer), and the Long Short-Term Memory (LSTM) model. Experimental results indicate that the GCL-Transformer model significantly outperforms the comparison models in prediction accuracy. In short-term tasks with a prediction interval of one hour, the Root Mean Square Error (RMSE) of GCL-Transformer reaches 0.0906 V/m, which is 30.6% lower than that of the traditional Transformer model (0.1307 V/m). Moreover, as the prediction interval extends to 72 hours, its error growth is the slowest (RMSE increases only from 0.0906 V/m to 0.1234 V/m), demonstrating excellent long-term prediction stability.
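
To make the outlier argument concrete, the sketch below compares MSE with a correntropy-style loss of the form 1 - exp(-|e/beta|^alpha), which is bounded per sample. The abstract does not give the exact loss used in GCL-Transformer, so this functional form and the parameters `alpha` and `beta` are assumptions taken from the general literature on generalized correntropy.

```python
import math

def mse_loss(errors):
    """Mean squared error: grows without bound as an error grows."""
    return sum(e * e for e in errors) / len(errors)

def correntropy_loss(errors, alpha=2.0, beta=1.0):
    """Mean of 1 - exp(-|e/beta|^alpha); each term lies in [0, 1)."""
    return sum(1.0 - math.exp(-(abs(e / beta) ** alpha)) for e in errors) / len(errors)

# Two small residuals and one outlier (hypothetical values).
errors = [0.1, 0.1, 10.0]
mse = mse_loss(errors)
gcl = correntropy_loss(errors)
small_term = correntropy_loss([0.1])  # close to 0.1**2 for small errors
```

For small errors the loss behaves like the squared error, so gradients stay smooth, while the outlier's contribution saturates near 1 instead of dominating the average as it does in MSE.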

Spatial-frequency collaborative adversarial example generation method based on class activation mapping

Erhao SHU, Guoqing TU, Shubo LIU

2026, 46(6): 1881-1892. DOI: 10.11772/j.issn.1001-9081.2025060701

To address the limitation that existing image adversarial example generation methods apply only global, uniform transformations within a single domain, which restricts the attack success rate and transferability of adversarial examples, a Spatial-Frequency Collaborative adversarial example generation method based on Class Activation Mapping (CAM), named SFC-CAM, was proposed. Firstly, region sensitivity was quantified using CAM, and the input image was divided by Adaptive Partitioning (AP) into a high-sensitivity target region and a low-sensitivity background region according to a threshold on the activation value. Then, Channel Resampling-Block-wise Random Scaling (CR-BRS) was applied in the spatial domain to the high-sensitivity region, while Discrete Cosine Transform (DCT) with Spectral Random Masking (DCT-SRM) was applied in the frequency domain to the low-sensitivity region. Finally, adversarial examples were generated iteratively from the average gradient of the co-transformed images. Experimental results on the ImageNet dataset show that, with Inception-v3 as the source model, SFC-CAM improves the average attack success rate by 3.4 and 10.4 percentage points compared with the baseline methods Channel Augmented Attack Method (CAAM) and Spectrum Simulation Attack (SSA), respectively; compared with the proposed single-domain adversarial attack methods CR-BRS and DCT-SRM, SFC-CAM improves the average attack success rate by 15.9 and 19.7 percentage points, respectively. These results verify that SFC-CAM enhances the diversity of surrogate model decision boundaries, thereby achieving model augmentation and improving the black-box attack success rate and transferability of adversarial examples.
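
The partition-then-combine step can be sketched on a tiny activation map. The fixed threshold of 0.5, the helper names, and the placeholder "transformed" images are all hypothetical; the actual adaptive threshold and the CR-BRS and DCT-SRM transforms are not reproduced here.

```python
def partition_mask(cam, threshold):
    """1 marks high-sensitivity pixels (activation >= threshold), 0 marks background."""
    return [[1 if value >= threshold else 0 for value in row] for row in cam]

def blend_by_mask(mask, high_img, low_img):
    """Take pixels from high_img where the mask is 1 and from low_img elsewhere."""
    return [[h if m else l for m, h, l in zip(mrow, hrow, lrow)]
            for mrow, hrow, lrow in zip(mask, high_img, low_img)]

cam = [[0.9, 0.2],
       [0.6, 0.1]]                      # toy 2x2 activation map
mask = partition_mask(cam, threshold=0.5)
spatial_out = [[10, 10], [10, 10]]      # stand-in for the spatial-domain result
frequency_out = [[20, 20], [20, 20]]    # stand-in for the frequency-domain result
combined = blend_by_mask(mask, spatial_out, frequency_out)
```

Each pixel of the combined image comes from exactly one of the two domain-specific transforms, which is what makes the transformation non-uniform across the image.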

Image tampering localization and detection network under brightness-contrast disturbances

Xiaoqin YU, Wuyang SHAN, Junying QIU, Yu LIN, Ronghao YANG, Mao TIAN

2026, 46(6): 1893-1903. DOI: 10.11772/j.issn.1001-9081.2025050655

Digital image tampering detection is critically important in fields such as digital forensics and media content verification. However, in real-world applications, tampered images are often post-processed in brightness and contrast, which weakens tampering traces and degrades the performance of existing algorithms. To address this challenge, a restoration-assisted image tampering detection network, ReConWave-Net, was proposed. The network consisted of two key modules: a classification-guided image restoration module, which performed targeted restoration of images based on the category of image disturbance, thereby reducing the impact of brightness and contrast disturbances; and a tampering localization module, which strengthened the feature expression and localization of tampered regions through multi-scale wavelet features and a contrastive learning mechanism. The proposed network was evaluated on multiple datasets under various brightness and contrast disturbances. In terms of restoration quality, compared with the unrestored post-processed images, the proposed method increased the average Peak Signal-to-Noise Ratio (PSNR) in tampered regions from 10.86 dB to 31.57 dB and improved the average Structural SIMilarity index (SSIM) from 0.40 to 0.92; in terms of detection performance, under typical disturbances, the network achieved an F1 score of 0.730 and an Intersection over Union (IoU) of 0.653. These results show that combining targeted restoration with detection can significantly enhance the robustness of tampering localization on post-processed images.
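
As a reminder of what the PSNR figures measure, the sketch below implements the standard definition, 10·log10(MAX² / MSE), on flat lists of pixel values. The 8-bit peak value of 255 and the sample pixels are assumptions for illustration; the paper's evaluation restricts the computation to tampered regions, which is not modeled here.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [100, 120, 140, 160]
off_by_one = [101, 121, 141, 161]   # every pixel differs by 1, so MSE = 1
value = psnr(reference, off_by_one)
```

Because the scale is logarithmic, the reported jump from 10.86 dB to 31.57 dB corresponds to a reduction in mean squared error of roughly two orders of magnitude.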

Intelligent recommendation model incorporating decision cost constraints and Lagrangian solution algorithm

Jinpeng YE, Jiubing LIU, Zixing CHEN, Jiaxin LIU, Dun LIU, Biao XU

2026, 46(6): 1904-1912. DOI: 10.11772/j.issn.1001-9081.2025060736

To address the problem that existing intelligent recommendation methods do not consider decision cost constraints, an intelligent recommendation model incorporating decision cost constraints and a Lagrangian solution algorithm were proposed. Firstly, based on the user-item rating matrix, the SVD++ (Singular Value Decomposition Plus Plus) model was adopted to predict the unknown ratings of users on items. Secondly, according to the predicted ratings, a single-objective optimization model of intelligent recommendation under decision cost and distribution diversity constraints was constructed. Thirdly, the distribution diversity constraint was relaxed into the objective function, and a Lagrangian relaxation model under the decision cost constraint was established. Finally, a dual sub-gradient algorithm based on a greedy strategy was designed to solve the constructed Lagrangian relaxation model efficiently. Experimental results on the MovieLens dataset show that, compared with the Gurobi solver, the proposed algorithm reduces the solution time by at least 90.317%, with the objective function value decreased by no more than 0.694%; compared with the LightGCN (Light Graph Convolution Network) method, the constructed model achieves higher recommendation accuracy on all test cases and improves the distribution diversity on 77.8% of cases. These results verify the comprehensive advantages of the proposed model and solution algorithm in terms of efficiency and performance.
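
The relax-then-update loop can be sketched on a deliberately tiny problem: recommend exactly `k` items to maximize the sum of predicted ratings, with a relaxed requirement that at least `min_group` of them come from a designated group. Treating the cost constraint as a simple count (so the greedy inner step is exact), the constant step size, and all names and numbers are assumptions; the paper's actual model and algorithm are not reproduced.

```python
def solve_relaxed(ratings, in_group, k, lam):
    """Greedy inner step: keep the k items with the largest multiplier-adjusted score."""
    adjusted = [r + lam * g for r, g in zip(ratings, in_group)]
    order = sorted(range(len(ratings)), key=lambda i: adjusted[i], reverse=True)
    return sorted(order[:k])

def dual_subgradient(ratings, in_group, k, min_group, step=0.5, iterations=10):
    """Minimize the Lagrangian dual over lam >= 0 with a constant-step subgradient method."""
    lam = 0.0
    chosen = solve_relaxed(ratings, in_group, k, lam)
    for _ in range(iterations):
        chosen = solve_relaxed(ratings, in_group, k, lam)
        # Subgradient of the dual: how far the relaxed constraint is from its bound.
        slack = sum(in_group[i] for i in chosen) - min_group
        lam = max(0.0, lam - step * slack)
    return chosen, lam

ratings = [5.0, 4.0, 3.0, 2.0]   # hypothetical predicted ratings
in_group = [0, 0, 1, 1]          # 1 if the item belongs to the under-represented group
chosen, lam = dual_subgradient(ratings, in_group, k=2, min_group=1)
```

In this toy run the multiplier rises while the diversity requirement is violated, until the third item's adjusted score overtakes the second item's; the selection then satisfies the relaxed constraint and the multiplier stops changing.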

Airport gate assignment algorithm based on node prediction in conflict graph of assigned activities

Min LU, Hui ZHOU

2026, 46(6): 1913-1921. DOI: 10.11772/j.issn.1001-9081.2025060684

To address the challenge that existing methods struggle to balance solution efficiency, assignment quality, and generalization in airport gate pre-assignment at large hub airports under dynamic changes in the number of flights, gate layout, and assignment rules, an airport gate assignment algorithm based on node prediction in the conflict graph of assignment activities was proposed. Firstly, an airport gate assignment model was established with the objectives of maximizing the gate assignment rate and the cumulative soft preference. Secondly, the feasible airport gate assignment activities were screened through an airfield zoning strategy, the corresponding conflict graph of assignment activities was constructed, the Node-Edge Collaborative Updating Graph Neural Network (NECU-GNN) was designed, and the NECU-GNN for Node Prediction model (NECU-GNN4NP) was developed. Finally, the NECU-GNN4NP-guided ranking-based Max Weighted Independent Set Algorithm (MWISA) was proposed to solve for the optimal set of assignment activities in the conflict graph and obtain the airport gate assignment scheme. Experimental results based on Shenzhen Bao’an International Airport data show that, compared with the current optimal assignment scheme at that airport, the proposed algorithm increases the gate assignment rate by 4.2, 4.3, and 3.1 percentage points, improves the cumulative soft preference by 38.1%, 30.3%, and 42.8%, and reduces the solution time by 65.3%, 39.1%, and 41.4% in low-peak, normal, and high-peak scenarios, respectively. In addition, migration experimental results based on Yinchuan Hedong International Airport data demonstrate that the proposed algorithm can be adapted and applied to other airports rapidly. The proposed algorithm thus not only generalizes well but also enables efficient and high-quality airport gate assignment.
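
A ranking-guided independent-set selection can be sketched as follows: each node is a candidate (flight, gate) activity with a predicted score, edges join activities that cannot coexist, and nodes are accepted in descending score order unless a conflicting node was already taken. The scores, the conflict sets, and the purely greedy rule are illustrative assumptions rather than the paper's MWISA.

```python
def greedy_independent_set(scores, conflicts):
    """Pick nodes by descending score, skipping any node that conflicts with a pick."""
    chosen = []
    blocked = set()
    for node in sorted(scores, key=scores.get, reverse=True):
        if node in blocked:
            continue
        chosen.append(node)
        blocked.add(node)
        blocked.update(conflicts.get(node, ()))
    return chosen

# Hypothetical activities: two overlapping flights, two gates.
scores = {
    ("F1", "G1"): 0.9,
    ("F1", "G2"): 0.6,
    ("F2", "G1"): 0.8,
    ("F2", "G2"): 0.7,
}
# A flight can use only one gate; overlapping flights cannot share a gate.
conflicts = {
    ("F1", "G1"): {("F1", "G2"), ("F2", "G1")},
    ("F1", "G2"): {("F1", "G1"), ("F2", "G2")},
    ("F2", "G1"): {("F2", "G2"), ("F1", "G1")},
    ("F2", "G2"): {("F2", "G1"), ("F1", "G2")},
}
assignment = greedy_independent_set(scores, conflicts)
```

The second-highest score, F2 at G1, is skipped because G1 is already taken by F1, so F2 falls back to G2. The quality of such a greedy pass depends heavily on how well the scores rank the nodes, which is where a learned node predictor would come in.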

Multi-level teaching-learning-based optimization algorithm for green batch processing scheduling problem

Youlian ZHENG, Yingkun CUI, Deming LEI, Jing WANG

2026, 46(6): 1922-1930. DOI: 10.11772/j.issn.1001-9081.2025050652

To address the green parallel Batch Processing Machine (BPM) scheduling problem with redyeing operations in textile factory dyeing workshops, a Multi-level Teaching-Learning-Based Optimization (MTLBO) algorithm was proposed to minimize makespan, total energy consumption, and total weighted advance/delay cost. Firstly, heuristic rules were employed to generate the initial population and improve the quality of the initial solutions. Secondly, the population was divided by a multi-level structure into three layers (teacher group, elite class, and ordinary class), with an efficient inter-layer communication mechanism designed for information sharing and knowledge inheritance. Finally, to enhance the exploration ability of the population and keep the algorithm from being trapped in local optima, a diversity enhancement operator based on a probability model was introduced to replace stagnant solutions. Test instances generated from industrial data were used to evaluate MTLBO, which was compared with algorithms such as Adaptive Shuffled Frog-Leaping Algorithm (ASFLA), Multi-Objective Artificial Bee Colony (MOABC) algorithm, Fuzzy Genetic Algorithm (FGA), and Non-dominated Sorting Genetic Algorithm-II (NSGA-II). Experimental results indicate that, on average, MTLBO’s dominance-relation metric for the non-dominated solution set is 81.92% higher, its coverage metric is 97.58% lower, and its convergence metric is 99.66% lower than those of the compared algorithms. These results verify MTLBO’s superior exploration ability and stability in optimizing scheduling metrics, providing robust and efficient solutions for practical production decision-making.
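
Since the evaluation rests on non-dominated solution sets, a minimal sketch of Pareto dominance for minimization objectives may help. The three-objective tuples stand for (makespan, energy, weighted cost) and are invented values.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(solutions):
    """Keep only solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]

# Hypothetical schedules scored as (makespan, energy, weighted cost).
candidates = [(100, 50, 8), (90, 60, 9), (110, 55, 9)]
front = non_dominated(candidates)
```

The third schedule is worse than the first on all three objectives and is discarded, while the first two trade makespan against energy and both remain on the front.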

Backstepping-based prescribed-time control for fully actuated cascaded strict-feedback nonlinear systems

Ruicheng ZHANG, Litong ZHOU, Yinzhou MA, Weizheng LIANG

2026, 46(6): 1931-1935. DOI: 10.11772/j.issn.1001-9081.2025050650

The prescribed-time control of fully actuated cascaded strict-feedback nonlinear systems has hardly been explored, though such systems are widely applied in fields such as missile guidance, missile interception, and spacecraft attitude control. Therefore, a prescribed-time controller design method based on the backstepping control strategy was proposed. The method combined the backstepping concept with a non-scaling design, in which the system states were not scaled and virtual control laws were designed directly to construct the controller. The stability of the designed controller was demonstrated through proof by contradiction based on an analysis of the time derivative of the system’s Lyapunov function, ensuring that all system states converged asymptotically to zero within the specified time. This time is independent of initial conditions and can be set freely within physical limits. Numerical simulation results verify the effectiveness and practicality of the proposed controller.

Low-light image enhancement network based on lightweight residual and brightness-aware dynamic feature fusion

Songhao ZHU, Zhiyun ZHAO, Mengling WANG

2026, 46(6): 1936-1946. DOI: 10.11772/j.issn.1001-9081.2025050653

Low-light images often suffer from insufficient brightness, severe noise, detail loss, and color distortion, which significantly degrade visual quality and hinder the performance of subsequent vision tasks. To address these issues, a Low-Light Image Enhancement (LLIE) Network based on Lightweight Residual and Brightness-aware Dynamic feature fUsion (LRBDU-Net) was proposed. Firstly, a Lightweight Residual Feature Extraction (LRFE) module was designed in the encoding stage to mitigate the information loss caused by downsampling and improve the extraction of low-light features. Secondly, a Brightness-aware Deep Semantic feature Processing (BDSP) module was designed in the transition stage between encoding and decoding to strengthen the network’s perception and restoration of the brightness distribution of low-light images. Thirdly, a lightweight Dynamic Feature Fusion (DFF) mechanism was applied in the decoding stage to enhance the fusion of skip-connected and upsampled features, thereby improving the network’s noise suppression and detail restoration on low-light images. Fourthly, a Perception-Color Hybrid loss function (PCH) was proposed to further enhance the structural consistency and color reproduction of LLIE. Finally, a combined structure of Group convolution and Ghost convolution (GpGh) was used for lightweight network design, thereby ensuring LLIE quality while improving computational efficiency. Experimental results on the LOL (LOw-Light) datasets (LOL-v1, LOL-v2-real, and LOL-v2-syn) demonstrate that the proposed network achieves a Peak Signal-to-Noise Ratio (PSNR) of 23.71 dB, 21.46 dB, and 24.80 dB, respectively, and a Structural SIMilarity index (SSIM) of 0.852, 0.863, and 0.933, respectively. Overall, the network adopts a pure convolutional architecture and a lightweight design. Compared with the lightweight deep curve estimation method Zero-reference Deep Curve Estimation (Zero-DCE), the network achieves significantly better LLIE quality; compared with the attention-based LLIE Generative Adversarial Network EnGAN (Enlighten Generative Adversarial Network) and the Transformer-based LLIE method LLFormer (Low-Light Transformer), the network significantly reduces model complexity and inference cost while maintaining high LLIE performance. The proposed network thus balances LLIE performance (brightness improvement, noise suppression, detail restoration, structural integrity, and color reproduction) well against computational efficiency.

Frequency-domain driven and diffusion-based fusion for sonar image enhancement algorithm

Liwan YAO, Hailong LIU, Zhangfan ZENG

2026, 46(6): 1947-1955. DOI: 10.11772/j.issn.1001-9081.2025060678

To address the low contrast, severe noise interference, and limited resolution of sonar images in complex marine environments, as well as the limitation that existing algorithms operate mainly in the pixel domain and thus lack effective feature extraction, a Frequency-domain driven and Diffusion-based fusion for Sonar Image Enhancement algorithm (FDSIE) was proposed to enhance images by utilizing their frequency-domain features. Specifically, the algorithm comprises three components: a Compact Feature Extraction Network (CFEN), a Frequency-Domain Diffusion Module (FDDM), and a Frequency Recovery Fusion Module (FRFM). Firstly, the CFEN was designed to optimize and compress redundant channel features, effectively suppressing disturbances caused by ocean turbulence and acoustic artifacts. Then, the FDDM was incorporated, in which the diffusion generation submodule was used to train on, infer, and reconstruct the images, and the Selective Attention Feature Enhancement module (SAFE) was employed to maintain the integrity of key information while improving inference speed and reducing computational resource consumption, thereby enhancing the accuracy of the generated images. Finally, the FRFM was employed to fuse the low-frequency and diagonal-direction information of the images adaptively, thereby improving the representation of horizontal and vertical edge details and ultimately yielding clearer target contours and texture details. Experimental results on the public sonar dataset UATD (Underwater Acoustic Target Detection) show that the proposed algorithm achieves the best Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) values, 29.93 dB and 0.898, respectively, surpassing the second-best algorithms Pixel Attention Transform Mechanism (PATM) and FlowIE (Flow-based Image Enhancement framework) by 8% and 5%, respectively. In addition, the proposed algorithm achieves the lowest Learned Perceptual Image Patch Similarity (LPIPS) value, 0.103, which is 34% lower than that of the second-best algorithm FlowIE. These results demonstrate that the proposed algorithm provides superior image enhancement quality and perceptual consistency in sonar image enhancement tasks.

Horizon detection method for cross-camera bird’s-eye view road alignment

Wei WANG, Jiaxin LIU, Wanni XIANG, Hua CUI, Yangguang LI

2026, 46(6): 1956-1964. DOI: 10.11772/j.issn.1001-9081.2025060733

To address the problem that the limited field of view of each widely deployed highway surveillance camera makes large-scale continuous perception difficult, a cross-camera Bird’s-Eye View (BEV) road geometric alignment task was proposed to improve scene consistency and completeness. However, this task is challenged by the perspective differences and structural misalignments among multi-camera images. The horizon, as a global geometric prior, can unify these perspective differences, but its detection is easily affected by occlusion and environmental conditions, which limits alignment accuracy. To solve this problem, a horizon detection method for cross-camera BEV road alignment, named RoadHoriNet (Road Horizon detection Network), was proposed. Firstly, perspective transformation and bounding box cropping were applied for data augmentation. Secondly, a diamond space representation was introduced to alleviate instability in vanishing-point learning. Thirdly, Receptive-Field Attention Convolution (RFAConv) and upsampling by Dynamic Sampling (DySample) were used to enhance feature representation and reconstruction accuracy. Finally, a geometric consistency loss function was designed to strengthen the constraints on the orientation and position of the detected horizon. Experimental results demonstrate that, on the BrnoCompSpeed dataset, RoadHoriNet achieves a pixel error of 5.166%, an angle error of 0.0325°, and a detection accuracy of 94.834%, reducing the pixel error by 4.815 percentage points and the angle error by 0.0194° compared with the adaptive horizon detection method. In the task of cross-camera BEV road geometric alignment, the relative alignment accuracy reaches at least 99.129% after correction by RoadHoriNet, demonstrating its practicality and generalization potential in real-world traffic environments. RoadHoriNet thus provides a stable geometric prior for camera pose normalization and multi-camera coordinate unification, significantly improving the relative alignment accuracy and robustness of cross-camera BEV road geometric alignment.

3D human pose estimation model based on temporal-spatial feature pyramid network and multi-hypothesis interaction mechanism

Jinxiao ZHANG, Chenglong LI, Xinyan GAO, Ming ZHANG

2026, 46(6): 1965-1972. DOI: 10.11772/j.issn.1001-9081.2025060763

Accurately estimating ambiguous Three-Dimensional (3D) human poses from monocular videos is a current research challenge. Although existing methods can estimate 3D joint coordinates using deep learning models, most of them fail to adequately consider the multi-solution nature of this inverse problem. Some multi-hypothesis estimation methods address the multi-solution problem, but they suffer from insufficient cross-level feature fusion. To address these issues, a 3D human pose estimation model based on Temporal-SPatial Feature Pyramid Network (TSP-FPN) and a multi-hypothesis interaction mechanism, called TSP-FPN-MHFormer (Temporal-SPatial Feature Pyramid Network-Multi-Hypothesis Transformer), was proposed. Firstly, based on a Transformer encoder, the multi-possibility distribution of human poses was captured using a multi-head self-attention mechanism, thereby generating multiple initial hypothesis features. Then, a TSP-FPN was designed, and a gated adaptive fusion strategy was employed to achieve dynamic weighted integration of multi-level skeleton sequence features, thereby effectively balancing the fusion of local details and global temporal information. Finally, based on Multi-Hypothesis Transformer (MHFormer), a multi-hypothesis optimization module that combined joint Relative Position Bias (RPB) with a cross-attention mechanism was implemented, thereby facilitating cross-hypothesis communication and feature aggregation to enhance the model’s long-range reasoning about human topology for high-precision 3D joint coordinate estimation. Experimental results on the Human3.6M dataset demonstrate that the proposed model achieves a Mean Per Joint Position Error (MPJPE) of 42.3 mm and reduces the estimation error by 1.6% compared with the state-of-the-art method MHFormer, indicating substantial progress in addressing the multi-solution challenge of monocular 3D pose estimation.
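
MPJPE is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below shows the metric on two made-up joints; evaluation protocols (root alignment, per-action averaging) used on Human3.6M are not modeled.

```python
import math

def mpjpe(predicted, target):
    """Mean Per Joint Position Error: mean Euclidean distance over joints."""
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)

# Two hypothetical joints in millimetres: one exact, one off by a 3-4-5 triangle.
predicted = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error_mm = mpjpe(predicted, target)
```

Here one joint is exact and the other is 5 mm away, so the mean error is 2.5 mm. Because every joint counts equally, a few badly misplaced joints can dominate the score.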

Lightweight human pose estimation network based on redundant feature suppression

Chao LYU, Geyao MA

2026, 46(6): 1973-1980. DOI: 10.11772/j.issn.1001-9081.2025060700

To address the difficulty that existing Human Pose Estimation (HPE) networks have in balancing computational efficiency and localization accuracy in complex scenarios, a lightweight HPE network based on redundant feature suppression, named LE-SHNet (Lightweight Enhanced Stacked Hourglass Network), was proposed. Firstly, the Multiple Separated Hourglass Module (MSHM) was designed to employ heterogeneous convolution branches for differential modeling of the features of large joints and distal limbs while suppressing redundant computations. Then, the Shuffle Efficient Channel Attention (SECA) was integrated between MSHMs to combine channel shuffling and adaptive kernel convolution, enhancing hierarchical joint correlations with zero additional parameters. Finally, the Spatial and Channel Perception Module (SCPM) was constructed in non-MSHMs to strengthen the perception of key areas through spatial-channel reconstruction and the Triplet Attention (TA) mechanism. Experimental results show that LE-SHNet achieves an Average Precision (AP) of 88.7% on MPII (Max Planck Institute for Informatics) and 71.3% on COCO2017 (Common Objects in COntext 2017). Compared with the baseline network Two Stacked Hourglass Network (2-SHNet), it reduces the number of parameters by 49.3% and the computational cost by 28.2% while increasing the AP by 1.0 percentage points; compared with the lightweight HPE networks EL-HRNet (Efficient and Lightweight High-Resolution Network) and MobileMultiPose (Mobile-friendly and Multi-feature aggregation Pose estimation), LE-SHNet achieves AP improvements of 1.0 and 0.8 percentage points, respectively, while reducing the number of parameters by 32.0% and 26.7%, respectively. LE-SHNet thus maintains lightweight properties while improving keypoint localization accuracy, giving it potential application value for real-time deployment on edge devices in scenarios such as intelligent monitoring, human-computer interaction, and sports rehabilitation.
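
Channel shuffling, one ingredient of SECA, adds no parameters because it only reorders channels. The sketch below shows the usual group-wise shuffle on a list of channel indices (view as groups × channels-per-group, transpose, flatten); how SECA combines it with adaptive kernel convolution is not shown, and the function name is hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape to (groups, n), transpose, flatten."""
    total = len(channels)
    if total % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = total // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle(list(range(6)), groups=2)
```

After the shuffle, neighboring channels come from different groups, so a subsequent grouped or local operation mixes information that was previously isolated.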

Human dimension attention regressor method for monocular occluded human mesh recovery

Menghua WANG, Yukun DONG, Long CHENG, Junqi SUN

2026, 46(6): 1981-1988. DOI: 10.11772/j.issn.1001-9081.2025060705

In real-world scenarios, human images are often occluded by clothing, the subject’s own posture, and environmental objects, leaving insufficient visible information, so that existing human reconstruction methods tend to degrade to mean models in shape modeling and fail to recover real individual characteristics faithfully. To address this issue, a Human Dimension Attention Regressor (HDAR) method for monocular occluded human mesh recovery was proposed. Firstly, human dimensions in the visible region were used to infer the dimensions of occluded parts. Secondly, a hierarchical proportion constraint was introduced, in which first-level constraints were applied to adjacent body parts and second-level constraints were applied to distant body parts, thereby ensuring that the regressed shapes conform to human structural characteristics. Finally, Two-Dimensional (2D) joint information was integrated with the body dimension information for iterative optimization to improve pose estimation accuracy. Experimental results on the 3DPW (Three-Dimensional (3D) Poses in the Wild) dataset show that the proposed method achieves a Per Vertex Error (PVE) of 65.2 mm under occlusion conditions, which is 10.7 mm lower than that of Multi-HMR (Multi-person whole-body Human Mesh Recovery), corresponding to a 14.1% error reduction. Visualization results demonstrate that the proposed method effectively improves the reconstruction accuracy of human shape and pose in complex occlusion scenarios.

YOLO-AirPose: human pose estimation algorithm in UAV aerial view

Qiuyan YIN, Jing DING, Zhigang NIE

2026, 46(6): 1989-1997. DOI: 10.11772/j.issn.1001-9081.2025050663

To address the challenges of background interference, keypoint localization deviation, and target occlusion in Unmanned Aerial Vehicle (UAV) aerial view human pose estimation, an enhanced human pose estimation algorithm named YOLO-AirPose was proposed for non-ground view scenarios. Firstly, a symmetric flip augmentation strategy based on a keypoint topology constraint, named IPSFA (Index-Preserved Symmetric Flip Augmentation), was designed to improve generalization in multi-view scenarios. Secondly, a C2BRA (C2 Bi-level Routing Attention) module was constructed by integrating the BRA (Bi-level Routing Attention) mechanism to replace the original C2PSA (Cross stage Partial with Spatial Attention), thereby enhancing the model’s perception of small-scale targets and occluded keypoints. Thirdly, drawing on the spatial modeling ability of Transformer, an AIFI (Adaptive Interaction Feature Integration) module was embedded into the backbone network and combined with 2D positional encoding to improve keypoint localization performance. Finally, a C3k2-DAttention module based on a deformable attention mechanism was designed to strengthen the network’s global modeling and receptive field adjustment abilities. Experimental results show that, compared with the baseline model YOLO-Pose, YOLO-AirPose improves the precision of object detection by 3.0 percentage points and the precision, recall, and mAP@0.5 of pose estimation by 5.0, 4.6, and 6.8 percentage points, respectively, while maintaining a low computational cost and parameter count. The proposed algorithm thus provides an improved solution to the accuracy limitations of UAV aerial view human pose estimation and enhances adaptability to complex human poses.
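
Flip augmentation for pose data has a well-known pitfall: after mirroring the image, a keypoint labeled "left shoulder" sits on the person's right side unless left and right indices are swapped. The sketch below shows this standard correction; the keypoint layout and pair list are hypothetical, and whether IPSFA works exactly this way is not stated in the abstract.

```python
def flip_keypoints(keypoints, image_width, flip_pairs):
    """Mirror x coordinates, then swap left/right indices so labels stay correct."""
    flipped = [(image_width - 1 - x, y) for x, y in keypoints]
    for left, right in flip_pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped

# Hypothetical layout: index 0 = nose, 1 = left shoulder, 2 = right shoulder.
keypoints = [(50, 10), (40, 20), (60, 20)]
flip_pairs = [(1, 2)]
augmented = flip_keypoints(keypoints, image_width=100, flip_pairs=flip_pairs)
```

In the result, index 1 still refers to the shoulder that is anatomically the left one in the mirrored image, so the skeleton topology stays consistent across augmented samples.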

Maritime ship detection algorithm under complex weather environments based on enhanced YOLOv8

Zhenkai XIONG, Mengjun XU, Yinyin SUN, Xin WANG

2026, 46(6): 1998-2006. DOI: 10.11772/j.issn.1001-9081.2025060723

To address the problems of missed detection and false detection in maritime ship detection under complex weather environments such as rain, fog, and low light, a maritime ship detection algorithm based on enhanced YOLOv8 was proposed. Firstly, a Cross-Granularity Local Global Attention Fusion Block (CGLGAFB) was proposed, in which a refined local and global feature fusion mechanism was constructed and combined with multi-path feature fusion strategies to integrate multi-source feature information from different levels, thereby enhancing the model’s feature fusion capability while suppressing noise interference and information redundancy. Then, the original C2f (Faster Implementation of CSP Bottleneck with 2 convolutions) module was improved into an adaptive mixed C2f module (C2f-Adaptive Mixer Block, C2f-AMB), in which deep convolution branches with adaptive receptive field adjustment capability allowed the model to capture target features of different scales and complex spatial structures more flexibly and efficiently, thereby enhancing feature extraction. Finally, a Multi-scale Spatial Perception Pyramid (MSPP) module was proposed to replace the SPPF (Spatial Pyramid Pooling-Fast) module, in which dilated convolutions with different dilation rates were used to construct multi-scale receptive fields, thereby obtaining comprehensive contextual information and reducing the omission of key information. Experimental results on the augmented dataset SeaShips_aug show that the proposed algorithm achieves an mAP@50 of 84.7% and a recall of 79.3%, which are 2.6 and 3.9 percentage points higher than those of the baseline model YOLOv8, respectively, verifying that the proposed algorithm is better suited to maritime ship detection under complex weather environments.
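To make the idea of building multi-scale receptive fields from dilation rates concrete, the short sketch below applies the standard formula for the effective size of a dilated kernel, k + (k - 1)(d - 1). The kernel size and the dilation rates used here are illustrative assumptions; the abstract does not state which rates the MSPP module uses.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a single dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# Hypothetical dilation rates for parallel 3x3 branches.
effective_sizes = {d: effective_kernel_size(3, d) for d in (1, 2, 3, 5)}
for dilation, size in effective_sizes.items():
    print(f"3x3 kernel, dilation {dilation}: covers {size}x{size} pixels")
```

Each branch keeps the same nine weights per channel, so larger context is obtained without increasing the parameter count of the kernel.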

Wheel hub defect detection method based on perspective correction and lightweight attention mechanism

Shuhao ZHANG, Kunjin HE, Jiachen XU, Heshan SHA, Zhengming CHEN

2026, 46(6): 2007-2015. DOI: 10.11772/j.issn.1001-9081.2025050666

In wheel hub surface defect detection for industrial visual inspection, the geometric distortion caused by shooting angle deviation, together with the small scale and complex morphology of defects, limits the performance of existing detection methods. To address these challenges, a defect detection method combining perspective correction and a lightweight attention mechanism was proposed. Firstly, the offset between the ellipse center and the geometric center of the wheel hub was used to construct a perspective transformation quadrilateral for solving the homography matrix and completing image perspective correction, thereby mitigating the impact of distortion on subsequent feature extraction. Secondly, based on the YOLOv11 model, the conventional CBS (Convolution-BatchNorm-SiLU (Sigmoid Linear Unit)) modules in the backbone and neck were replaced by lightweight Ghost convolutions to reduce the number of parameters and the computational cost of the model. Meanwhile, an Efficient Channel Attention (ECA) mechanism was introduced to enhance the network’s perception of tiny defect regions, yielding the improved model YOLOv11n-GAConv. Experimental results on a self-built wheel hub defect dataset show that the proposed model achieves a mean Average Precision at an Intersection over Union (IoU) threshold of 0.5 between predicted and ground-truth bounding boxes (mAP@0.5) of 84.7%, which is 2.4 percentage points higher than that of YOLOv11n, and a recall of 79.5%, which is 8.6 percentage points higher than that of YOLOv11n. At the same time, the number of parameters and the computational cost of the proposed model are reduced by 12.4% and 11.1%, respectively, compared with those of YOLOv11n. The proposed method thus reduces model complexity while improving detection precision.
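The 0.5 threshold behind mAP@0.5 can be illustrated with a minimal intersection-over-union check. The box format (x1, y1, x2, y2) and the example boxes are assumptions chosen for illustration and are not taken from the wheel hub dataset.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0

ground_truth = (0.0, 0.0, 10.0, 10.0)
close_prediction = (2.0, 0.0, 12.0, 10.0)  # overlaps 80 of 120 units of union
loose_prediction = (5.0, 0.0, 15.0, 10.0)  # overlaps 50 of 150 units of union

for name, box in (("close", close_prediction), ("loose", loose_prediction)):
    score = iou(ground_truth, box)
    print(f"{name}: IoU = {score:.3f}, counts as a match at 0.5: {score >= 0.5}")
```

Under this criterion only the first prediction would count as a true positive, which is why small, hard-to-localize defects weigh heavily on mAP@0.5.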

Probabilistic structural damage identification based on hypersphere ring description

Maozu GUO, Qingyu ZHANG, Lingling ZHAO, Yang DENG

2026, 46(6): 2016-2025. DOI: 10.11772/j.issn.1001-9081.2025050664

Unsupervised thresholding methods for structural damage identification in civil engineering do not require labelled data, yet they suffer from identification inaccuracies near threshold values caused by data uncertainty. To address the false positives and false negatives that such methods produce near threshold values, a probabilistic structural damage identification method based on Deep Support Vector Data Description (Deep-SVDD) and a hypersphere ring description, namely VAEKL-RDDP (Variational AutoEncoder with Kullback-Leibler divergence constrained for hypersphere Ring Data Description Probabilistic damage identification), was proposed. With the Variational AutoEncoder (VAE) as the framework, the method constructed a hypersphere ring using Kullback-Leibler (KL) divergence. Firstly, the VAE was pre-trained to reconstruct structural acceleration responses. Then, KL divergence was introduced to jointly train the pre-trained VAE encoder and the hypersphere ring description method, thereby extracting reliable classification boundaries from the posterior distribution of acceleration data features. Finally, a hypersphere ring was constructed from the classification boundaries, structural damage was identified using this ring, and the data within the ring were evaluated with a cumulative probability density method. Experiments on the real Z24 bridge structure with progressive damage and on a full-scale vibration table test of a wooden pavilion show that VAEKL-RDDP achieves average improvements of 24.9% in accuracy and 36.7% in recall compared with the baseline AutoEncoder (AE) reconstruction-based method; compared with methods such as Deep-SVDD and Imputed Diffusion (ImDiffusion), VAEKL-RDDP achieves average gains of 20.8% and 33.7% in accuracy and recall, respectively, verifying that the proposed method improves structural damage detection performance and reduces missed detections.
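One way to picture a ring-based decision is sketched below, purely as an assumption-laden toy: a sample is scored by its distance to a hypersphere center, samples inside the inner radius are treated as healthy, samples outside the outer radius as damaged, and samples inside the ring receive a probability from an empirical cumulative distribution of reference distances. The radii, the reference distances, and the use of an empirical CDF are all hypothetical; the abstract does not specify how VAEKL-RDDP derives its boundaries or its cumulative probability density.

```python
import math
from bisect import bisect_right

def damage_probability(feature, center, r_inner, r_outer, ring_reference):
    """Toy ring rule: 0 inside the inner radius, 1 outside the outer radius,
    and an empirical CDF value for samples that fall inside the ring."""
    distance = math.dist(feature, center)
    if distance <= r_inner:
        return 0.0
    if distance >= r_outer:
        return 1.0
    ordered = sorted(ring_reference)  # hypothetical reference distances in the ring
    return bisect_right(ordered, distance) / len(ordered)

center = (0.0, 0.0)
reference = [1.1, 1.3, 1.5, 1.7, 1.9]
for feature in ((0.5, 0.0), (1.6, 0.0), (3.0, 0.0)):
    p = damage_probability(feature, center, r_inner=1.0, r_outer=2.0, ring_reference=reference)
    print(feature, "->", p)
```

The point of the ring is that samples near the boundary get a graded probability rather than a hard label, which is where a single threshold tends to produce false positives and false negatives.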

Non-intrusive load monitoring method combining BiLSTM-Transformer and Kolmogorov-Arnold network

Jun QIN, Xintao JIAO, Biqing ZENG

2026, 46(6): 2026-2033. DOI: 10.11772/j.issn.1001-9081.2025060728

To address the shortcomings of existing deep learning-based Non-Intrusive Load Monitoring (NILM) methods in capturing long-term dependencies and complex nonlinear dynamic features, an NILM method combining BiLSTM-Transformer and Kolmogorov-Arnold Network (KAN) was proposed, and a hybrid model, BT-KAN, was constructed. Firstly, the BiLSTM-Transformer module was designed to combine the advantage of the Bidirectional Long Short-Term Memory (BiLSTM) network in modeling bidirectional sequence dependencies with the capability of Transformer in modeling global context, and a multi-head attention mechanism was employed to capture long-term dependencies of power load effectively, thereby improving the disaggregation accuracy for long-cycle appliances. Then, the KAN module was used to capture nonlinear dynamic features of power load signals more accurately through a hierarchical nonlinear mapping mechanism based on the Kolmogorov-Arnold representation theorem, thereby improving the disaggregation accuracy for complex load modes. Experimental results on the REDD (Reference Energy Disaggregation Dataset) and UK-DALE (UK Domestic Appliance-Level Electricity) datasets show that, compared with four similar Transformer-based models, the proposed model reduces the Mean Absolute Error (MAE) by at least 1.6% and 5.5% and improves the F1-score by at least 8.3% and 0.7%. The proposed method thus captures long-term dependencies and nonlinear dynamic features in power load signals more accurately and improves the disaggregation of complex appliance operating modes.
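The two metrics quoted above can be sketched for a single appliance as follows. The power readings and the 15 W on/off threshold are hypothetical values chosen for illustration; the abstract does not state the thresholds or evaluation protocol used for REDD and UK-DALE.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual appliance power."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def f1_on_off(predicted, actual, on_threshold):
    """F1-score of the on/off state obtained by thresholding the power readings."""
    pred_on = [p > on_threshold for p in predicted]
    true_on = [a > on_threshold for a in actual]
    tp = sum(p and t for p, t in zip(pred_on, true_on))
    fp = sum(p and not t for p, t in zip(pred_on, true_on))
    fn = sum(t and not p for p, t in zip(pred_on, true_on))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical appliance power in watts over five time steps.
actual_power = [0.0, 0.0, 100.0, 120.0, 0.0]
predicted_power = [5.0, 0.0, 90.0, 10.0, 0.0]

mae = mean_absolute_error(predicted_power, actual_power)
f1 = f1_on_off(predicted_power, actual_power, on_threshold=15.0)
print(f"MAE = {mae:.1f} W, F1 = {f1:.3f}")
```

MAE measures how far the disaggregated power is from the true consumption, while F1 measures whether the on/off events are detected at all, so a model can improve on one without improving equally on the other.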

Intelligent decision-making method for solar panel cleaning timing based on multi-sensor fusion

Shiyang ZHAO, Yafei WANG

2026, 46(6): 2034-2042. DOI: 10.11772/j.issn.1001-9081.2025060765

To address the decrease in photovoltaic power generation efficiency caused by inaccurate determination of the start-up timing of photovoltaic cleaning robots, and taking into account the power generation fluctuations induced by complex meteorological conditions and solar panel aging, an intelligent decision-making method for solar panel cleaning timing based on Multi-Sensor Fusion (MSF) was proposed to enhance the cleaning efficiency and the photovoltaic power generation efficiency of solar panels. Firstly, multi-source sensor data, such as solar panel output power and ambient temperature, were collected in real time by a microcontroller. Then, a predicted photovoltaic power generation value under the influence of multiple factors was calculated after optimization by a Proportional-Integral-Derivative (PID) algorithm. Finally, by comparing the predicted photovoltaic power generation value with the real-time power generation capacity of the solar panels, intelligent determination of the cleaning timing and automatic start-up control of the photovoltaic cleaning robot were implemented using data-level and decision-level fusion techniques. Experimental results indicate that the MSF-based method performs well across different test scenarios. Taking the scenario of sunny weather with soiled solar panels, which best reflects the value of cleaning, as an example, the method achieves a determination accuracy of 96%. Compared with the decision-making method using the ResNet50-CA model, the MSF-based method achieves a relative accuracy improvement of 4.35% in this scenario. Furthermore, in the same scenario, the MSF-based method shows more significant advantages over the decision-making methods based on the K-Nearest Neighbors (KNN) algorithm, the Random Forest (RF) model, and the Kalman Filter (KF), with accuracy improvements of 35.21%, 45.45%, and 10.34%, respectively. The proposed method thus effectively enhances the timeliness and accuracy of cleaning operations, providing a reliable technical solution for maintaining high efficiency of photovoltaic power generation systems under complex meteorological conditions and equipment aging states.
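If all four reported improvements are read as relative gains over each baseline's accuracy (the abstract states this explicitly only for ResNet50-CA), the baseline accuracies they imply can be back-calculated from the 96% figure. The sketch below does that arithmetic; the resulting baseline values are derived here for illustration and are not numbers reported in the abstract.

```python
def implied_baseline(new_accuracy, relative_gain_percent):
    """Baseline accuracy implied by new = baseline * (1 + gain / 100)."""
    return new_accuracy / (1.0 + relative_gain_percent / 100.0)

msf_accuracy = 96.0  # reported accuracy in the sunny, soiled-panel scenario (%)
reported_gains = {"ResNet50-CA": 4.35, "KNN": 35.21, "RF": 45.45, "KF": 10.34}

implied = {name: round(implied_baseline(msf_accuracy, gain), 1)
           for name, gain in reported_gains.items()}
for name, accuracy in implied.items():
    print(f"{name}: implied baseline accuracy of about {accuracy}%")
```

Under this reading, a 4.35% relative improvement corresponds to a gap of about 4 percentage points, which is worth keeping in mind when comparing these figures with results elsewhere that are stated in percentage points.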

Table of Contents