Journal of Computer Applications

Multimodal knowledge graph representation learning: a review

Chunlei WANG, Xiao WANG, Kai LIU

2024, 44(1): 1-15. DOI: 10.11772/j.issn.1001-9081.2023050583

A comprehensive comparison of traditional knowledge graph representation learning models, covering their advantages, disadvantages, and applicable tasks, shows that a traditional single-modal knowledge graph cannot represent knowledge well. How to use multimodal data such as text, image, video, and audio for knowledge graph representation learning has therefore become an important research direction. The commonly used multimodal knowledge graph datasets were analyzed in detail to provide data support for researchers. On this basis, knowledge graph representation learning models under multimodal fusion of text, image, video, and audio were discussed, and the various models were summarized and compared. Finally, the effect of multimodal knowledge graph representation on classical applications, including knowledge graph completion, question answering systems, multimodal generation, and recommendation systems, was summarized, and future research directions were outlined.

Image text retrieval method based on feature enhancement and semantic correlation matching

Jia CHEN, Hong ZHANG

2024, 44(1): 16-23. DOI: 10.11772/j.issn.1001-9081.2023060766

To achieve precise semantic correlation between image and text, an image text retrieval method based on Feature Enhancement and Semantic Correlation Matching (FESCM) was proposed. Firstly, in the feature enhancement representation module, a multi-head self-attention mechanism was introduced to enhance image region features and text word features, reducing the interference of redundant information with the alignment of image regions and text words. Secondly, the semantic correlation matching module not only captured the correspondence between locally significant objects by local matching, but also incorporated image background information into the global image features and achieved accurate global semantic correlation by global matching. Finally, the local matching scores and global matching scores were used together to obtain the final matching scores of images and texts. Experimental results show that the FESCM-based method improves the recall sum over the extended visual semantic embedding method by 5.7 and 7.5 percentage points on the Flickr8k and Flickr30k benchmark datasets, respectively, and improves the recall sum by 3.7 percentage points over the Two-Stream Hierarchical Similarity Reasoning method on the MS-COCO dataset. The proposed method can effectively improve the accuracy of image text retrieval and realize the semantic connection between image and text.

Deep bi-modal source domain symmetrical transfer learning for cross-modal retrieval

Qiujie LIU, Yuan WAN, Jie WU

2024, 44(1): 24-31. DOI: 10.11772/j.issn.1001-9081.2023010047

Cross-modal retrieval based on deep networks often faces the challenge of insufficient cross-modal training data, which limits the training effect and easily leads to over-fitting. Transfer learning is an effective way to solve the problem of insufficient training data: it learns from the training data in the source domain and transfers the acquired knowledge to the target domain. However, most existing transfer learning methods focus on transferring knowledge from a single-modal (such as image) source domain to a cross-modal (such as image and text) target domain. If the source domain contains multiple modalities, this asymmetric transfer ignores the potential inter-modal semantic information in the source domain. In addition, the similarity between the same modalities in the source and target domains cannot be well exploited to reduce the domain difference. Therefore, a Deep Bi-modal source domain Symmetrical Transfer Learning for cross-modal retrieval (DBSTL) method was proposed. The purpose of this method is to realize knowledge transfer from a bi-modal source domain to a multi-modal target domain and to obtain a common representation of cross-modal data. DBSTL consists of a modal symmetric transfer subnet and a semantic consistency learning subnet. With a hybrid symmetric structure adopted in the modal symmetric transfer subnet, the information between modalities became more consistent and the difference between the source domain and the target domain was reduced. In the semantic consistency learning subnet, all modalities shared the same common representation layer, and cross-modal semantic consistency was ensured under the guidance of the supervision information of the target domain. Experimental results show that on the Pascal, NUS-WIDE-10k and Wikipedia datasets, the mean Average Precision (mAP) of the proposed method is improved by about 8.4, 0.4 and 1.2 percentage points, respectively, compared with the best result obtained by the comparison methods. DBSTL makes full use of the potential information of the bi-modal source domain to conduct symmetric transfer learning, ensures semantic consistency between modalities under the guidance of the supervision information, and improves the similarity of image and text distributions in the common representation space.

Multi-modal dialog reply retrieval based on contrast learning and GIF tag

Yirui HUANG, Junwei LUO, Jingqiang CHEN

2024, 44(1): 32-38. DOI: 10.11772/j.issn.1001-9081.2022081260

GIF (Graphics Interchange Format) images are frequently used as responses to posts on social media platforms, but many approaches do not make good use of the GIF tag information available on social media when choosing an appropriate GIF to reply to a post. A Multi-Modal Dialog reply retrieval based on Contrast learning and GIF Tag (CoTa-MMD) approach was proposed, in which the tag information was integrated into the retrieval process. Specifically, the tags were used as intermediate variables, so that text-to-GIF retrieval was converted into text-to-tag-to-GIF retrieval. The modal representations were then learned by a contrastive learning algorithm, and the retrieval probability was calculated with the law of total probability. Compared to direct text image retrieval, the introduction of transition tags reduced the retrieval difficulties caused by the heterogeneity of different modalities. Experimental results show that the CoTa-MMD model improves the recall sum of the text image retrieval task by 0.33 and 4.21 percentage points compared to the DSCMR (Deep Supervised Cross-Modal Retrieval) model on the PEPE-56 multimodal dialogue dataset and the Taiwan multimodal dialogue dataset, respectively.
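
The tag-mediated retrieval step in the CoTa-MMD abstract can be illustrated with a small sketch. It assumes that the GIF is conditionally independent of the text given the tag, so that P(gif | text) is the sum over tags of P(gif | tag) · P(tag | text). The function name, the tags, and the toy probabilities are hypothetical and are not taken from the paper.

```python
def retrieval_probability(p_tag_given_text, p_gif_given_tag):
    """Marginalize over tags: P(gif | text) = sum_tag P(gif | tag) * P(tag | text)."""
    scores = {}
    for tag, p_tag in p_tag_given_text.items():
        for gif, p_gif in p_gif_given_tag.get(tag, {}).items():
            scores[gif] = scores.get(gif, 0.0) + p_tag * p_gif
    return scores


# Hypothetical toy distributions for one post (illustrative values only).
p_tag_given_text = {"happy": 0.7, "sad": 0.3}
p_gif_given_tag = {
    "happy": {"gif_a": 0.9, "gif_b": 0.1},
    "sad": {"gif_b": 0.6, "gif_c": 0.4},
}

scores = retrieval_probability(p_tag_given_text, p_gif_given_tag)
best_gif = max(scores, key=scores.get)  # GIF with the highest retrieval probability
```

In a real system the two conditional distributions would presumably come from similarity scores between learned representations; here they are fixed by hand to keep the arithmetic visible.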

Multi-channel multi-step integration model for generative visual dialogue

Sihang CHEN, Aiwen JIANG, Zhaoyang CUI, Mingwen WANG

2024, 44(1): 39-46. DOI: 10.11772/j.issn.1001-9081.2023010055

Visual dialogue has made significant progress in multimodal information fusion and inference. However, mainstream models remain limited when answering questions that involve relatively clear semantic attributes and spatial relationships. Relatively few mainstream models explicitly provide a fine-grained semantic representation of the image content before responding, so the necessary bridge across the semantic gap between visual feature representation and text semantics, such as dialogue history and the current question, is missing. Therefore, a visual dialogue model based on Multi-Channel and Multi-step Integration (MCMI) was proposed to explicitly provide a set of fine-grained semantic descriptions of the visual content. Through the interactions and multi-step integration among vision, semantics and dialogue history, the semantic representation of questions was enriched and more accurate decoded answers were achieved. On the VisDial v0.9 and VisDial v1.0 datasets, compared to the Dual-channel Multi-hop Reasoning Model (DMRM), the proposed MCMI model improved Mean Reciprocal Rank (MRR) by 1.95 and 2.12 percentage points, recall rate (R@1) by 2.62 and 3.09 percentage points, and mean rank of correct answers (Mean) by 0.88 and 0.99, respectively. On the VisDial v1.0 dataset, compared to the latest Unified Transformer Contrastive learning model (UTC), the MCMI model improved MRR, R@1, and Mean by 0.06 percentage points, 0.68 percentage points, and 1.47, respectively. To further evaluate the quality of generated dialogue, two subjective indicators were proposed: the Turing-test passing proportion M1 and the dialogue quality score M2 (five-point scale). Compared with the baseline model DMRM on the VisDial v0.9 dataset, the MCMI model improved M1 by 9.00 percentage points and M2 by 0.70.
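
The three ranking metrics reported for MCMI follow standard definitions and can be computed from the rank of the correct answer in each dialogue round. The sketch below uses made-up ranks and a hypothetical function name. Note that a lower mean rank is better, so an "improvement" in Mean presumably denotes a decrease.

```python
def ranking_metrics(ranks):
    """Return (MRR, R@1, mean rank) for 1-based ranks of the correct answers."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n            # mean reciprocal rank
    recall_at_1 = sum(1 for r in ranks if r <= 1) / n  # share of answers ranked first
    mean_rank = sum(ranks) / n                        # lower is better
    return mrr, recall_at_1, mean_rank


# Illustrative ranks for four questions (not data from the paper).
mrr, recall_at_1, mean_rank = ranking_metrics([1, 2, 4, 1])
```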

Video dynamic scene graph generation model based on multi-scale spatial-temporal Transformer

Jia WANG-ZHU, Zhou YU, Jun YU, Jianping FAN

2024, 44(1): 47-57. DOI: 10.11772/j.issn.1001-9081.2023060861

To address the challenge of dynamic changes in object relationships over time in videos, a video dynamic scene graph generation model based on a multi-scale spatial-temporal Transformer was proposed. The multi-scale modeling idea was introduced into the classic Transformer architecture to precisely model dynamic fine-grained semantics in videos. First, in the spatial dimension, attention was given both to the global spatial correlations of objects, as in traditional models, and to the local spatial correlations among objects' relative positions, which facilitated a better understanding of interactive dynamics between people and objects and led to more accurate semantic analysis results. Then, in the temporal dimension, not only were the traditional short-term temporal correlations of objects in videos modeled, but the long-term temporal correlations of the same object pairs throughout the entire video were also emphasized. Comprehensive modeling of long-term relationships between objects helped generate more accurate and coherent scene graphs, mitigating issues arising from occlusions and overlaps during scene graph generation. Finally, through the collaboration of the spatial encoder and the temporal encoder, dynamic fine-grained semantics in videos were captured more accurately by the model, avoiding limitations inherent in traditional single-scale approaches. Experimental results show that, compared to the baseline model STTran, the proposed model achieves increases of 5.0, 2.8, and 2.9 percentage points in Recall@10 for the tasks of predicate classification, scene graph classification, and scene graph detection, respectively, on the Action Genome benchmark dataset. This demonstrates that the multi-scale modeling concept can enhance precision and effectively boost performance in dynamic video scene graph generation tasks.

Scene graph-aware cross-modal image captioning model

Zhiping ZHU, Yan YANG, Jie WANG

2024, 44(1): 58-64. DOI: 10.11772/j.issn.1001-9081.2022071109

To address the forgetting and underutilization of textual information in image captioning methods, a Scene Graph-aware Cross-modal Network (SGC-Net) was proposed. Firstly, the scene graph was used as the image's visual features, and a Graph Convolutional Network (GCN) was used for feature fusion, so that the visual and textual features lay in the same feature space. Then, the text sequence generated by the model was stored, and the corresponding position information was added to form the textual features of the image, which addressed the loss of text features caused by the single-layer Long Short-Term Memory (LSTM) network. Finally, to address over-dependence on image information and underuse of text information, the self-attention mechanism was used to extract significant image information and text information and fuse them. Experimental results on the Flickr30K and MS-COCO (Microsoft Common Objects in COntext) datasets demonstrate that SGC-Net outperforms Sub-GC on the indicators BLEU1 (BiLingual Evaluation Understudy with 1-gram), BLEU4 (BiLingual Evaluation Understudy with 4-grams), METEOR (Metric for Evaluation of Translation with Explicit ORdering), ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation), with improvements of 1.1, 0.9, 0.3, 0.7, and 0.4 on Flickr30K and 0.3, 0.1, 0.3, 0.5, and 0.6 on MS-COCO, respectively. The method used by SGC-Net can thus effectively improve the model's image captioning performance and the fluency of the generated descriptions.

Multi-modal summarization model based on semantic relevance analysis

Yuxiang LIN, Yunbing WU, Aiying YIN, Xiangwen LIAO

2024, 44(1): 65-72. DOI: 10.11772/j.issn.1001-9081.2022101527

Multi-modal abstractive summarization is commonly based on the Sequence-to-Sequence (Seq2Seq) framework, whose objective function optimizes the model at the character level: it searches for locally optimal results when generating words and ignores the global semantic information of the summary samples. This may cause semantic deviation between the summary and the multimodal information, resulting in factual errors. To solve these problems, a multi-modal summarization model based on semantic relevance analysis was proposed. Firstly, a summary generator based on the Seq2Seq framework was trained to generate semantically diverse candidate summaries. Secondly, a summary evaluator based on semantic relevance analysis was applied to learn, from a global perspective, the semantic differences among candidate summaries and the evaluation mode of ROUGE (Recall-Oriented Understudy for Gisting Evaluation), so that the model could be optimized at the level of summary samples. Finally, the summary evaluator was used to carry out reference-free evaluation of the candidate summaries, making the finally selected summary as similar as possible to the source text in semantic space. Experiments on the benchmark dataset MMSS show that the proposed model improves ROUGE-1, ROUGE-2 and ROUGE-L by 3.17, 1.21 and 2.24 percentage points, respectively, compared with the current optimal MPMSE (Multimodal Pointer-generator via Multimodal Selective Encoding) model.
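
For readers unfamiliar with the ROUGE family used in this abstract, the following is a minimal sketch of the recall form of ROUGE-N: the fraction of reference n-grams that also appear in the candidate, with clipped counts. Whitespace tokenization and the example sentences are assumptions made for illustration. Published results are usually produced with a standard ROUGE toolkit and often report F-scores rather than plain recall.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Illustrative sentences (not from the MMSS dataset).
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
rouge_1 = rouge_n_recall(candidate, reference, n=1)
rouge_2 = rouge_n_recall(candidate, reference, n=2)
```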

Product summarization extraction model with multimodal information fusion

Qiang ZHAO, Zhongqing WANG, Hongling WANG

2024, 44(1): 73-78. DOI: 10.11772/j.issn.1001-9081.2022121910

On online shopping platforms, concise, authentic and effective product summarizations are crucial to improving the shopping experience. In addition, online shoppers cannot touch the actual product, so the product image carries important visual information beyond the product text description; product summarization that fuses multimodal information, including product text and product image, is therefore of great significance for online shopping. To fuse product text descriptions and product images, a product summarization extraction model with multimodal information fusion was proposed. Unlike the general product summarization task, whose input contains only the product text description, the proposed model introduces the product image as an additional source of information to make the extracted summary richer. Specifically, pre-trained models were first used to represent the product text description and the product image: a text feature representation of each sentence was extracted from the product text description, and an overall visual feature representation of the product was extracted from the product image. Then, a low-rank tensor-based multimodal fusion method was used to fuse the text features and the overall visual features into a multimodal feature representation for each sentence. Finally, the multimodal feature representations of all sentences were fed into the summary generator to generate the final product summarization. Comparative experiments were conducted on the CEPSUM 2.0 (Chinese E-commerce Product SUMmarization 2.0) dataset. On the three subsets of CEPSUM 2.0, the average ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation 1) of the proposed model is 3.12 percentage points higher than that of TextRank and 1.75 percentage points higher than that of BERTSUMExt (BERT SUMmarization Extractive). Experimental results show that the proposed model is effective in fusing product text and image information and performs well on the ROUGE evaluation index.

Multi-dynamic aware network for unaligned multimodal language sequence sentiment analysis

Junhao LUO, Yan ZHU

2024, 44(1): 79-85. DOI: 10.11772/j.issn.1001-9081.2023060815

To address the lack of interpretability of the word alignment methods commonly used in existing approaches to aligned multimodal language sequence sentiment analysis, a Multi-Dynamic Aware Network (MultiDAN) for unaligned multimodal language sequence sentiment analysis was proposed. The core of MultiDAN is multi-layer and multi-angle extraction of dynamics. Firstly, a Recurrent Neural Network (RNN) and an attention mechanism were used to capture the dynamics within the modalities; secondly, intra- and inter-modal, long- and short-term dynamics were extracted at once using a Graph Attention neTwork (GAT); finally, the intra- and inter-modal dynamics of the nodes in the graph were extracted again using a special graph readout method to obtain a unique representation of the multimodal language sequence, and the sentiment score of the sequence was obtained by applying a MultiLayer Perceptron (MLP) classifier. Experimental results on two commonly used public datasets, CMU-MOSI and CMU-MOSEI, show that MultiDAN can fully extract the dynamics: its F1 values on the two unaligned datasets improve by 0.49 and 0.72 percentage points, respectively, over the Modal-Temporal Attention Graph (MTAG), the best of the comparison methods, with high stability. MultiDAN can improve the performance of sentiment analysis for multimodal language sequences, and the Graph Neural Network (GNN) can effectively extract intra- and inter-modal dynamics.

Emotion recognition model based on hybrid-mel gama frequency cross-attention transformer modal

Mu LI, Yuheng YANG, Xizheng KE

2024, 44(1): 86-93. DOI: 10.11772/j.issn.1001-9081.2023060753

An emotion recognition model based on Hybrid-Mel Gama Frequency Cross-attention Transformer modal (H-MGFCT) was proposed to address two issues in multimodal sentiment analysis: effectively mining single-modal representation information and achieving full fusion of multimodal information. Firstly, the Hybrid-Mel Gama Frequency Cepstral Coefficient (H-MGFCC) was obtained by fusing the Mel Frequency Cepstral Coefficient (MFCC) and the Gammatone Frequency Cepstral Coefficient (GFCC), as well as their first-order dynamic features, to solve the problem of speech emotional feature loss; secondly, a cross-modal prediction model based on attention weights was used to filter out the text features more relevant to the speech features; subsequently, a Cross Self-Attention Transformer (CSA-Transformer) incorporating contrastive learning was used to fuse the highly correlated cross-modal information of the text features and the speech emotional features; finally, the cross-modal features containing text and speech were fused with the selected low-correlation text features to supplement the information. Experimental results show that the proposed model improves the accuracy by 2.83, 2.64, and 3.05 percentage points compared to the weighted Decision Level Fusion Text-audio (DLFT) model on the publicly available IEMOCAP (Interactive EMotional dyadic MOtion CAPture), CMU-MOSI (CMU-Multimodal Opinion Sentiment Intensity), and CMU-MOSEI (CMU-Multimodal Opinion Sentiment Emotion Intensity) datasets, respectively, verifying the effectiveness of the model for emotion recognition.

Adversarial training method with adaptive attack strength

Tong CHEN, Jiwei WEI, Shiyuan HE, Jingkuan SONG, Yang YANG

2024, 44(1): 94-100. DOI: 10.11772/j.issn.1001-9081.2023060854

The vulnerability of deep neural networks to adversarial attacks has raised significant concerns about the security and reliability of artificial intelligence systems. Adversarial training is an effective approach to enhancing adversarial robustness. To address the issue that existing methods adopt fixed adversarial sample generation strategies and neglect the importance of the adversarial sample generation phase for adversarial training, an adversarial training method based on adaptive attack strength was proposed. Firstly, the clean sample and the adversarial sample were input into the model to obtain the outputs. Then, the difference between the model outputs for the clean sample and the adversarial sample was calculated. Finally, the change of this difference compared with the previous moment was measured to automatically adjust the strength of the adversarial sample. Comprehensive experimental results on three benchmark datasets demonstrate that, compared with the baseline method Adversarial Training with Projected Gradient Descent (PGD-AT), the proposed method improves the robust accuracy under AutoAttack (AA) by 1.92, 1.50 and 3.35 percentage points, respectively, and that it outperforms the state-of-the-art defense method Adversarial Training with Learnable Attack Strategy (LAS-AT) in terms of robustness and natural accuracy. Furthermore, from the perspective of data augmentation, the proposed method can effectively address the problem of diminishing augmentation effect during adversarial training.
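
The three-step adjustment loop described in this abstract can be sketched as follows. The abstract does not give the difference measure or the update rule, so everything below is an assumption made for illustration: the difference is a mean absolute gap between output vectors, the strength grows when the gap shrinks relative to the previous step and decreases when it widens, and all names, step sizes and bounds are hypothetical.

```python
def output_difference(clean_out, adv_out):
    """Mean absolute gap between clean and adversarial outputs (assumed measure)."""
    return sum(abs(c - a) for c, a in zip(clean_out, adv_out)) / len(clean_out)


def adjust_strength(eps, prev_diff, curr_diff, step=0.001, eps_min=0.0, eps_max=0.05):
    """Assumed rule: strengthen the attack when the gap shrinks, weaken it when it grows."""
    if curr_diff < prev_diff:
        eps += step   # model copes better with the current strength
    elif curr_diff > prev_diff:
        eps -= step   # current strength already perturbs the outputs more
    return min(max(eps, eps_min), eps_max)  # keep the strength within bounds


# Illustrative values only: two consecutive training steps.
prev_diff = output_difference([0.9, 0.1], [0.4, 0.6])  # gap at the previous step
curr_diff = output_difference([0.9, 0.1], [0.6, 0.4])  # gap at the current step
new_eps = adjust_strength(0.01, prev_diff, curr_diff)
```

The direction of the update is a design choice in this sketch; the paper may well use a different criterion or a continuous scaling.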

Selective generation method of test cases for Chinese text error correction software

Chenghao FENG, Zhenping XIE, Bowen DING

2024, 44(1): 101-112. DOI: 10.11772/j.issn.1001-9081.2023010080

To address the lack of an effective method for generating test cases for Chinese text error correction software, and to measure and optimize the correction performance of such software, a multi-user engineering-oriented method was designed, called Selective Generation Method of Test cases for Chinese text error Correction Software (SGMT-CCS). Two criteria that users can choose from were defined for evaluating test cases: error quantity density and error type density. SGMT-CCS consists of three modules: a test case automatic generation module, a test case selection module, and a test case priority sorting module. Users can: 1) customize the minimum error interval and the size of the test case set during the automated generation of test cases; 2) customize the minimum error interval and the expected value during the selection process; 3) select different criteria for evaluating and prioritizing test cases to meet the requirements of different datasets. Experimental results show that SGMT-CCS can generate effective test cases in a short period of time. The selection module satisfies the user's customized goals under simulated requirements, and the priority sorting module improves test efficiency over the unsorted order in different time periods under different evaluation criteria.

Video prediction model combining involution and convolution operators

Junhong ZHU, Junyu LAI, Lianqiang GAN, Zhiyong CHEN, Huashuo LIU, Guoyao XU

2024, 44(1): 113-122. DOI: 10.11772/j.issn.1001-9081.2023060853

To address the inadequate feature extraction from data space and the low prediction accuracy of traditional deep learning based video prediction, a video prediction model Combining Involution and Convolution Operators (CICO) was proposed. The model enhanced video prediction performance in three ways. Firstly, convolutions with varying kernel sizes were adopted to enhance the extraction of multi-granularity spatial features and enable multi-angle representation learning of targets. In particular, larger kernels were applied to extract features from broader spatial ranges, while smaller kernels were employed to capture motion details more precisely. Secondly, large-kernel convolutions were replaced by computationally efficient involution operators with fewer parameters in order to achieve efficient inter-channel interaction, avoid redundant parameters, and decrease computational and storage costs, while enhancing the predictive capacity of the model. Finally, convolutions with kernel size 1×1 were introduced for linear mapping to strengthen the joint expression of distinct features, improve parameter utilization efficiency, and strengthen prediction robustness. The proposed model was validated through comprehensive experiments on various datasets and achieved significant improvements over the state-of-the-art SimVP (Simpler yet Better Video Prediction) model. On the Moving MNIST dataset, the Mean Squared Error (MSE) and Mean Absolute Error (MAE) were reduced by 25.2% and 17.4%, respectively. On the Traffic Beijing dataset, the MSE was reduced by 1.2%. On the KTH dataset, the Structure Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) were improved by 0.66% and 0.47%, respectively. The proposed model is therefore effective in improving the accuracy of video prediction.

Acoustic word embedding model based on Bi-LSTM and convolutional-Transformer

Yunyun GAO, Lasheng ZHAO, Qiang ZHANG

2024, 44(1): 123-128. DOI: 10.11772/j.issn.1001-9081.2023010062

In Query-by-Example Spoken Term Detection (QbE-STD), the speech information captured by an Acoustic Word Embedding (AWE) extracted with a Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) is limited. To better represent speech content and improve model performance, an acoustic word embedding model based on Bi-directional Long Short-Term Memory (Bi-LSTM) and convolutional-Transformer was proposed. Firstly, Bi-LSTM was used to extract features and model speech sequences, and the learning ability of the model was improved by stacking. Secondly, to learn local information while capturing global information, a CNN and a Transformer encoder were connected in parallel to form the convolutional-Transformer, which took full advantage of both in feature extraction to aggregate more effective information and improve the discrimination of the embeddings. Under the constraint of contrastive loss, the Average Precision (AP) of the proposed model reaches 94.36%, which is 1.76% higher than that of the attention-based Bi-LSTM model. Experimental results show that the proposed model can effectively improve model performance and better perform QbE-STD.

Self-distillation object segmentation method via scale-attention knowledge transfer

Xiaobing WANG, Xiongwei ZHANG, Tieyong CAO, Yunfei ZHENG, Yong WANG

2024, 44(1): 129-137. DOI: 10.11772/j.issn.1001-9081.2023010075

It is difficult for current object segmentation models to reach a good balance between segmentation performance and inference efficiency. To address this challenge, a self-distillation object segmentation method via scale-attention knowledge transfer was proposed. Firstly, an object segmentation network using only backbone features was constructed as the inference network, to achieve an efficient forward inference process. Secondly, a self-distillation learning model via scale-attention knowledge was proposed. On the one hand, a scale-attention pyramid feature module was designed to adaptively capture context information at different semantic levels and extract more discriminative self-distillation knowledge. On the other hand, a distillation loss was constructed by fusing cross entropy, KL (Kullback-Leibler) divergence and L2 distance, which drove the distillation knowledge to transfer into the segmentation network efficiently and improved its generalization performance. The method was verified on five public object segmentation datasets, including COD (Camouflaged Object Detection), DUT-O (Dalian University of Technology-OMRON) and SOC (Salient Objects in Clutter). With the proposed inference network as the baseline, the proposed self-distillation model increases the segmentation performance by 3.01% on the F_β metric, which is 1.00% higher than that of the Teacher-Free (TF) self-distillation model; compared with the recent Residual learning Net (R2Net), the proposed object segmentation network reduces the number of parameters by 2.33×10^6, improves the inference frame rate by 2.53%, decreases the floating-point operations by 40.50%, and increases segmentation performance by 0.51%. Experimental results show that the proposed self-distillation segmentation method can balance performance and efficiency, and is suitable for scenarios with limited computing and storage resources.
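
A distillation loss that fuses cross entropy, KL divergence and L2 distance, as this abstract describes, might be combined as a weighted sum. The sketch below works on a single prediction vector rather than per-pixel segmentation maps. The weights, the direction of the KL term (teacher to student), and the choice of applying the L2 term to feature vectors are assumptions, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l2_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distillation_loss(student_logits, teacher_logits, label,
                      student_feat, teacher_feat,
                      w_ce=1.0, w_kl=1.0, w_l2=1.0):
    """Weighted sum of the three terms (weights are illustrative assumptions)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    return (w_ce * cross_entropy(p_student, label)
            + w_kl * kl_divergence(p_teacher, p_student)
            + w_l2 * l2_distance(student_feat, teacher_feat))


# When student and teacher agree exactly, only the cross-entropy term remains.
loss_same = distillation_loss([0.0, 0.0], [0.0, 0.0], 0, [1.0, 2.0], [1.0, 2.0])
loss_diff = distillation_loss([0.0, 0.0], [2.0, 0.0], 0, [1.0, 2.0], [4.0, 6.0])
```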

Commonsense reasoning and question answering method with three-dimensional semantic features

Hongbin WANG, Xiao FANG, Hong JIANG

2024, 44(1): 138-144. DOI: 10.11772/j.issn.1001-9081.2023010063

Existing commonsense question answering methods based on pre-trained language models and knowledge graphs mainly focus on constructing knowledge graph subgraphs and combining cross-modal information. They ignore the rich semantic features of the knowledge graph itself and lack dynamic adjustment of the correlation among subgraph nodes for different question answering tasks, so they do not achieve satisfactory prediction accuracy. To solve these problems, a commonsense reasoning and question answering method integrating three-dimensional semantic features was proposed. Firstly, quantitative indicators of three-dimensional semantic features at the relation level, entity level and triple level were proposed for knowledge graph nodes. Secondly, the importance of the semantic features of these three dimensions to different entity nodes was dynamically calculated through an attention mechanism. Finally, multi-layer aggregation and iterative embedding of the three-dimensional semantic features was carried out through a graph neural network, to obtain more extrapolated knowledge representations, update the subgraph node representations of the knowledge graph, and improve the accuracy of answer prediction. Compared with the QA-GNN commonsense question answering and reasoning method, the accuracy of the proposed method on the validation set and test set of the CommonsenseQA dataset was improved by 1.70 and 0.74 percentage points, respectively, and the accuracy of the proposed method with the AristoRoBERTa data processing method on the OpenBookQA dataset was improved by 1.13 percentage points. Experimental results show that the proposed method can effectively improve the accuracy of commonsense question answering tasks.

Text sentiment analysis model based on individual bias information

Li’an CHEN, Yi GUO

2024, 44(1): 145-151. DOI: 10.11772/j.issn.1001-9081.2023010103

Current text sentiment analysis often focuses on the comment text itself but ignores the individual bias information between commenters and commentees， which has a considerable impact on the overall sentiment analysis. To address this， a text sentiment analysis model based on individual bias information， named UP-ATL （User and Product-Attention TranLSTM）， was proposed. In the model， self-attention mechanism and cross-attention mechanism were used to fuse the comment text and individual bias information in both directions. During the fusion process， a customized weight calculation method was used to alleviate the data sparsity problem caused by cold start in practical application scenarios. Finally， the fully fused features of the comment text and the bilateral representation information of the comment were obtained. Three real public datasets， Yelp2013， Yelp2014， and IMDB， were selected for effectiveness verification in the restaurant and film fields. The proposed model was compared with benchmark models such as UPNN （User Product Neural Network）， NSC （Neural Sentiment Classification）， CMA （Cascading Multiway Attention） and HUAPA （Hierarchical User And Product multi-head Attention）. The experimental results show that compared to the previous best performing HUAPA model， the accuracy of UP-ATL increases by 6.9 percentage points， 5.9 percentage points， and 1.6 percentage points on the three datasets， respectively.

Domain-specific language for natural disaster risk map generation of immovable cultural heritage

Yihan HU, Jinlian DU, Hang SU, Hongyu GAO

2024, 44(1): 152-158. DOI: 10.11772/j.issn.1001-9081.2023010102

To address the problem that requirements for risk map generation of immovable cultural heritage are growing rapidly and changing frequently while existing programs and tools cannot meet the needs of practical applications， a method for constructing a semantic model was proposed. Based on the semantic model， a Domain-Specific Language （DSL） close to natural language was designed for experts in the field of immovable cultural heritage. Firstly， a business model was extracted by conducting in-depth research on various indicators of immovable cultural heritage， as well as methods and processes for generating risk maps. Secondly， the meta-calculation units of the risk value calculation rules were abstracted， and a semantic model was constructed by analyzing the business model. On this basis， a DSL that can express all semantics in the semantic model was designed. Scripts in the language can be written by the field experts themselves and used to generate risk maps quickly and efficiently. The language is easy to extend and can accommodate frequently changing requirements. Compared with the mainstream method of generating risk maps by using Geographic Information System （GIS）， the use of DSL to generate risk maps can reduce work hours by more than 66.7%.

Multi-task learning model for charge prediction with action words

Xiao GUO, Yanping CHEN, Ruixue TANG, Ruizhang HUANG, Yongbin QIN

2024, 44(1): 159-166. DOI: 10.11772/j.issn.1001-9081.2023010029

With the application of artificial intelligence technology in the judicial field， charge prediction based on case description has become an important research content. It aims at predicting the charges according to the case description. The terms of case contents are professional， and the description is concise and rigorous. However， the existing methods often rely on text features， but ignore the difference of relevant elements and lack effective utilization of elements of action words in diverse cases. To solve the above problems， a multi-task learning model of charge prediction based on action words was proposed. Firstly， the spans of action words were generated by boundary identifier， and then the core contents of the case were extracted. Secondly， the subordinate charge was predicted by constructing the structure features of action words. Finally， identification of action words and charge prediction were uniformly modeled， which enhanced the generalization of the model by sharing parameters. A multi-task dataset with action word identification and charge prediction was constructed for model verification. The experimental results show that the proposed model achieves the F value of 83.27% for action word identification task， and the F value of 84.29% for charge prediction task； compared with BERT-CNN， the F value respectively increases by 0.57% and 2.61%， which verifies the advantage of the proposed model in identification of action words and charge prediction.

Incomplete instance guided aeroengine blade instance segmentation

Rui HUANG, Chaoqun ZHANG, Xuyi CHENG, Yan XING, Bao ZHANG

2024, 44(1): 167-174. DOI: 10.11772/j.issn.1001-9081.2023010037

The current deep learning based instance segmentation methods cannot fully train the network model and result in sub-optimal segmentation results due to the lack of labeled engine blade data. To improve the precision of aeroengine blade instance segmentation， an aeroengine blade instance segmentation method based on incomplete instance guidance was proposed. Combining with an existing instance segmentation method and an interactive segmentation method， promising aeroengine blade instance segmentation results were obtained. First， a small amount of labeled data was used to train the instance segmentation network， which generated initial instance segmentation results of aeroengine blades. Secondly， the detected single blade instance was divided into foreground and background. By selecting foreground seed points and background seed points， the interactive segmentation method was used to generate complete segmentation results of the blade. After all the blade instances were processed in turn， the final segmentation result of engine blade instance was obtained by merging the results. All the 72 images were used to train the Sparse Instance activation map for real-time instance segmentation （SparseInst）， to produce the initial instance segmentation results. The testing dataset contained 56 images. The mean Average Precision （mAP） of the proposed method is higher than that of SparseInst by 5.1 percentage points. The mAP results of the proposed method are better than those of the state-of-the-art instance segmentation methods， e.g.， MASK R-CNN （Mask Region based Convolutional Neural Network）， YOLACT （You Only Look At CoefficienTs）， BMASK-RCNN （Boundary-preserving MASK R-CNN）.

Anomaly detection method for skeletal X-ray images based on self-supervised feature extraction

Yuning ZHANG, Abudukelimu ABULIZI, Tisheng MEI, Chun XU, Maierdana MAIMAITIREYIMU, Halidanmu ABUDUKELIMU, Yutao HOU

2024, 44(1): 175-181. DOI: 10.11772/j.issn.1001-9081.2023010002

In order to explore the feasibility of a self-supervised feature extraction method in skeletal X-ray image anomaly detection， an anomaly detection method for skeletal X-ray images based on self-supervised feature extraction was proposed. The self-supervised learning framework and Vision Transformer （ViT） model were combined for feature extraction in skeletal anomaly detection， and anomaly detection classification was carried out by linear classifiers， which can effectively avoid the dependence of supervised models on large-scale labeled data in the feature extraction stage. Experiments were performed on publicly available skeletal X-ray image datasets， and the skeletal anomaly detection models based on pre-trained Convolutional Neural Network （CNN） and self-supervised feature extraction were evaluated by accuracy. Experimental results show that the self-supervised feature extraction model performs better than the general CNN models； its classification results on seven body parts are similar to those of supervised CNN models， while its anomaly detection accuracy for elbow， finger and humerus is the best among the compared models， and its average accuracy is 5.37 percentage points higher than that of ResNet50. The proposed method is easy to implement and can be used as a visual aid for radiologists’ initial diagnosis.

Identification method of influence nodes in multilayer hypernetwork based on evidence theory

Kuo TIAN, Yinghan WU, Feng HU

2024, 44(1): 182-189. DOI: 10.11772/j.issn.1001-9081.2023010021

In view of the fact that most researches on multilayer hypernetwork mainly focus on the topology structure， and influence node identification methods involve relatively single indicators， which cannot comprehensively and accurately identify influence nodes， an identification method of influence nodes in multilayer hypernetwork based on evidence theory was proposed. Firstly， based on the topology structure of multilayer hypernetwork， Multilayer Aggregation Hypernetwork （MAH） was constructed according to the idea of aggregation network. Secondly， the discernment framework of problem was defined based on evidence theory. Finally， Dempster-Shafer （D-S） evidence combination method was used to fuse local， location and global indicators of network to identify influence nodes. The proposed method was applied to physics-computer science double-layer scientific research cooperation hypernetwork constructed by arXiv dataset. Compared with hyperdegree centrality， K-shell， closeness centrality methods， etc.， the proposed method has the fastest propagation speed and reaches steady state first in the Susceptible-Infected-Susceptible （SIS） hypernetwork propagation model based on Reactive Process （RP） and Contact Process （CP） strategies. After isolating top 6% of influence nodes， the average network hyperdegree， clustering coefficient and network efficiency decreased. With the increase of proportion of isolated influence nodes， the growth rate of number of network subgraphs was similar to that of the closeness centrality method. The coarse granularity of identification result was measured by monotonicity index value， which reached 0.999 8， and recognition result had a high discrimination degree. The results of several experiments show that the proposed identification method of influence nodes in multilayer hypernetwork is accurate and effective.
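
The evidence fusion step mentioned above can be illustrated with Dempster's rule of combination. The following is a minimal sketch, not the authors' implementation: the frame of discernment {"high", "low"} and the mass values assigned to a local and a global indicator are hypothetical numbers chosen only for illustration.

```python
from itertools import product


def combine_dempster(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence accumulates on the intersection of the two focal sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint focal sets contribute to the conflict mass K.
            conflict += mass_a * mass_b
    if conflict >= 1.0 - 1e-12:
        raise ValueError("evidence is totally conflicting; the rule is undefined")
    # Normalise by 1 - K so that the combined masses sum to 1.
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


# Hypothetical frame of discernment: a node has "high" or "low" influence.
HIGH = frozenset({"high"})
LOW = frozenset({"low"})
EITHER = HIGH | LOW

# Hypothetical masses derived from a local and a global indicator of one node.
m_local = {HIGH: 0.6, LOW: 0.1, EITHER: 0.3}
m_global = {HIGH: 0.5, LOW: 0.2, EITHER: 0.3}

fused = combine_dempster(m_local, m_global)
```

Because the rule is associative, a third mass function (for example one from a location indicator) can be fused by calling `combine_dempster(fused, m_location)`.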

Maximum cycle truss community search based on hierarchical tree index on directed graphs

Chuanyu ZONG, Chunhe ZHANG, Xiufeng XIA

2024, 44(1): 190-198. DOI: 10.11772/j.issn.1001-9081.2023010071

Community search aims to find highly cohesive connected subgraphs containing user query vertices in information networks. Cycle truss is a community search model based on cycle triangles. However， the existing index-based cycle truss community search methods suffer from the drawbacks of large index space， low search efficiency， and low community cohesion. A maximum cycle truss community search method based on hierarchical tree index was proposed to address these issues. Firstly， a k-cycle truss decomposition algorithm was proposed， and two important concepts， cycle triangle connectivity and k-level equivalence， were introduced. Based on k-level equivalence， the hierarchical tree index TreeCIndex and the table index SuperTable were designed. On this basis， two efficient cycle truss community search algorithms were proposed. The proposed algorithms were compared with existing community search algorithms based on TrussIndex and EquiTruss on four real datasets. The experimental results show that the space consumptions of TreeCIndex and SuperTable are at least 41.5% lower and the index construction time is 8.2% to 98.3% lower compared to TrussIndex and EquiTruss； furthermore， the efficiency of searching for maximum cycle truss communities is increased by one and two orders of magnitude.
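
The quantity that a truss decomposition peels on is the support of each edge. The sketch below counts it under the assumption, not spelled out in the abstract, that a cycle triangle is a directed 3-cycle u→v→w→u. The function name and the toy graph are hypothetical, and the TreeCIndex and SuperTable structures are not reproduced here.

```python
def cycle_triangle_support(edges):
    """Count, for each directed edge (u, v), the cycle triangles u->v->w->u it lies in.

    Assumes a simple directed graph without self-loops.
    """
    successors = {}
    predecessors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
        predecessors.setdefault(v, set()).add(u)

    support = {}
    for u, v in edges:
        # A closing vertex w must satisfy v -> w and w -> u.
        closing = successors.get(v, set()) & predecessors.get(u, set())
        support[(u, v)] = len(closing)
    return support


# Toy graph: one directed 3-cycle 1->2->3->1 plus a dangling edge 3->4.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
toy_support = cycle_triangle_support(toy_edges)
```

A decomposition in the spirit of k-truss peeling would repeatedly delete the edges whose support falls below the current threshold and update the supports of their neighbours; that loop is omitted to keep the sketch short.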

Directed gene regulatory network inference algorithm based on t-test and stepwise network search

Du CHEN, Yuanyuan LI, Yu CHEN

2024, 44(1): 199-205. DOI: 10.11772/j.issn.1001-9081.2023010086

In order to overcome the shortage that the Path Consensus Algorithm based on Conditional Mutual Information （PCA-CMI） cannot identify the regulation direction and further improve the accuracy of network inference， a Directed Network Inference algorithm enhanced by t-Test and Stepwise Regulation Search （DNI-T-SRS） was proposed. First， the upstream and downstream relationships of genes were identified by a t-test performed on the expression data with different perturbation settings， by which the conditional genes were selected for guiding Path Consensus （PC） algorithm and calculating Conditional Mutual Inclusive Information （CMI2） to remove redundant regulations， and an algorithm named CMI2-based network inference guided by t-Test （CMI2NI-T） was developed. Then， the corresponding Michaelis-Menten differential equation model was established to fit the expression data， and the network inference result was further corrected by a stepwise network search based on Bayesian information criterion. Numerical experiments were conducted on two benchmark networks of the DREAM6 challenge， and the Area Under Curves （AUCs） of CMI2NI-T were 0.767 9 and 0.979 6， which were 16.23% and 11.62% higher than those of PCA-CMI. With the help of additional process of data fitting， the DNI-T-SRS achieved the inference accuracies of 86.67% and 100.00%， which were 18.19% and 10.52% higher than those of PCA-CMI. The experimental results demonstrate that the proposed DNI-T-SRS can eliminate indirect regulatory relationships and preserve direct regulatory connections， which contributes to precise inference results of gene regulatory networks.

Efficient similar exercise retrieval model based on unsupervised semantic hashing

Wei TONG, Liyang HE, Rui LI, Wei HUANG, Zhenya HUANG, Qi LIU

2024, 44(1): 206-216. DOI: 10.11772/j.issn.1001-9081.2023091260

Finding similar exercises aims to retrieve exercises with similar testing goals to a given query exercise from the exercise database. As online education evolves， the exercise database is growing in size， and due to the specialized nature of the exercises， it is not easy to annotate their relations. Thus， online education systems require an efficient and unsupervised model for finding similar exercises. Unsupervised semantic hashing can map high-dimensional data to compact and efficient binary representations without supervision signals. However， it is inadequate to simply apply a semantic hashing model to similar exercise retrieval， because exercise data contains rich semantic information while the representation space of binary vectors is limited. To address this issue， a similar exercise retrieval model was introduced to acquire and retain crucial information. Firstly， a crucial information acquisition module was designed to acquire critical information from exercise data， and a de-redundancy object loss was proposed to eliminate redundant information. Secondly， a time-aware activation function was introduced to reduce coding information loss. Thirdly， to maximize the utilization of the Hamming space， a bit balance loss and a bit independent loss were introduced to optimize the distribution of binary representations in the optimization process. Experimental results on MATH and HISTORY datasets demonstrate that the proposed model outperforms the state-of-the-art text semantic hashing model Deep Hash InfoMax （DHIM）， with an average improvement of approximately 54% and 23% respectively across three recall settings. Moreover， compared to the best-performing similar exercise retrieval model QuesCo， the proposed model demonstrates a clear advantage in search efficiency.
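
The efficiency argument for semantic hashing rests on the fact that comparing binary codes needs only an XOR and a bit count. The sketch below shows retrieval by Hamming distance over a hypothetical database of 8-bit codes. How the codes are learned (the losses and activation function described above) is outside its scope, and all names and values are illustrative assumptions.

```python
def hamming(code_a, code_b):
    """Number of differing bits between two binary codes stored as integers."""
    return bin(code_a ^ code_b).count("1")


def retrieve(query_code, database, top_k=3):
    """Return the names of the top_k database items closest to the query code.

    Ties are broken by item name so that the ranking is deterministic.
    """
    ranked = sorted(
        database.items(),
        key=lambda item: (hamming(query_code, item[1]), item[0]),
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical 8-bit codes for three exercises.
exercise_codes = {
    "exercise_a": 0b10110010,
    "exercise_b": 0b10110011,
    "exercise_c": 0b01001101,
}

nearest = retrieve(0b10110010, exercise_codes, top_k=2)
```

With real codes of 32 to 128 bits the same XOR-and-count comparison applies unchanged, since Python integers are arbitrary precision.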

Differential privacy clustering algorithm in horizontal federated learning

Xueran XU, Geng YANG, Yuxian HUANG

2024, 44(1): 217-222. DOI: 10.11772/j.issn.1001-9081.2023010019

Clustering analysis can uncover hidden interconnections between data and segment the data according to multiple indicators， which can facilitate personalized and refined operations. However， data fragmentation and isolation caused by data islands seriously affect the effectiveness of cluster analysis applications. To solve the data island problem and protect data privacy， an Equivalent Local differential privacy Federated K-means （ELFedKmeans） algorithm was proposed. A grid-based initial cluster center selection method and a privacy budget allocation scheme were designed for the horizontal federated learning model. To generate the same random noise with lower communication cost， all organizations jointly negotiated random seeds， protecting local data privacy. The ELFedKmeans algorithm was shown through theoretical analysis to satisfy differential privacy， and it was also compared with Local Differential Privacy distributed K-means （LDPKmeans） algorithm and Hybrid Privacy K-means （HPKmeans） algorithm on different datasets. Experimental results show that all three algorithms increase F-measure and decrease SSE （Sum of Squares due to Error） gradually as privacy budget increases. As a whole， the F-measure values of ELFedKmeans algorithm were 1.794 5% to 57.066 3% and 21.245 2% to 132.048 8% higher than those of LDPKmeans and HPKmeans algorithms respectively； the Log（SSE） values of ELFedKmeans algorithm were 1.204 2% to 12.894 6% and 5.617 5% to 27.575 2% less than those of LDPKmeans and HPKmeans algorithms respectively. With the same privacy budget， ELFedKmeans algorithm outperforms the comparison algorithms in terms of clustering quality and utility metric.

Deep shadow defense scheme of federated learning based on generative adversarial network

Hui ZHOU, Yuling CHEN, Xuewei WANG, Yangwen ZHANG, Jianjiang HE

2024, 44(1): 223-232. DOI: 10.11772/j.issn.1001-9081.2023010088

Federated Learning （FL） allows users to share and interact with multiple parties without directly uploading the original data， effectively reducing the risk of privacy leaks. However， existing research suggests that the adversary can still reconstruct raw data through shared gradient information. To further protect the privacy of federated learning， a deep shadow defense scheme of federated learning based on Generative Adversarial Network （GAN） was proposed. The original real data distribution features were learned by GAN and replaceable shadow data was generated. Then， the original model trained on real data was replaced by a shadow model trained on shadow data and was not directly accessible to the adversary. Finally， the real gradient was replaced by the shadow gradient generated by the shadow data in the shadow model and was not accessible to the adversary. Experiments were conducted on CIFAR10 and CIFAR100 datasets to compare the proposed scheme with five defense schemes： adding noise， gradient clipping， gradient compression， representation perturbation， and local regularization and sparsification. On CIFAR10 dataset， the Mean Square Error （MSE） and the Feature Mean Square Error （FMSE） of the proposed scheme were 1.18-5.34 times and 4.46-1.03×10⁷ times those of the comparison schemes， respectively， and the Peak Signal-to-Noise Ratio （PSNR） of the proposed scheme was 49.9%-90.8% of theirs. On CIFAR100 dataset， the MSE and the FMSE of the proposed scheme were 1.04-1.06 times and 5.93-4.24×10³ times those of the comparison schemes， respectively， and the PSNR of the proposed scheme was 96.0%-97.6% of theirs. Compared with the deep shadow defense method， the proposed scheme takes into account the actual attack capability of the adversary and the problems in shadow model training， and designs threat models and shadow model generation algorithms. It performs better than the comparison schemes in both theoretical analysis and experimental results， and it can effectively reduce the risk of federated learning privacy leaks while ensuring accuracy.

Authenticatable privacy-preserving scheme based on signcryption from lattice for vehicular ad hoc network

Jianyang CUI, Ying CAI, Yu ZHANG, Yanfang FAN

2024, 44(1): 233-241. DOI: 10.11772/j.issn.1001-9081.2023010083

To address the issues of user privacy leakage and message authentication in Vehicular Ad hoc NETwork （VANET）， an authenticatable privacy-preserving scheme based on signcryption from lattice was proposed. Firstly， the public key of receiver was used to signcrypt the message to generate the ciphertext， and only the receiver with corresponding private key could decrypt the ciphertext， which ensures messages visible only to authorized users. Secondly， after decrypting the message， the receiver calculated the hash value of the message by one-way secure hash function， and judged whether the hash value of the message changed， which realized message authentication. Finally， Number Theoretic Transform （NTT） algorithm was used to reduce the computational overhead of polynomial multiplication and improve the computational efficiency of the scheme. The proposed scheme was proved to have INDistinguishability under Chosen Ciphertext Attack （IND-CCA2） and Strong UnForgeability under Chosen Message Attack （SUF-CMA） under the random oracle model. In addition， the security of the proposed scheme is based on lattice hardness problems， so that it can resist quantum algorithm attack. Simulation experiment results show that the proposed scheme improves the performance in terms of communication delay （at least reducing 10.01%）， message loss rate （at least reducing 31.79%） and communication overhead （at least reducing 31.25%） compared to similar authenticated privacy-preserving schemes and a lattice-based signature scheme. Therefore， the proposed scheme is more suitable for resource-constrained VANETs.
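
To show why the Number Theoretic Transform lowers the cost of polynomial multiplication (from quadratic to O(n log n) coefficient operations), here is a generic sketch. It is an illustration under stated assumptions rather than the scheme's implementation: it uses the well-known NTT-friendly prime 998244353 with primitive root 3 and computes an ordinary polynomial product, whereas a lattice scheme would use its own modulus and typically reduce modulo a ring polynomial such as x^n + 1.

```python
MOD = 998244353  # prime of the form c * 2^k + 1 (assumed here, not the scheme's modulus)
ROOT = 3         # primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative in-place style NTT over Z_MOD; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes with doubling block length.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def poly_multiply(p, q):
    """Multiply two coefficient lists (lowest degree first) modulo MOD."""
    result_len = len(p) + len(q) - 1
    size = 1
    while size < result_len:
        size <<= 1
    fp = ntt(p + [0] * (size - len(p)))
    fq = ntt(q + [0] * (size - len(q)))
    pointwise = [x * y % MOD for x, y in zip(fp, fq)]
    return ntt(pointwise, invert=True)[:result_len]


# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
product_coeffs = poly_multiply([1, 2], [3, 4])
```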

User plagiarism identification scheme in social network under blockchain

Li LI, Chunyan YANG, Jiangwen ZHU, Ronglei HU

2024, 44(1): 242-251. DOI: 10.11772/j.issn.1001-9081.2023010031

To address the difficulty of identifying user plagiarism in social networks and to protect the rights of original authors while holding users accountable for plagiarism actions， a plagiarism identification scheme for social network users under blockchain was proposed. To address the lack of a universal tracing model in existing blockchains， a blockchain-based traceability information management model was designed to record user operation information and provide a basis for text similarity detection. Based on the Merkle tree and Bloom filter structures， a new index structure BHMerkle was designed. The calculation overhead of block construction and query was reduced， and the rapid positioning of transactions was realized. At the same time， a multi-feature weighted Simhash algorithm was proposed to improve the precision of word weight calculation and the efficiency of the signature value matching stage. In this way， malicious users who plagiarize could be identified， and the occurrence of malicious behavior could be curbed through the reward and punishment mechanism. The average precision and recall of the plagiarism detection scheme on news datasets with different topics were 94.8% and 88.3%， respectively. Compared with multi-dimensional Simhash algorithm and Simhash algorithm based on information Entropy weighting （E-Simhash）， the average precision was increased by 6.19 and 4.01 percentage points respectively， and the average recall was increased by 3.12 and 2.92 percentage points respectively. Experimental results show that the proposed scheme improves the query and detection efficiency of plagiarism text， and has high accuracy in plagiarism identification.
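
For readers unfamiliar with Simhash, the sketch below shows the baseline weighted variant: each token votes on every fingerprint bit with its weight, and two texts are compared by the Hamming distance of their fingerprints. The token weights are hypothetical placeholders. The multi-feature weighting proposed in the paper is not described in the abstract and is not reproduced here, and MD5 is used only as a convenient stable hash.

```python
import hashlib

FINGERPRINT_BITS = 64


def simhash(weighted_tokens):
    """Compute a 64-bit Simhash fingerprint from a dict of token -> weight."""
    totals = [0.0] * FINGERPRINT_BITS
    for token, weight in weighted_tokens.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(FINGERPRINT_BITS):
            # A set bit adds the token weight, a clear bit subtracts it.
            if (token_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical token weights for two posts (for example TF-IDF-like scores).
original_post = {"blockchain": 3.0, "plagiarism": 2.5, "detection": 1.0, "scheme": 0.5}
suspect_post = {"blockchain": 3.0, "plagiarism": 2.5, "detection": 1.0, "method": 0.5}

distance = hamming_distance(simhash(original_post), simhash(suspect_post))
```

A detector would flag a pair when `distance` falls below a chosen threshold; the threshold itself has to be tuned on data and is not given in the source.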

Blockchain-based vehicle-to-infrastructure fast handover authentication scheme in VANET

Juangui NING, Guofang DONG

2024, 44(1): 252-260. DOI: 10.11772/j.issn.1001-9081.2023010068

Aiming at the problems of security risk in vehicle communication and complex identity re-authentication when vehicles enter new infrastructure coverage in Vehicular Ad hoc NETwork （VANET）， a blockchain-based V2I （Vehicle-to-Infrastructure） fast handover authentication scheme in VANET was proposed. The decentralized， distributed and tamper-proof characteristics of blockchain were utilized to realize the storage and query of vehicle authentication information. Token mechanism was used to reduce the number of queries of blockchain， and simplify handover authentication process between Road Side Units （RSUs）. Because only the validity of token needed to be checked in subsequent authentication， rapid handover authentication of RSU was realized. Batch authentication was adopted to reduce the computation overhead and improve the efficiency of message authentication. In addition， the traceability and revocation of malicious vehicles was realized， and the anonymous identities of vehicles were updated in time to ensure the anonymity of vehicles. Compared with anonymous batch authentication scheme， authentication scheme with full aggregation， certificateless aggregate signature scheme， blockchain-based authentication scheme， the proposed scheme reduced the time consumption for message authentication by 51.1%， 77.45%， 77.56% and 76.01%. The experimental results show that proposed scheme can effectively reduce the computation overhead and communication overhead in VANET.

Incentive mechanism of crowdsourcing multi-task assignment against malicious bidding

Peiyao ZHANG, Xiaodong FU

2024, 44(1): 261-268. DOI: 10.11772/j.issn.1001-9081.2023010024

The rapid development of crowdsourcing has enriched workers’ experience and skills， making them more aware of tasks and inclined to complete multiple tasks at the same time. Therefore， assigning tasks according to workers’ subjective preferences has become a common way of task assignment. However， out of personal interests， workers may engage in malicious bidding to obtain higher utility， which is detrimental to the development of crowdsourcing platforms. To this end， an incentive mechanism of crowdsourcing multi-task assignment against malicious bidding was proposed， named GIMSM （Greedy Incentive Mechanism for Single-Minded）. First， a linear ratio was defined as the allocation basis by this mechanism. Then， according to the greedy strategy， tasks were selected and assigned from a sequence of increasing worker ratios. Finally， the workers selected by the allocation algorithm were paid according to the payment function， and the result of task assignment was obtained. The experiments were conducted on Taxi and Limousine Commission Trip Record Data dataset. Compared to TODA （Truthful Online Double Auction mechanism）， TCAM （Truthful Combinatorial Auction Mechanism） and FU method， GIMSM’s average quality level of task results under different numbers of workers increased by 25.20 percentage points， 13.20 percentage points and 4.40 percentage points， respectively. GIMSM’s average quality level of task results under different numbers of tasks increased by 26.17 percentage points， 16.17 percentage points and 9.67 percentage points， respectively. In addition， the proposed mechanism GIMSM satisfies individual rationality and incentive compatibility， and can obtain task assignment results in linear time. The experimental results show that the proposed mechanism GIMSM has good anti-malicious bidding performance， and performs better on crowdsourcing platforms with a large amount of data.

Constrained multi-objective evolutionary algorithm based on two-stage search and dynamic resource allocation

Yongjian MA, Xuhua SHI, Peiyao WANG

2024, 44(1): 269-277. DOI: 10.11772/j.issn.1001-9081.2023010012

The difficulty of solving constrained multi-objective optimization problems lies in balancing objective optimization and constraint satisfaction， while balancing the convergence and diversity of solution sets. To solve complex constrained multi-objective optimization problems with large infeasible regions and small feasible regions， a constrained multi-objective evolutionary algorithm based on Two-Stage search and Dynamic Resource Allocation （TSDRA） was proposed. In the first stage， infeasible regions were crossed by ignoring constraints； in the second stage， two kinds of computing resources were allocated dynamically to coordinate local exploitation and global exploration， while balancing the convergence and diversity of the algorithm. The simulation results on LIRCMOP and MW series test problems show that compared with four representative algorithms of Constrained Multi-objective Evolutionary Algorithm with Multiple Stages （CMOEA-MS）， Two-phase （ToP）， Push and Pull Search （PPS） and Multi Stage Constrained Multi-Objective evolutionary algorithm （MSCMO）， the proposed algorithm obtains better results in both Inverted Generational Distance （IGD） and HyperVolume （HV）. TSDRA obtains 10 best IGD values and 9 best HV values on LIRCMOP series test problems， and 9 best IGD values and 10 best HV values on MW series test problems， indicating that the proposed algorithm can effectively solve problems with large infeasible regions and small feasible regions.
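
Inverted Generational Distance, one of the two metrics reported above, averages the distance from each point of a reference Pareto front to its nearest obtained solution, so lower values indicate better convergence and coverage. A minimal sketch with hypothetical two-objective points follows; the reference fronts of the LIRCMOP and MW test problems are not reproduced.

```python
import math


def igd(reference_front, obtained_solutions):
    """Mean distance from each reference point to its nearest obtained solution."""
    total = 0.0
    for ref_point in reference_front:
        total += min(math.dist(ref_point, sol) for sol in obtained_solutions)
    return total / len(reference_front)


# Hypothetical bi-objective reference front and two candidate solution sets.
reference = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
well_spread = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
clustered = [(0.0, 1.0)]

igd_spread = igd(reference, well_spread)
igd_clustered = igd(reference, clustered)
```

The clustered set sits on the front yet scores worse, which is why IGD is read as a joint measure of convergence and diversity.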

Construction and application of 3D dataset of human grasping objects

Jian LIU, Chenchen YOU, Jinming CAO, Qiong ZENG, Changhe TU

2024, 44(1): 278-284. DOI: 10.11772/j.issn.1001-9081.2023010009

Realistic human grasping data is of vital importance in the research of human grasping behavior analysis and human-like robotic grasping. A grasping dataset should include object shape information， contact points， and hand shapes and poses. However， related works often capture images or videos to estimate the human grasping behavior， which leads to inaccurate joint degrees of freedom. Virtual Reality （VR） technology was used to establish a virtual environment， and digital gloves were used to directly capture 3D objects and hand poses in the virtual environment as the captured data. The proposed dataset contains 91 objects with various shapes （each with 108 poses） from 49 object categories， and 52 173 3D hand grasps； its scale and richness far exceed those of existing datasets used to study human grasping behavior and human-centered grasp technology. In addition， the collected dataset was used for grasp saliency analysis and human-like grasping calculation， and the experimental results demonstrate the practical value of this dataset.

Dimensional analysis of cutting edges of acetabular reamer based on 3D point cloud processing

Guowei YANG, Qifan CHEN, Xinyue LIU, Xiaoyang WANG

2024, 44(1): 285-291. DOI: 10.11772/j.issn.1001-9081.2023010033

The acetabular reamer is one of the most important surgical tools in hip replacement surgery. The milling quality of the acetabular reamer on the acetabulum is affected by dimensional changes of its cutting edges. The wear of an acetabular reamer can be examined by processing its 3D point cloud， so a dimensional analysis algorithm for the cutting edges of the acetabular reamer based on 3D point cloud processing was proposed. First， an algorithm with a tangent plane and a maximum angle criterion was introduced to obtain the boundary point cloud of the acetabular reamer based on the boundary characteristics of the tooth holes. Second， the boundary point cloud was partitioned into individual tooth hole point clouds by the K-means clustering algorithm， and then the point cloud of each tooth hole boundary was searched by a radius nearest neighbor search algorithm to obtain the point cloud of the cutting edges belonging to different tooth holes. Finally， the RANSAC （RANdom SAmple Consensus） algorithm was used to fit the point cloud of the acetabular reamer to a sphere， and the Euclidean distance from the point cloud of the cutting edges to the center of the fitted sphere was calculated to analyze the cutting edge dimensions of the acetabular reamer. PCL （Point Cloud Library） was used as the development framework to process the point cloud of the acetabular reamer. The accuracy of hole segmentation of the point cloud of the acetabular reamer is 100%， and the accuracy of the spherical fitting radius of the point cloud is 0.004 mm. Experimental results show that the proposed algorithm performs well on the point cloud processing of the acetabular reamer， and can effectively realize the dimensional analysis of its cutting edges.
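The final step described above (RANSAC sphere fitting followed by point-to-center distances) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the paper's PCL implementation: the function names, the inlier tolerance `tol`, the iteration count and the synthetic data are all hypothetical.

```python
import math
import random

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def sphere_from_4_points(p0, p1, p2, p3):
    """Sphere through four points: solve 2(pi - p0) . c = |pi|^2 - |p0|^2
    for the center c by Cramer's rule. Returns None if the points are
    (nearly) coplanar."""
    rows, rhs = [], []
    for p in (p1, p2, p3):
        rows.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        rhs.append(sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)))
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    center = []
    for col in range(3):
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / d)
    return tuple(center), math.dist(center, p0)

def ransac_sphere(points, tol=0.01, iters=200, seed=0):
    """Keep the candidate sphere with the most points within tol of its surface."""
    rng = random.Random(seed)
    best, best_inliers = None, -1
    for _ in range(iters):
        model = sphere_from_4_points(*rng.sample(points, 4))
        if model is None:
            continue
        center, radius = model
        inliers = sum(1 for p in points
                      if abs(math.dist(p, center) - radius) < tol)
        if inliers > best_inliers:
            best, best_inliers = model, inliers
    return best

def radial_deviations(edge_points, center, radius):
    """Signed distance of each cutting-edge point from the fitted sphere surface."""
    return [math.dist(p, center) - radius for p in edge_points]
```

In this sketch, a positive deviation means the edge point lies outside the fitted sphere and a negative one means it lies inside; how such deviations map to wear is not specified in the abstract and would have to follow the paper.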

Lightweight image super-resolution reconstruction network based on Transformer-CNN

Hao CHEN, Zhenping XIA, Cheng CHENG, Xing LIN-LI, Bowen ZHANG

2024, 44(1): 292-299. DOI: 10.11772/j.issn.1001-9081.2023010048

Aiming at the high computational complexity and large memory consumption of existing super-resolution reconstruction networks， a lightweight image super-resolution reconstruction network based on Transformer-CNN was proposed， which made the super-resolution reconstruction network more suitable for embedded terminals such as mobile platforms. Firstly， a hybrid block based on Transformer-CNN was proposed， which enhanced the ability of the network to capture local-global depth features. Then， a modified inverted residual block， with special attention to the characteristics of the high-frequency region， was designed， so that feature extraction ability was improved and inference time was reduced. Finally， after exploring the best options for the activation function， the GELU （Gaussian Error Linear Unit） activation function was adopted to further improve the network performance. Experimental results show that the proposed network achieves a good balance between image super-resolution performance and network complexity， and reaches an inference speed of 91 frame/s on the benchmark dataset Urban100 with a scale factor of 4， which is 11 times faster than the excellent network SwinIR （Image Restoration using Swin transformer）. This indicates that the proposed network can efficiently reconstruct the textures and details of the image while significantly reducing inference time.
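As a brief illustration of the GELU activation mentioned above， the sketch below gives its standard definition， GELU(x) = x·Φ(x) with Φ the standard normal CDF， alongside the widely used tanh approximation. Which variant the proposed network uses is not stated in the abstract; the function names are illustrative.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(value, gelu(value), gelu_tanh(value))
```

Unlike ReLU， GELU is smooth and lets small negative inputs pass with a small negative output instead of clamping them to zero.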

Integrated deep reinforcement learning portfolio model

Jie LONG, Liang XIE, Haijiao XU

2024, 44(1): 300-310. DOI: 10.11772/j.issn.1001-9081.2023010028

The portfolio problem is a hot issue in the field of quantitative trading. An Integrated Deep Reinforcement Learning Portfolio Model （IDRLPM） was proposed to address the shortcomings of existing deep reinforcement learning-based portfolio models， which can neither achieve adaptive trading strategies nor effectively utilize supervised information. Firstly， a multi-agent method was used to construct multiple base agents， and reward functions with different trading styles were designed to represent different trading strategies. Secondly， an integrated learning method was used to fuse the features of the strategy networks of the base agents to obtain an integrated agent that adapts to the market environment. Then， a trend prediction network based on Convolutional Block Attention Module （CBAM） was embedded in the integrated agent， and the output of the trend prediction network guided the integrated strategy network to adaptively select the proportion of trades. Finally， under the alternating iterative training of supervised deep learning and reinforcement learning， IDRLPM effectively utilized supervised information from the training data to enhance model profitability. The Sharpe Ratio （SR） of IDRLPM reaches 1.87 and 1.88， and the Cumulative Return （CR） reaches 2.02 and 1.34 on Shanghai Stock Exchange （SSE） 50 constituent stocks and China Securities Index （CSI） 500 constituent stocks， respectively； compared with the Ensemble Deep Reinforcement Learning （EDRL） trading model， the SR improves by 105% and 55%， and the CR improves by 124% and 79%. The experimental results show that IDRLPM can effectively solve the portfolio problem.
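The two evaluation metrics reported above can be sketched as follows， using their common textbook definitions. The abstract does not state the risk-free rate， the return frequency or the annualization convention the authors used， so the defaults below （zero risk-free rate， 252 trading periods per year， sample standard deviation） are assumptions， as are the function names.

```python
import math
import statistics

def cumulative_return(returns):
    """Compound a sequence of per-period simple returns; 0.25 means +25%."""
    value = 1.0
    for r in returns:
        value *= 1.0 + r
    return value - 1.0

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its sample
    standard deviation, scaled by sqrt(periods_per_year)."""
    excess = [r - risk_free for r in returns]
    spread = statistics.stdev(excess)
    if spread == 0.0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)

# Hypothetical daily returns, for illustration only.
daily = [0.010, -0.004, 0.007, 0.002, -0.003]
print(cumulative_return(daily), sharpe_ratio(daily))
```

Results computed under a different convention （for example， a non-zero risk-free rate） are not directly comparable with the figures quoted in the abstract.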

Short-term power load forecasting by graph convolutional network combining LSTM and self-attention mechanism

Hanxiao SHI, Leichun WANG

2024, 44(1): 311-317. DOI: 10.11772/j.issn.1001-9081.2023010078

Aiming at the problems of existing power load forecasting models， such as heavy modeling workload， insufficient spatiotemporal joint representation， and low forecasting accuracy， a Short-Term power Load Forecasting model based on Graph Convolutional Network （GCN） combining Long Short-Term Memory （LSTM） network and Self-attention mechanism （GCNLS-STLF） was proposed. Firstly， the original multi-dimensional time series data was transformed into a power load graph containing the correlations between series by using LSTM and the self-attention mechanism. Then， features were extracted from the power load graph by GCN， LSTM and Graph Fourier Transform （GFT）. Finally， a fully connected layer was used to reconstruct the features， and the residual was used to forecast the power load multiple times to enhance the expression ability of the original power load data. The short-term power load forecasting experimental results on real historical power load data of power stations in Morocco and Panama show that， compared with Support Vector Machine （SVM）， LSTM， the hybrid model CNN-LSTM and attention-based CNN-LSTM （CNN-LSTM-attention）， the Mean Absolute Percentage Error （MAPE） of GCNLS-STLF is reduced by 1.94， 0.90， 0.49 and 0.37 percentage points， respectively， on the entire Morocco power load test set； on the Panama power load test set， the MAPE of GCNLS-STLF is reduced by 1.39， 0.94， 0.38 and 0.29 percentage points， respectively， in March and by 1.40， 0.99， 0.35 and 0.28 percentage points， respectively， in June. Experimental results show that GCNLS-STLF can effectively extract key features of power load， and its forecasting performance is satisfactory.
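For reference， the MAPE metric used in the comparison above is conventionally computed as the mean of |actual − forecast| / |actual|， expressed in percent. The sketch below follows that standard definition； the function name and the load values are hypothetical.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical load values (MW), for illustration only.
actual_load = [100.0, 200.0, 400.0]
predicted_load = [110.0, 180.0, 400.0]
print(mape(actual_load, predicted_load))
```

A reduction stated in percentage points， as in the abstract， is the absolute difference between two MAPE values （for example， from 3.0% to 2.5% is 0.5 percentage points）， not a relative change.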

Trajectory similarity measurement algorithm based on three-dimensional space area division

Kai XU, Qikai GAO, Ming YIN, Jingjing TAN

2024, 44(1): 318-323. DOI: 10.11772/j.issn.1001-9081.2023010077

Aiming at the problem that most trajectory similarity measurement algorithms cannot distinguish trajectories with opposite directions， a three-dimensional Triangulation Division （3TD） algorithm based on three-dimensional space area division was proposed. Firstly， the absolute time series of the trajectory set was transformed into a relative time series according to the time conversion rules of the 3TD algorithm. Then， in the three-dimensional coordinate system composed of longitude， latitude， and time， the area between trajectories was divided into several non-overlapping triangles by partitioning rules， the areas of the triangles were accumulated， and the trajectory similarity was calculated. Finally， the proposed algorithm was compared with the Longest Common SubSequence （LCSS） algorithm and the Triangle Division （TD） algorithm on a randomly sampled trajectory dataset collected from the ship Automatic Identification System （AIS）. Experimental results show that the accuracy of the 3TD algorithm reaches 100%. At the same time， the proposed algorithm maintains accurate measurement results and high operation efficiency on massive datasets and on datasets with partially missing trajectory points， so it adapts better to the similarity measurement of divergent trajectories.
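The idea of accumulating triangle areas in （longitude， latitude， time） space can be illustrated with the simplified sketch below. It is not the 3TD partitioning rule， which the abstract does not specify： it assumes two trajectories with the same number of points， converts time to an offset from each trajectory's first sample， and splits each strip between consecutive point pairs into two triangles. All function names are hypothetical， and in practice the three axes would need a common scale.

```python
import math

def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))

def relative_times(trajectory):
    """Replace absolute timestamps with offsets from the first sample."""
    t0 = trajectory[0][2]
    return [(lon, lat, t - t0) for lon, lat, t in trajectory]

def area_between(p, q):
    """Accumulated triangle area between two equal-length trajectories
    (assumed pairing; smaller area means more similar)."""
    if len(p) != len(q):
        raise ValueError("this sketch assumes equal-length trajectories")
    p, q = relative_times(p), relative_times(q)
    total = 0.0
    for i in range(len(p) - 1):
        total += triangle_area(p[i], q[i], p[i + 1])
        total += triangle_area(q[i], q[i + 1], p[i + 1])
    return total

forward = [(0.0, 0.0, 100.0), (1.0, 0.0, 101.0), (2.0, 0.0, 102.0)]
backward = [(2.0, 0.0, 100.0), (1.0, 0.0, 101.0), (0.0, 0.0, 102.0)]
print(area_between(forward, forward), area_between(forward, backward))
```

The two toy trajectories cover the same path in opposite directions. A purely spatial comparison would treat them as identical， whereas including the time axis yields a non-zero area between them， which is the property the abstract highlights.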

Hatch recognition algorithm of bulk cargo ship based on incomplete point cloud normal filtering and compensation

Yumin SONG, Hao SUN, Zhan LI, Chang’an LI, Xiaoshu QIAO

2024, 44(1): 324-330. DOI: 10.11772/j.issn.1001-9081.2023010051

The automatic ship loading system， an important part of smart port construction， can greatly reduce port operating costs and improve economic benefits. Hatch recognition is the first step of the automatic ship loading task， and its success rate and accuracy are prerequisites for the smooth progress of subsequent tasks. Collected ship point cloud data is often incomplete because of the limited number and angles of the port lidars. In addition， the collected point cloud cannot accurately express the geometric information of the hatch， because a large amount of material often accumulates near it. These problems occur frequently in actual port loading operations and significantly reduce the recognition success rate of existing algorithms， which negatively affects automatic ship loading. Therefore， it is urgent to improve the success rate of hatch recognition when the ship point cloud contains material interference or incomplete hatch data. By analyzing the ship structural features and the point cloud data collected during automatic ship loading， a hatch recognition algorithm for bulk cargo ships based on incomplete point cloud normal filtering and compensation was proposed. Experiments verify that both the recognition success rate and the recognition accuracy are improved compared with Miao’s and Li’s hatch recognition algorithms. The experimental results show that the proposed algorithm can not only filter out the material noise in the hatch， but also compensate for the missing data， which effectively improves hatch recognition.
