Deploying the YOLOv8L model on edge devices for road crack detection can achieve high accuracy, but real-time detection is difficult to guarantee. To solve this problem, an object detection algorithm based on an improved YOLOv8 model, deployable on the edge computing device Jetson AGX Xavier, was proposed. First, a Faster Block structure was designed using partial convolution to replace the Bottleneck structure in the YOLOv8 C2f module, and the improved C2f module was denoted C2f-Faster; second, an SE (Squeeze-and-Excitation) channel attention layer was connected after each C2f-Faster module in the YOLOv8 backbone network to further improve the detection accuracy. Experimental results on the open-source road damage dataset RDD20 (Road Damage Detection 20) show that the average F1 score of the proposed method is 0.573, the detection speed is 47 Frames Per Second (FPS), and the model size is 55.5 MB. Compared with the SOTA (State-Of-The-Art) model of GRDDC2020 (Global Road Damage Detection Challenge 2020), the proposed method has the F1 score increased by 0.8 percentage points, the FPS increased by 291.7%, and the model size reduced by 41.8%, realizing real-time and accurate detection of road cracks on edge devices.
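The two modifications can be pictured concretely. Below is a minimal PyTorch sketch of a Faster Block built on partial convolution, with an SE channel-attention layer appended after it; the channel split ratio, expansion factor, and class names are illustrative assumptions, not the authors' exact implementation. The point of partial convolution is that only a fraction of the channels are convolved, which cuts FLOPs and memory access and makes the block edge-friendly.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Convolve only the first 1/ratio of the channels; pass the rest through."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.conv_ch = channels // ratio              # channels actually convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1)

    def forward(self, x):
        a, b = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(a), b], dim=1)

class FasterBlock(nn.Module):
    """PConv followed by a 1x1 expand/compress pair, replacing the Bottleneck."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PartialConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        return x + self.pw(self.pconv(x))             # residual connection

class SELayer(nn.Module):
    """Squeeze-and-Excitation: reweight channels using global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))               # squeeze: global average pool
        return x * w.view(x.size(0), -1, 1, 1)        # excite: per-channel scaling

x = torch.randn(1, 64, 80, 80)
y = SELayer(64)(FasterBlock(64)(x))                   # one C2f-Faster unit + SE
print(y.shape)                                        # torch.Size([1, 64, 80, 80])
```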
U-shaped Network (U-Net), based on Fully Convolutional Network (FCN), is widely used as the backbone of medical image segmentation models, but Convolutional Neural Network (CNN) is not good at capturing long-range dependencies, which limits further performance improvement of segmentation models. To solve the above problem, researchers have applied Transformer to medical image segmentation models to make up for the deficiency of CNN, and U-shaped segmentation networks combined with Transformer have become a hot research topic. After a detailed introduction of U-Net and Transformer, the related medical image segmentation models were categorized by the position of the Transformer module: only in the encoder or decoder, in both the encoder and decoder, in the skip connections, and others. The basic contents, design concepts, and possible improvement aspects of these models were discussed, and the advantages and disadvantages of placing Transformer in different positions were analyzed. According to the analysis results, the biggest factor deciding the position of Transformer is the characteristics of the target segmentation task, and segmentation models combining Transformer with U-Net can make better use of the advantages of both CNN and Transformer to improve segmentation performance, which has great development prospects and research value.
Aiming at the problems of low detection precision, poor robustness, and imperfect related systems in current small-object detection of electric vehicle helmets, a helmet detection model based on an improved YOLOv5s algorithm was proposed. In the proposed model, the Convolutional Block Attention Module (CBAM) and Coordinate Attention (CA) module were introduced, and Distance Intersection over Union-Non-Maximum Suppression (DIoU-NMS) was used in place of the original Non-Maximum Suppression (NMS). At the same time, multi-scale feature fusion detection was added and a densely connected network was combined to improve the feature extraction effect. Finally, a helmet detection system for electric vehicle drivers was established. Compared with the original YOLOv5s on the self-built electric vehicle helmet wearing dataset, the improved YOLOv5s algorithm had the mean Average Precision (mAP) at an Intersection over Union (IoU) of 0.5 increased by 7.1 percentage points, and the Recall increased by 1.6 percentage points. Experimental results show that the improved YOLOv5s algorithm can better meet the precision requirements for detecting electric vehicles and their drivers' helmets in actual situations, and reduce the incidence of electric vehicle traffic accidents to a certain extent.
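To make the NMS replacement concrete, here is a hedged NumPy sketch of DIoU-NMS: a candidate box is suppressed only when its IoU with the top-scoring box, minus a normalized center-distance penalty, exceeds the threshold, which helps keep nearby but distinct riders that plain NMS would merge. The [x1, y1, x2, y2] box layout and the threshold value are assumptions.

```python
import numpy as np

def diou_nms(boxes, scores, thresh=0.5):
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Intersection-over-union with the current top-scoring box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Squared center distance over squared diagonal of the enclosing box.
        d2 = (cx[i] - cx[rest]) ** 2 + (cy[i] - cy[rest]) ** 2
        c2 = ((np.maximum(boxes[i, 2], boxes[rest, 2]) -
               np.minimum(boxes[i, 0], boxes[rest, 0])) ** 2 +
              (np.maximum(boxes[i, 3], boxes[rest, 3]) -
               np.minimum(boxes[i, 1], boxes[rest, 1])) ** 2 + 1e-9)
        order = rest[iou - d2 / c2 <= thresh]   # suppress only when DIoU > thresh
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(diou_nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]
```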
6 Degree of Freedom (DoF) pose estimation is a key technology in computer vision and robotics, and has become a crucial task in fields such as robot manipulation, autonomous driving, and augmented reality: estimating the 6 DoF pose of an object from a given input image, that is, 3 DoF translation and 3 DoF rotation. Firstly, the concept of 6 DoF pose was introduced, along with the problems of traditional methods based on feature point correspondence, template matching, and three-dimensional feature descriptors. Then, the current mainstream deep learning-based 6 DoF pose estimation algorithms were introduced in detail from several angles: feature correspondence-based, pixel voting-based, and regression-based methods, as well as methods oriented to multi-object instances, synthetic data, and category-level estimation. At the same time, the datasets and evaluation metrics commonly used in pose estimation were summarized, and some algorithms were evaluated experimentally to show their performance. Finally, the challenges and key future research directions of pose estimation were given.
Aiming at the problem of low accuracy of ship target detection at sea, a lightweight ship target detection algorithm, YOLOShip, was proposed on the basis of an improved YOLOv5. Firstly, dilated convolution and channel attention were introduced into the Spatial Pyramid Pooling-Fast (SPPF) module, which integrated spatial feature details of different scales, strengthened semantic information, and improved the model's ability to distinguish foreground from background. Secondly, coordinate attention and lightweight mixed depthwise convolution were introduced into the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) structures to strengthen important features in the network, obtain features with more detailed information, and improve model detection ability and positioning precision. Thirdly, considering the uneven distribution and relatively small scale changes of targets in the dataset, the model was simplified and its performance further improved by modifying the anchors and decreasing the number of detection heads. Finally, the more flexible Polynomial Loss (PolyLoss) was introduced to optimize the Binary Cross Entropy Loss (BCE Loss), improving model convergence speed and precision. Experimental results show that on the SeaShips dataset, in comparison with YOLOv5s, YOLOShip has the Precision, Recall, mAP@0.5 and mAP@0.5:0.95 increased by 4.2, 5.7, 4.6 and 8.5 percentage points respectively. Thus, the proposed algorithm obtains better detection precision while meeting detection speed requirements, effectively achieving high-speed and high-precision ship detection.
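The PolyLoss substitution admits a compact illustration. The PyTorch sketch below shows the Poly-1 form applied to BCE, where the standard loss is augmented with a tunable first-order term eps * (1 - pt); the eps value is an illustrative assumption, not the paper's tuned setting.

```python
import torch
import torch.nn.functional as F

def poly1_bce(logits, targets, eps=1.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    pt = targets * p + (1 - targets) * (1 - p)   # probability of the true label
    return (bce + eps * (1 - pt)).mean()         # Poly-1: CE plus first-order term

logits = torch.randn(4, 3)                       # e.g. objectness/class logits
targets = torch.randint(0, 2, (4, 3)).float()
print(poly1_bce(logits, targets))
```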
Aiming at the problem of missed detection of small objects in the object detection process, an improved YOLOv5 (You Only Look Once) object detection algorithm based on attention mechanism and multi-scale context information was proposed. Firstly, a Multiscale Dilated Separable Convolutional Module (MDSCM) was added to the feature extraction structure to extract multi-scale feature information, increasing the receptive field while avoiding the loss of small-object information. Secondly, the attention mechanism was added to the backbone network, and location awareness information was embedded into the channel information to further enhance the feature expression ability of the algorithm. Finally, Soft-NMS (Soft-Non-Maximum Suppression) was used instead of the NMS (Non-Maximum Suppression) used by YOLOv5 to reduce the missed detection rate of the algorithm. Experimental results show that the improved algorithm achieves detection precisions of 82.80%, 71.74% and 77.11% respectively on the PASCAL VOC dataset, the DOTA aerial image dataset and the DIOR optical remote sensing dataset, which are 3.70, 1.49 and 2.48 percentage points higher than those of YOLOv5, and it has better detection effect on small objects. Therefore, the improved YOLOv5 can be better applied to small object detection scenarios in practice.
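As a rough illustration of the multi-scale dilated separable convolution idea, the PyTorch sketch below runs parallel depthwise-separable 3x3 branches with growing dilation rates and fuses them, widening the receptive field without downsampling; the branch count, dilation rates, and fusion by 1x1 convolution are assumptions, not the paper's exact MDSCM.

```python
import torch
import torch.nn as nn

class DilatedSeparableBranch(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=dilation,
                            dilation=dilation, groups=channels)  # depthwise
        self.pw = nn.Conv2d(channels, channels, 1)               # pointwise

    def forward(self, x):
        return self.pw(self.dw(x))

class MDSCM(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            DilatedSeparableBranch(channels, d) for d in dilations)
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(multi) + x       # fuse scales, keep a residual path

x = torch.randn(1, 32, 40, 40)
print(MDSCM(32)(x).shape)                 # spatial size is preserved
```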
Aiming at the high computational complexity and large memory consumption of existing super-resolution reconstruction networks, a lightweight image super-resolution reconstruction network based on Transformer-CNN was proposed, making super-resolution reconstruction more suitable for embedded terminals such as mobile platforms. Firstly, a hybrid block based on Transformer-CNN was proposed, which enhanced the network's ability to capture local-global depth features. Then, a modified inverted residual block, with special attention to the characteristics of high-frequency regions, was designed, so that feature extraction ability was improved and inference time reduced. Finally, after exploring the best options for the activation function, the GELU (Gaussian Error Linear Unit) activation function was adopted to further improve network performance. Experimental results show that the proposed network achieves a good balance between image super-resolution performance and network complexity, and reaches an inference speed of 91 frames per second on the benchmark dataset Urban100 with a scale factor of 4, which is 11 times faster than the excellent network SwinIR (Image Restoration using Swin Transformer), indicating that the proposed network can efficiently reconstruct the textures and details of an image while significantly reducing inference time.
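Below is a minimal PyTorch sketch of an inverted residual block using GELU, in the spirit of the modified block described above: expand with a 1x1 convolution, filter with a depthwise 3x3, project back, and keep the skip connection so low-frequency content passes through unchanged. The expansion factor and activation placement are assumptions.

```python
import torch
import torch.nn as nn

class InvertedResidualGELU(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.GELU(),                # expand
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),   # depthwise
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1),                           # project
        )

    def forward(self, x):
        return x + self.block(x)          # skip keeps low-frequency content

x = torch.randn(1, 48, 64, 64)
print(InvertedResidualGELU(48)(x).shape)  # torch.Size([1, 48, 64, 64])
```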
2D/3D medical image registration is a key technology in 3D real-time navigation for orthopedic surgery. However, traditional 2D/3D registration methods based on iterative optimization require multiple iterative calculations and cannot meet doctors' requirements for real-time registration during surgery. To solve this problem, a pose regression network based on an autoencoder was proposed. In this network, geometric pose information was captured through hidden-space decoding, so that the 3D pose of the preoperative spine corresponding to the intraoperative X-ray image was quickly regressed, and the final registration image was generated through reprojection. By introducing new loss functions, the model was constrained with a “rough to fine” combined registration strategy to ensure the accuracy of pose regression. From the CTSpine1K spine dataset, 100 CT scan image sets were extracted for 10-fold cross-validation. Experimental results show that the registration image generated by the proposed model achieves a Mean Absolute Error (MAE) of 0.04 and a mean Target Registration Error (mTRE) of 1.16 mm with respect to the X-ray image, with a single-frame time of 1.7 s. Compared with traditional optimization-based methods, the proposed model greatly shortens registration time; compared with learning-based methods, it ensures high registration accuracy with fast registration. Therefore, the proposed model can meet the requirements of intraoperative real-time high-precision registration.
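The regression idea can be sketched as follows: a convolutional encoder compresses the intraoperative X-ray into a latent code, from which a small head regresses six pose parameters (3 rotations, 3 translations) that a reprojection module would turn into the registration image. Layer sizes and the 6-vector pose parameterization are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        self.pose_head = nn.Linear(latent_dim, 6)   # decode pose from the latent

    def forward(self, xray):
        z = self.encoder(xray)                      # hidden-space code
        return self.pose_head(z)                    # (rx, ry, rz, tx, ty, tz)

pose = PoseRegressor()(torch.randn(2, 1, 256, 256))
print(pose.shape)  # torch.Size([2, 6]); a DRR reprojector would consume this
```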
Aiming at the problem of low accuracy of existing cross-view image matching algorithms, an Unmanned Aerial Vehicle (UAV) image localization method based on Multi-view and Multi-supervision Network (MMNet) was proposed. Firstly, the satellite and UAV perspectives were integrated in the proposed method: global and local features were learned under a unified network architecture, then a classification network was trained and metric tasks were performed in a multi-supervision way. Specifically, MMNet mainly used the Reweighted Regularization Triplet loss (RRT) to learn global features; in this loss, reweighting and distance regularization strategies were used to solve the problems of imbalanced multi-view samples and structural disorder of the feature space. Simultaneously, in order to attend to the context information of the central building in the target location, local features were obtained by MMNet via square-ring cutting. After that, the cross-entropy loss and RRT were used to perform the classification and metric tasks respectively. Finally, the global and local features were aggregated using a weighted strategy to represent target location images. MMNet achieved Recall@1 (R@1) of 83.97% and Average Precision (AP) of 86.96% in UAV localization tasks on the currently popular UAV dataset University-1652. Experimental results show that MMNet significantly improves the accuracy of cross-view image matching, thereby enhancing the practicability of UAV image localization, compared with LCM (cross-view Matching based on Location Classification), SFPN (Salient Feature Partition Network) and other methods.
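The global-feature objective can be approximated in code. Below is a heavily hedged PyTorch sketch of a reweighted triplet loss in the general spirit of RRT, where harder positives and negatives receive larger softmax weights and a margin acts as the distance regularizer; the exact weighting and regularization of MMNet are not reproduced here.

```python
import torch
import torch.nn.functional as F

def reweighted_triplet(anchor, pos, neg, margin=0.3):
    d_pos = F.pairwise_distance(anchor, pos)   # (B,) anchor-positive distances
    d_neg = F.pairwise_distance(anchor, neg)   # (B,) anchor-negative distances
    w_pos = torch.softmax(d_pos, dim=0)        # emphasize far (hard) positives
    w_neg = torch.softmax(-d_neg, dim=0)       # emphasize near (hard) negatives
    return F.relu((w_pos * d_pos).sum() - (w_neg * d_neg).sum() + margin)

a, p, n = (torch.randn(8, 256) for _ in range(3))  # toy embedding batch
print(reweighted_triplet(a, p, n))
```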
Aiming at problems of remote sensing images such as small object size, arbitrary object orientation, and complex backgrounds, an algorithm incorporating geometric adaptation and global perception was proposed on the basis of the YOLOv5 (You Only Look Once version 5) algorithm. Firstly, deformable convolutions and adaptive spatial attention modules were stacked alternately in series through dense connections, constructing a Dense Context-Aware Module (DenseCAM) that models local geometric features while taking full advantage of different levels of semantic and location information. Secondly, by introducing Transformer at the end of the backbone network, the global perception ability of the model was enhanced at low cost and the relationships between objects and scene content were modeled. On the UCAS-AOD and RSOD datasets, compared with the YOLOv5s6 algorithm, the proposed algorithm has the mean Average Precision (mAP) increased by 1.8 and 1.5 percentage points respectively. Experimental results show that the proposed algorithm can effectively improve the precision of object detection in remote sensing images.
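One DenseCAM-style stage might look like the following PyTorch sketch, where a deformable convolution (with offsets predicted from the input) is followed by a simple spatial-attention gate and the result is densely concatenated with the input; the attention design and layer sizes are assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))   # gate by spatial saliency

class DenseCAMStage(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)  # xy offsets
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)
        self.attn = SpatialAttention()

    def forward(self, x):
        geo = self.dcn(x, self.offset(x))   # geometry-adaptive sampling
        return torch.cat([x, self.attn(geo)], dim=1)   # dense connection

x = torch.randn(1, 32, 52, 52)
print(DenseCAMStage(32)(x).shape)           # torch.Size([1, 64, 52, 52])
```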
Aiming at the low accuracy and poor robustness of traditional point cloud registration algorithms, and the resulting inability to deliver accurate radiotherapy to cancer patients before and after treatment, an Attention Dynamic Graph Convolutional Neural Network Lucas-Kanade (ADGCNNLK) algorithm was proposed. Firstly, a residual attention mechanism was added to the Dynamic Graph Convolutional Neural Network (DGCNN) to effectively utilize the spatial information of the point cloud and reduce information loss. Then, the DGCNN with residual attention was used to extract point cloud features; this process not only captured local geometric features of the point cloud while maintaining permutation invariance, but also aggregated information semantically, thereby improving registration efficiency. Finally, the extracted feature points were mapped to a high-dimensional space, and the classic image iterative registration algorithm LK (Lucas-Kanade) was used to register the nodes. Experimental results show that compared with Iterative Closest Point (ICP), Globally optimal ICP (Go-ICP) and PointNetLK, the proposed algorithm has the best registration effect with or without noise. In the noise-free case, compared with PointNetLK, the proposed algorithm has the rotation mean squared error reduced by 74.61% and the translation mean squared error reduced by 47.50%; in the noisy case, compared with PointNetLK, the rotation mean squared error is reduced by 73.13% and the translation mean squared error by 44.18%, indicating that the proposed algorithm is more robust than PointNetLK. The proposed algorithm was also applied to the registration of human point cloud models of cancer patients before and after radiotherapy, assisting doctors in treatment and enabling precise radiotherapy.
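At the core of the DGCNN feature extractor is the EdgeConv operation, sketched below in PyTorch: for every point, a k-nearest-neighbor graph is rebuilt in feature space and edge features (x_j - x_i, x_i) are aggregated by a max over neighbors, which is what gives permutation invariance. The residual attention wrapper of ADGCNNLK is omitted, and k and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

def knn(x, k):                          # x: (B, N, C)
    dist = torch.cdist(x, x)            # pairwise distances
    return dist.topk(k + 1, largest=False).indices[:, :, 1:]   # drop self

class EdgeConv(nn.Module):
    def __init__(self, in_c, out_c, k=8):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_c, out_c), nn.ReLU())

    def forward(self, x):               # (B, N, C)
        B, N, C = x.shape
        idx = knn(x, self.k)            # (B, N, k) neighbor indices
        nbrs = torch.gather(
            x.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))
        center = x.unsqueeze(2).expand_as(nbrs)
        edge = torch.cat([nbrs - center, center], dim=-1)   # edge features
        return self.mlp(edge).amax(dim=2)                   # max over neighbors

pts = torch.randn(2, 128, 3)
print(EdgeConv(3, 64)(pts).shape)       # torch.Size([2, 128, 64])
```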
Visual object tracking is one of the important tasks in computer vision. In order to achieve high-performance object tracking, a large number of object tracking methods have been proposed in recent years. Among them, Transformer-based object tracking methods have become a hot topic in the field of visual object tracking due to their ability to perform global modeling and capture contextual information. Firstly, existing Transformer-based visual object tracking methods were classified by their network structures, an overview of the underlying principles and key techniques for model improvement was given, and the advantages and disadvantages of different network structures were summarized. Then, the experimental results of Transformer-based visual object tracking methods on public datasets were compared to analyze the impact of network structure on performance; among them, MixViT-L (ConvMAE) achieved tracking success rates of 73.3% and 86.1% on LaSOT and TrackingNet respectively, showing that object tracking methods based on a pure-Transformer two-stage architecture have better performance and broader development prospects. Finally, the limitations of these methods, such as complex network structures, large numbers of parameters, high training requirements, and difficulty of deployment on edge devices, were summarized, and future research directions were discussed: by combining model compression, self-supervised learning, and Transformer interpretability analysis, more feasible solutions for Transformer-based visual object tracking can be developed.
Aiming at the problem that existing Handwritten Mathematical Expression Recognition (HMER) methods reduce image resolution and lose feature information after multiple pooling operations in the Convolutional Neural Network (CNN), leading to parsing errors, an encoder-decoder model for HMER based on attention mechanism was proposed. Firstly, a Densely connected convolutional Network (DenseNet) was used as the encoder, so that dense connections were used to enhance feature extraction, promote gradient propagation, and alleviate gradient vanishing. Secondly, a Gated Recurrent Unit (GRU) was used as the decoder, with an attention mechanism introduced, so that attention was allocated to different regions of the image to realize accurate symbol recognition and structural analysis. Finally, handwritten mathematical expression images were encoded, and the encoding results were decoded into LaTeX sequences. Experimental results on the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) dataset show that the proposed model achieves a recognition rate of 40.39%, and, within allowable error ranges of three levels, recognition rates of 52.74%, 58.82% and 62.98% respectively. Compared with the Bidirectional Long Short-Term Memory (BLSTM) network model, the proposed model increases the recognition rate by 3.17 percentage points, and within the allowable error ranges of three levels, by 8.52, 11.56 and 12.78 percentage points respectively. It can be seen that the proposed model can accurately parse handwritten mathematical expression images, generate LaTeX sequences, and improve the recognition rate.
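One decoding step of the attention-equipped GRU can be sketched as follows: the previous hidden state is scored against every position of the flattened DenseNet feature map, the softmax weights pool the map into a context vector, and the GRU cell then predicts the next LaTeX symbol. All dimensions and the additive scoring form are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnGRUDecoderStep(nn.Module):
    def __init__(self, feat_c=256, hid=128, vocab=120):
        super().__init__()
        self.score = nn.Linear(feat_c + hid, 1)
        self.gru = nn.GRUCell(feat_c, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, feats, h):        # feats: (B, L, C) flattened feature map
        B, L, _ = feats.shape
        e = self.score(torch.cat([feats, h.unsqueeze(1).expand(B, L, -1)], -1))
        alpha = torch.softmax(e, dim=1)             # where to look in the image
        context = (alpha * feats).sum(dim=1)        # (B, C) attended context
        h = self.gru(context, h)
        return self.out(h), h                       # logits over LaTeX symbols

feats = torch.randn(2, 14 * 14, 256)                # encoder output, flattened
h = torch.zeros(2, 128)
logits, h = AttnGRUDecoderStep()(feats, h)
print(logits.shape)                                 # torch.Size([2, 120])
```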
In order to generate more accurate and smooth virtual human animation, a Kinect device was used to capture 3D human body pose data, while a monocular 3D human pose estimation algorithm simultaneously inferred skeleton points from the Kinect's color stream, thereby optimizing the human pose estimation in real time and driving the virtual character model to generate animation. Firstly, a spatio-temporal optimization method for skeleton point data processing was proposed to improve the stability of monocular 3D human pose estimation. Secondly, a human pose estimation method based on the fusion of Kinect and the Occlusion-Robust Pose-Maps (ORPM) algorithm was proposed to solve the occlusion problem of Kinect. Finally, a virtual human animation system based on quaternion vector interpolation and inverse kinematics constraints was developed, capable of motion simulation and real-time animation generation. Compared with the animation generation method that only uses Kinect to capture human motion, the proposed method produces more robust human body estimation data and has a certain anti-occlusion ability. The animation frame rate of this method is twice that of the ORPM-based animation generation method, so the animation generated by the proposed method is more realistic and smooth.
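Quaternion interpolation between captured poses is typically done with spherical linear interpolation (slerp), sketched below in NumPy: it moves along the unit sphere at constant angular speed, avoiding the distortion of naively lerping joint rotations. The (w, x, y, z) component order is an assumption.

```python
import numpy as np

def slerp(q0, q1, t):
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)        # angle between the two unit quaternions
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

q_a = np.array([1.0, 0.0, 0.0, 0.0])                              # identity
q_b = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4), 0.0, 0.0])  # 90 deg about x
print(slerp(q_a, q_b, 0.5))                                       # halfway rotation
```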
Infrared small targets occupy few pixels and lack features such as color, texture and shape, making them difficult to track effectively. To solve this problem, an infrared small target tracking method based on state information was proposed. Firstly, the target, background and distractors in the local area of the small target to be detected were encoded to obtain dense local state information between consecutive frames. Secondly, the feature information of the current and previous frames was input into the classifier to obtain the classification score. Thirdly, the state information and the classification score were fused to obtain the final confidence and determine the center position of the small target to be detected. Finally, the state information was updated and propagated between consecutive frames, and the propagated state information was used to track the infrared small target throughout the entire sequence. The proposed method was validated on the open dataset DIRST (Dataset for Infrared detection and tRacking of dim-Small aircrafT). Experimental results show that for infrared small target tracking, the recall of the proposed method reaches 96.2% and its precision reaches 97.3%, each 3.7% higher than those of the current best tracking method, KeepTrack. This proves that the proposed method can effectively track infrared small targets under complex backgrounds and interference.
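The fusion step can be illustrated with a hedged NumPy sketch: a per-pixel state map propagated from previous frames is combined with the classifier's response map, and the peak of the fused map gives the target center. The linear fusion rule and its weight are illustrative assumptions; the paper's actual fusion may differ.

```python
import numpy as np

def fuse_and_locate(class_score, state_map, w=0.5):
    """class_score, state_map: (H, W) maps in [0, 1]."""
    confidence = (1 - w) * class_score + w * state_map   # fused confidence map
    cy, cx = np.unravel_index(np.argmax(confidence), confidence.shape)
    return (cx, cy), confidence                          # peak = target center

score = np.random.rand(64, 64)       # classifier response for the current frame
state = np.random.rand(64, 64)       # encodes target/background/distractor state
center, conf = fuse_and_locate(score, state)
print(center)
```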