CCF Bigdata 2021

User incentive based bike‑sharing dispatching strategy

Bing SHI, Xizi HUANG, Zhaoxiang SONG, Jianqiao XU

Journal of Computer Applications 2022, 42 (11): 3395-3403. DOI: 10.11772/j.issn.1001-9081.2021122109

Abstract （454）

HTML （19）

PDF （2192KB）（258）

To address the bike-sharing dispatching problem, a dispatching strategy with user incentives was proposed to improve the long-term user service rate of the bike-sharing platform, taking into account budget constraints, users' maximum walking distance, users' temporal and spatial demands, and dynamic changes in the distribution of shared bicycles. The dispatching strategy consists of a task generation algorithm, a budget allocation algorithm and a task allocation algorithm. In the task generation algorithm, a Long Short-Term Memory (LSTM) network was used to predict users' future bike demand; in the budget allocation algorithm, the Deep Deterministic Policy Gradient (DDPG) algorithm was used to design a budget allocation strategy; once the budget had been allocated to the tasks, the tasks had to be assigned to users for execution, so a greedy matching strategy was used for task allocation. Experiments were carried out on the Mobike dataset to compare the proposed strategy with the dispatching strategy with unlimited budget (that is, the platform is not limited by budget and can spend any amount to encourage users to ride to the target area), the greedy dispatching strategy, the dispatching strategy with truck hauling, and the situation without dispatching. Experimental results show that the proposed dispatching strategy with user incentives can effectively improve the service rate of the bike-sharing system compared with the greedy dispatching strategy and the dispatching strategy with truck hauling.
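
The abstract does not specify the greedy matching rule, so the following is only a minimal sketch of one plausible form of the task allocation step. The task and user tuples, the "highest budget first, nearest user wins" rule, and the names `greedy_allocate` and `max_walk` are all assumptions made for illustration, not the authors' algorithm.

```python
from math import hypot

def greedy_allocate(tasks, users, max_walk):
    """Greedily match incentive tasks to users (illustrative rule, not the paper's).

    tasks: list of (task_id, (x, y), budget) where (x, y) is the bike location
    users: list of (user_id, (x, y))
    max_walk: maximum distance a user is willing to walk
    Returns a dict mapping task_id -> user_id.
    """
    assignment = {}
    free_users = dict(users)
    # Assumed priority rule: serve the tasks with the largest budget first.
    for task_id, (tx, ty), _budget in sorted(tasks, key=lambda t: -t[2]):
        # Candidates are unassigned users within walking distance of the task.
        candidates = [
            (hypot(ux - tx, uy - ty), uid)
            for uid, (ux, uy) in free_users.items()
            if hypot(ux - tx, uy - ty) <= max_walk
        ]
        if not candidates:
            continue  # nobody can reach this task; leave it unassigned
        _, chosen = min(candidates)  # nearest eligible user
        assignment[task_id] = chosen
        del free_users[chosen]
    return assignment

tasks = [("t1", (0.0, 0.0), 5.0), ("t2", (10.0, 0.0), 3.0), ("t3", (50.0, 50.0), 9.0)]
users = [("u1", (1.0, 0.0)), ("u2", (9.0, 0.0)), ("u3", (0.0, 2.0))]
print(greedy_allocate(tasks, users, max_walk=3.0))
```

In this toy instance task `t3` has the largest budget but stays unassigned because no user is within walking distance, which is the kind of constraint interaction the abstract describes.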

Popularity prediction method of Twitter topics based on evolution patterns

Weifan XIE, Yan GUO, Guangsheng KUANG, Zhihua YU, Yuanhai XUE, Huawei SHEN

Journal of Computer Applications 2022, 42 (11): 3364-3370. DOI: 10.11772/j.issn.1001-9081.2022010045

Abstract （448）

HTML （13）

PDF （934KB）（233）

To address the problem that previous popularity prediction methods ignore both the differences between evolution patterns and the time-effectiveness of prediction, a popularity prediction method for Twitter topics based on evolution patterns was proposed. Firstly, the K-SC (K-Spectral Centroid) algorithm was used to cluster the popularity sequences of a large number of historical topics, and 6 evolution patterns were obtained. Then, for each evolution pattern, a Fully Connected Network (FCN) was trained as the prediction model on the historical topic data of that pattern. Finally, to select the prediction model for a topic to be predicted, the Amplitude-Alignment Dynamic Time Warping (AADTW) algorithm was proposed to calculate the similarity between the known popularity sequence of the topic and each evolution pattern, and the prediction model of the most similar evolution pattern was selected to predict the popularity. In the task of predicting the popularity of the next 5 hours from the known popularity of the first 20 hours, the Mean Absolute Percentage Error (MAPE) of the proposed method was 58.2% and 31.0% lower than those of the Auto-Regressive Integrated Moving Average (ARIMA) method and the method using a single fully connected network, respectively. Experimental results show that the model group based on evolution patterns can predict the popularity of Twitter topics more accurately than a single model.
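
The model selection step can be pictured with classic dynamic time warping. The sketch below uses plain DTW as a stand-in, since the abstract does not give the details of the amplitude alignment in AADTW; the pattern names and sequences are made-up toy data.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def select_pattern(known, patterns):
    """Return the name of the pattern closest to the known popularity sequence."""
    return min(patterns, key=lambda name: dtw_distance(known, patterns[name]))

# Hypothetical evolution patterns (the paper obtains 6 by K-SC clustering).
patterns = {
    "early_peak": [1, 5, 9, 4, 2, 1],
    "slow_rise": [1, 2, 3, 4, 6, 8],
}
print(select_pattern([1, 1, 2, 3, 5, 7], patterns))
```

The prediction model trained for the selected pattern would then be applied to the topic; that part depends on the trained networks and is not sketched here.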

Efficient failure recovery method for stream data processing system

Yang LIU, Yangyang ZHANG, Haoyi ZHOU

Journal of Computer Applications 2022, 42 (11): 3337-3345. DOI: 10.11772/j.issn.1001-9081.2021122108

Abstract （403）

HTML （15）

PDF （2031KB）（155）

To address the issue that the stream data processing system Flink cannot handle single points of failure efficiently, a new fault-tolerant system based on incremental state and backup, Flink+, was proposed. Firstly, backup operators and data paths were established in advance. Secondly, the output data in the data flow graph was cached, with disks used if necessary. Thirdly, task state synchronization was performed during system snapshots. Finally, backup tasks and cached data were used to resume computation in case of system failure. In system experiments and tests, Flink+ does not significantly increase the fault tolerance overhead during fault-free operation; when dealing with a single point of failure in both single-machine and distributed environments, compared with Flink, the proposed system reduces the failure recovery time by 96.98% with single-machine 8-task parallelism and by 88.75% with distributed 16-task parallelism. Experimental results show that combining incremental state with backup can effectively reduce the recovery time from a single point of failure in a stream system and enhance the robustness of the system.

Process tracking multi‑task rumor verification model combined with stance

Bin ZHANG, Li WANG, Yanjie YANG

Journal of Computer Applications 2022, 42 (11): 3371-3378. DOI: 10.11772/j.issn.1001-9081.2021122148

Abstract （294）

HTML （9）

PDF （1420KB）（102）

At present, social media platforms have become the main channels through which people publish and obtain information, but the ease of publishing may lead to the rapid spread of rumors, so verifying whether information is a rumor and stopping the spread of rumors has become an urgent problem. Previous studies have shown that people's stance on information can help determine whether the information is a rumor. Building on this result, a Joint Stance Process Multi-Task Rumor Verification Model (JSP-MRVM) was proposed to address the problem of rumor spread. Firstly, three propagation processes of information were represented by using topology map, feature map and common Graph Convolutional Network (GCN) respectively. Then, the attention mechanism was used to obtain the stance features of the information and fuse the stance features with the tweet features. Finally, a multi-task objective function was designed to make the stance classification task better assist in verifying rumors. Experimental results show that the accuracy and Macro-F1 of the proposed model on the RumorEval dataset are 10.7 and 11.2 percentage points higher, respectively, than those of the baseline model RV-ML (Rumor Verification scheme based on Multitask Learning model), verifying that the proposed model is effective and can reduce the spread of rumors.
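
The abstract reports both accuracy and Macro-F1. As a small illustration of why the two can diverge on skewed label distributions, the sketch below computes both from scratch; the label names and the toy predictions are assumptions for illustration only and are not taken from the paper's experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# A classifier that always answers "true" on a skewed toy label set.
y_true = ["true", "true", "true", "true", "false", "unverified"]
y_pred = ["true"] * 6
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Here the majority-class predictor keeps a fairly high accuracy while its Macro-F1 is much lower, because the two minority classes each contribute an F1 of zero to the average.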

Survey on imbalanced multi‑class classification algorithms

Mengmeng LI, Yi LIU, Gengsong LI, Qibin ZHENG, Wei QIN, Xiaoguang REN

Journal of Computer Applications 2022, 42 (11): 3307-3321. DOI: 10.11772/j.issn.1001-9081.2021122060

Abstract （996）

HTML （99）

PDF （1861KB）（646）

Imbalanced data classification is an important research topic in machine learning, but most existing imbalanced data classification algorithms focus on binary classification, and there are relatively few studies on imbalanced multi-class classification. However, datasets in practical applications usually have multiple classes and an imbalanced data distribution, and the diversity of classes further increases the difficulty of imbalanced data classification, so the multi-class classification problem has become a research topic that urgently needs to be solved. The imbalanced multi-class classification algorithms proposed in recent years were reviewed. According to whether a decomposition strategy was adopted, imbalanced multi-class classification algorithms were divided into decomposition methods and ad-hoc methods. Furthermore, according to the decomposition strategy adopted, the decomposition methods were divided into two frameworks: One Vs. One (OVO) and One Vs. All (OVA). According to the technologies used, the ad-hoc methods were divided into data-level methods, algorithm-level methods, cost-sensitive methods, ensemble methods and deep network-based methods. The advantages and disadvantages of these methods and their representative algorithms were systematically described, the evaluation indicators of imbalanced multi-class classification methods were summarized, the performance of the representative methods was analyzed in depth through experiments, and the future development directions of imbalanced multi-class classification were discussed.
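
The two decomposition frameworks named above can be illustrated by listing the binary subproblems each one creates. This is a minimal sketch on a toy label vector; the function names are hypothetical, and no classifier is trained.

```python
from itertools import combinations

def ovo_subproblems(y):
    """One Vs. One: one binary problem per class pair, keeping only those two classes."""
    classes = sorted(set(y))
    return {
        (a, b): [i for i, label in enumerate(y) if label in (a, b)]
        for a, b in combinations(classes, 2)
    }

def ova_subproblems(y):
    """One Vs. All: one binary problem per class, that class (1) against the rest (0)."""
    return {c: [1 if label == c else 0 for label in y] for c in sorted(set(y))}

# Toy imbalanced labels: 6 samples of "a", 3 of "b", 1 of "c".
y = ["a"] * 6 + ["b"] * 3 + ["c"]
ovo = ovo_subproblems(y)
ova = ova_subproblems(y)
print(len(ovo), len(ova))      # k(k-1)/2 pairwise problems vs k one-vs-all problems
print(ovo[("b", "c")])         # sample indices used by the b-vs-c problem
print(sum(ova["c"]), len(y))   # positives vs total in the c-vs-rest problem
```

In the toy data the "c"-vs-rest problem has 1 positive among 10 samples, while the "b"-vs-"c" pairwise problem sees 1 against 3, which shows how the choice of decomposition changes the imbalance each binary learner faces.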

Multi‑agent reinforcement learning based on attentional message sharing

Rong ZANG, Li WANG, Tengfei SHI

Journal of Computer Applications 2022, 42 (11): 3346-3353. DOI: 10.11772/j.issn.1001-9081.2021122169

Abstract （551）

HTML （20）

PDF （1668KB）（222）

Communication is an important way to achieve effective cooperation among multiple agents in a non-omniscient environment. When there are a large number of agents, redundant messages may be generated in the communication process. To handle the communication messages effectively, a multi-agent reinforcement learning algorithm based on attentional message sharing was proposed, called AMSAC (Attentional Message Sharing multi-agent Actor-Critic). Firstly, a message sharing network was built for effective communication among agents, and information sharing was achieved through message reading and writing by the agents, thus solving the problem of lack of communication among agents in non-omniscient environments with complex tasks. Then, in the message sharing network, the communication messages were processed adaptively by the attentional message sharing mechanism, and the messages from different agents were processed in order of importance to solve the problem that large-scale multi-agent systems cannot effectively identify and utilize messages during the communication process. Moreover, in the centralized Critic network, the Native Critic was used to update the Actor network parameters according to the Temporal Difference (TD) advantage policy gradient, so that the action values of agents were evaluated effectively. Finally, during the execution period, each agent made decisions with its distributed Actor network based on its own observations and the messages from the message sharing network. Experimental results in the StarCraft Multi-Agent Challenge (SMAC) environment show that compared with Native Actor-Critic (Native AC), Game Abstraction Communication (GA-Comm) and other multi-agent reinforcement learning methods, AMSAC improves the average win rate by 4 to 32 percentage points in four different scenarios. AMSAC's attentional message sharing mechanism provides a reasonable solution for processing communication messages among agents in a multi-agent system, and has broad application prospects in both transportation hub control and unmanned aerial vehicle collaboration.

Graph convolutional network method based on hybrid feature modeling

Zhuoran LI, Zhonglin YE, Haixing ZHAO, Jingjing LIN

Journal of Computer Applications 2022, 42 (11): 3354-3363. DOI: 10.11772/j.issn.1001-9081.2021111981

Abstract （608）

HTML （15）

PDF （3410KB）（158）

Networks contain complex information, and more ways are needed to extract the useful parts of it, but the relevant characteristics of a network cannot be completely described by the existing single-feature Graph Neural Network (GNN). To resolve the above problems, a Hybrid feature-based Dual Graph Convolutional Network (HDGCN) was proposed. Firstly, the structure feature vectors and semantic feature vectors of nodes were obtained by Graph Convolutional Network (GCN). Secondly, the features of nodes were aggregated selectively by an aggregation function based on the attention mechanism or gating mechanism, so that the feature expression ability of nodes was enhanced. Finally, the hybrid feature vectors of nodes were obtained by the fusion mechanism based on a feasible dual-channel GCN, and the structure features and semantic features of nodes were modeled jointly to make the features complement each other and improve the method's performance on subsequent machine learning tasks. Verification was performed on the datasets CiteSeer, DBLP (DataBase systems and Logic Programming) and SDBLP (Simplified DataBase systems and Logic Programming). Experimental results show that compared with the graph convolutional network model based on structure feature training, the dual-channel graph convolutional network model based on hybrid feature training has the average value of Micro-F1 increased by 2.43, 2.14, 1.86 and 2.13 percentage points respectively, and the average value of Macro-F1 increased by 1.38, 0.33, 1.06 and 0.86 percentage points respectively when the training set proportion is 20%, 40%, 60% and 80%. The difference in accuracy is no more than 0.5 percentage points when using concat or mean as the fusion strategy, which shows that both concat and mean can be used as the fusion strategy. HDGCN has higher accuracy on node classification and clustering tasks than models trained on the structure or semantic network alone, and achieves the best results when the output dimension is 64, the learning rate is 0.001, the number of graph convolutional layers is 2 and the attention vector dimension is 128.

Detection of unsupervised offensive speech based on multilingual BERT

Xiayang SHI, Fengyuan ZHANG, Jiaqi YUAN, Min HUANG

Journal of Computer Applications 2022, 42 (11): 3379-3385. DOI: 10.11772/j.issn.1001-9081.2021112005

Abstract （536）

HTML （12）

PDF （1536KB）（227）

Offensive speech has a serious negative impact on social stability. Currently, automatic detection of offensive speech focuses on a few high-resource languages, and the lack of sufficient corpora tagged for offensive speech makes it difficult to detect offensive speech in low-resource languages. To solve the above problem, a cross-language unsupervised offensiveness transfer detection method was proposed. Firstly, an original model was obtained by using the multilingual BERT (multilingual Bidirectional Encoder Representation from Transformers, mBERT) model to learn the offensive features on the high-resource English dataset. Then, by analyzing the language similarity between English and Danish, Arabic, Turkish and Greek, the obtained original model was transferred to these four low-resource languages to achieve automatic detection of offensive speech in low-resource languages. Experimental results show that compared with the four methods of BERT, Linear Regression (LR), Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP), the proposed method increases both the accuracy and F1 score of detecting offensive speech in Danish, Arabic, Turkish and Greek by nearly 2 percentage points, bringing them close to those of current supervised detection. This shows that the combination of cross-language model transfer learning and transfer detection can achieve unsupervised offensiveness detection for low-resource languages.

Neural tangent kernel K‑Means clustering

Mei WANG, Xiaohui SONG, Yong LIU, Chuanhai XU

Journal of Computer Applications 2022, 42 (11): 3330-3336. DOI: 10.11772/j.issn.1001-9081.2021111961

Abstract （580）

HTML （24）

PDF （2237KB）（224）

To address the problem that the clustering results of the K-Means clustering algorithm are affected by the sample distribution because the mean is used to update the cluster centers, a Neural Tangent Kernel K-Means (NTKKM) clustering algorithm was proposed. Firstly, the data of the input space were mapped to the high-dimensional feature space through the Neural Tangent Kernel (NTK). Then, K-Means clustering was performed in the high-dimensional feature space, and the cluster centers were updated by taking into account the distances between clusters and within clusters at the same time. Finally, the clustering results were obtained. On the car and breast-tissue datasets, three evaluation indexes, namely accuracy, Adjusted Rand Index (ARI) and FM index, were measured for the NTKKM clustering algorithm and the comparison algorithms. Experimental results show that the clustering effect and stability of the NTKKM clustering algorithm are better than those of the K-Means clustering algorithm and the Gaussian kernel K-Means clustering algorithm. Compared with the traditional K-Means clustering algorithm, the NTKKM clustering algorithm has the accuracy increased by 14.9% and 9.4% respectively, the ARI increased by 9.7% and 18.0% respectively, and the FM index increased by 12.0% and 12.0% respectively, indicating the excellent clustering performance of the NTKKM clustering algorithm.
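
Clustering in a kernel-induced feature space needs only the Gram matrix, because the squared distance from a sample to a cluster mean expands into kernel entries. The sketch below shows this standard kernel K-Means step under stated assumptions: a linear kernel on toy 1-D data stands in for the NTK, and the plain mean-based assignment is used, not the paper's center update that also accounts for inter-cluster distance.

```python
def kernel_distances(K, labels, n_clusters):
    """Squared feature-space distance from every sample to every cluster mean.

    K is an n x n Gram matrix; labels[i] is the current cluster of sample i.
    Uses ||phi(x_i) - mu_c||^2 = K_ii - (2/|C|) sum_j K_ij + (1/|C|^2) sum_jl K_jl.
    """
    n = len(K)
    dist = [[float("inf")] * n_clusters for _ in range(n)]
    for c in range(n_clusters):
        members = [j for j in range(n) if labels[j] == c]
        if not members:
            continue  # empty cluster: leave its distances at infinity
        size = len(members)
        within = sum(K[j][l] for j in members for l in members) / size ** 2
        for i in range(n):
            cross = sum(K[i][j] for j in members) / size
            dist[i][c] = K[i][i] - 2 * cross + within
    return dist

def kernel_kmeans(K, labels, n_clusters, max_iter=100):
    """Reassign samples to the nearest cluster mean until labels stop changing."""
    labels = list(labels)
    for _ in range(max_iter):
        dist = kernel_distances(K, labels, n_clusters)
        new_labels = [row.index(min(row)) for row in dist]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Linear kernel on 1-D toy points (stand-in for an NTK Gram matrix).
X = [0.0, 1.0, 9.0, 10.0]
K = [[a * b for b in X] for a in X]
print(kernel_kmeans(K, [0, 1, 0, 1], n_clusters=2))
```

With the linear kernel the expansion reduces to the ordinary squared Euclidean distance to the centroid, which makes the toy result easy to check by hand; substituting another Gram matrix changes only how `K` is built.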

Deep fusion model for predicting differential gene expression by histone modification data

Xin LI, Tao JIA

Journal of Computer Applications 2022, 42 (11): 3404-3412. DOI: 10.11772/j.issn.1001-9081.2021111956

Abstract （324）

HTML （8）

PDF （1734KB）（171）

To address the problems that Cell type-Specificity (CS) and the similarity and difference information between different cell types are not properly used when predicting Differential Gene Expression (DGE) with large-scale Histone Modification (HM) data, and that the input volume is large and the computational cost is high, a deep learning-based method named dcsDiff was proposed. Firstly, multiple AutoEncoders (AEs) and Bi-directional Long Short-Term Memory (Bi-LSTM) networks were introduced to reduce the dimensionality of HM signals and model them to obtain the embedded representation. Then, multiple Convolutional Neural Networks (CNNs) were used to mine the HM combined effects in each single cell type, as well as the similarity and difference information of each HM and the joint effects of all HMs between two cell types. Finally, the two kinds of information were fused to predict DGE between two cell types. In the comparison experiments with DeepDiff on 10 pairs of cell types in the REMC (Roadmap Epigenomics Mapping Consortium) database, the Pearson Correlation Coefficient (PCC) of dcsDiff in DGE prediction was increased by 7.2% at the highest and 3.9% on average, the number of differentially expressed genes accurately detected by dcsDiff was increased by 36 at most and 17.6 on average, and the running time of dcsDiff was reduced by 78.7%. The validity of reasonable integration of the above two kinds of information was proved in the component analysis experiment. The parameters of dcsDiff were also determined by experiments. Experimental results show that the proposed dcsDiff can effectively improve the efficiency of DGE prediction.

Neural machine translation method based on source language syntax enhanced decoding

Longchao GONG, Junjun GUO, Zhengtao YU

Journal of Computer Applications 2022, 42 (11): 3386-3394. DOI: 10.11772/j.issn.1001-9081.2021111963

Abstract （425）

HTML （7）

PDF （1267KB）（172）

Transformer, one of the best existing machine translation models, is based on the standard end-to-end structure and relies only on pairs of parallel sentences, which is believed to be able to learn knowledge in the corpus automatically. However, this modeling method lacks explicit guidance and cannot effectively mine deep language knowledge, especially in the low-resource environment with limited corpus size and quality, where the sentence encoding has no prior knowledge constraints, leading to a decline in translation quality. To alleviate the issues above, a neural machine translation model based on source language syntax enhanced decoding, namely SSED (Source language Syntax Enhanced Decoding), was proposed to explicitly use the source language syntax to guide the encoding. Firstly, a syntax-aware mask mechanism based on the syntactic information of the source sentence was constructed, and an additional syntax-dependent representation was generated by guiding the encoding self-attention. Then, the syntax-dependent representation was used as a supplement to the representation of the original sentence and integrated into the decoding process by the attention mechanism, so that the two representations jointly guided the generation of the target language, realizing the enhancement by prior syntax. Experimental results on several standard IWSLT (International Conference on Spoken Language Translation) and WMT (Conference on Machine Translation) machine translation evaluation task test sets show that compared with the baseline model Transformer, the proposed method obtains BLEU score improvements of 0.84 to 3.41, achieving state-of-the-art results among syntax-related research. The fusion of syntactic information and the self-attention mechanism is effective; the use of source language syntax can guide the decoding process of the neural machine translation system and significantly improve the quality of translation.

K‑nearest neighbor imputation subspace clustering algorithm for high‑dimensional data with feature missing

Yongjian QIAO, Xiaolin LIU, Liang BAI

Journal of Computer Applications 2022, 42 (11): 3322-3329. DOI: 10.11772/j.issn.1001-9081.2021111964

Abstract （594）

HTML （32）

PDF （1207KB）（382）

When clustering high-dimensional data with missing features, two problems arise: the curse of dimensionality caused by the high dimension of the data, and the failure of distance calculation between samples caused by the missing features. To resolve these issues, a K-Nearest Neighbor (KNN) imputation subspace clustering algorithm for high-dimensional data with feature missing was proposed, namely KISC. Firstly, the nearest neighbor relationship in the subspace of the high-dimensional data with feature missing was used to perform KNN imputation on the feature missing data in the original space. Then, multiple iterations of matrix decomposition and KNN imputation were used to obtain the final reliable subspace structure of the data, and the clustering analysis was performed in the obtained subspace structure. The clustering results in the original space of six image datasets show that the KISC algorithm performs better than the comparison algorithm which clusters directly after interpolation, indicating that the subspace structure can identify the potential clustering structure of the data more easily and effectively; the clustering results in the subspace of six high-dimensional datasets show that the KISC algorithm outperforms the comparison algorithm on all datasets, and has the optimal clustering Accuracy and Normalized Mutual Information (NMI) on most of the datasets. The KISC algorithm can deal with high-dimensional data with feature missing more effectively and improve the clustering performance on these data.
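
For readers unfamiliar with the imputation step, the following is a minimal sketch of plain KNN imputation in the original space. It is not the KISC algorithm: KISC finds neighbors in a learned subspace and alternates imputation with matrix decomposition, neither of which is shown. The function name, the use of `None` for missing entries and the toy rows are assumptions for illustration.

```python
def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is the mean squared difference over the features observed in
    both rows, so rows with different numbers of shared features stay comparable.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return sum(shared) / len(shared) if shared else float("inf")

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for f, value in enumerate(row):
            if value is not None:
                continue
            # Donor rows must have the missing feature observed.
            donors = sorted(
                (distance(row, other), j)
                for j, other in enumerate(rows)
                if j != i and other[f] is not None
            )[:k]
            # Drop donors that share no observed feature with this row.
            donors = [(d, j) for d, j in donors if d != float("inf")]
            if donors:
                filled[i][f] = sum(rows[j][f] for _, j in donors) / len(donors)
    return filled

rows = [
    [1.0, 2.0, None],   # third feature is missing
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [8.0, 9.0, 20.0],   # far away, should not influence the fill
]
print(knn_impute(rows, k=2)[0])
```

The missing value is filled from the two nearby rows only; the distant fourth row is ignored, which is the behavior that breaks down when too many features are missing for distances to be reliable, the situation the abstract targets.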
