Data science and technology

Select

Point-of-interest recommendation algorithm combing dynamic and static preferences

YANG Li, WANG Shihui, ZHU Bo

Journal of Computer Applications 2021, 41 (2): 398-406. DOI: 10.11772/j.issn.1001-9081.2020050677

Abstract （633）

PDF （1355KB）（1108）

Save

Since most existing Point-Of-Interest (POI) recommendation algorithms ignore the complexity of the modeling of the fusion of user dynamic and static preferences, a POI recommendation algorithm called CLSR (Combing Long Short Recommendation) was proposed that combined complex dynamic user preferences and general static user preferences. Firstly, in the process of modeling complex dynamic preferences, a hybrid neural network was designed based on the user's check-in behaviors and the skip behaviors in check-in behaviors to achieve the modeling of complex dynamic interests of the user. Secondly, in the process of general static preference modeling, a high-level attention network was used to learn the complex interactions between the user and POIs. Thirdly, a multi-layer neural network was used to further learn and express the above dynamic preferences and static preferences. Finally, a unified POI recommendation framework was used to integrate the preferences. Experimental results on real datasets show that, compared with FPMC-LR (Factorizing Personalized Markov Chain and Localized Region), PRME (Personalized Ranking Metric Embedding), Rank-GeoFM (Ranking based Geographical Factorization Method) and TMCA (Temporal and Multi-level Context Attention), CLSR has the performance greatly improved, and compared to the optimal TMCA among the comparison methods, the proposed algorithm has the precision, recall and normalized Discounted Cumulative Gain (nDCG) increased by 5.8%, 5.1%, and 7.2% on Foursquare dataset, and 7.3%, 10.2%, and 6.3% on Gowalla dataset. It can be seen that CLSR algorithm can effectively improve the results of POI recommendation.

Reference | Related Articles | Metrics

Select

Patent text classification based on ALBERT and bidirectional gated recurrent unit

WEN Chaodong, ZENG Cheng, REN Junwei, ZHANG Yan

Journal of Computer Applications 2021, 41 (2): 407-412. DOI: 10.11772/j.issn.1001-9081.2020050730

Abstract （773）

PDF （979KB）（913）

Save

With the rapid increase in the number of patent applications, the demand for automatic classification of patent text is increasing. Most of the existing patent text classification algorithms utilize methods such as Word2vec and Global Vectors (GloVe) to obtain the word vector representation of the text, while a lot of word position information is abandoned and the complete semantics of the text cannot be expressed. In order to solve these problems, a multilevel patent text classification model named ALBERT-BiGRU was proposed by combining ALBERT (A Lite BERT) and BiGRU (Bidirectional Gated Recurrent Unit). In this model, dynamic word vector pre-trained by ALBERT was used to replace the static word vector trained by traditional methods like Word2vec, so as to improve the representation ability of the word vector. Then, the BiGRU neural network model was used for training, which preserved the semantic association between long-distance words in the patent text to the greatest extent. In the effective verification on the patent text dataset published by State Information Center, compared with Word2vec-BiGRU and GloVe-BiGRU, the accuracy of ALBERT-BiGRU was increased by 9.1 percentage points and 10.9 percentage points respectively at the department level of patent text, and was increased by 9.5 percentage points and 11.2 percentage points respectively at the big class level. Experimental results show that ALBERT-BiGRU can effectively improve the classification effect of patent texts of different levels.

Reference | Related Articles | Metrics

Select

Fixed word-aligned partition compression algorithm of inverted list based on directed acyclic graph

JIANG Kun, LIU Zheng, ZHU Lei, LI Xiaoxing

Journal of Computer Applications 2021, 41 (3): 727-732. DOI: 10.11772/j.issn.1001-9081.2020060874

Abstract （580）

PDF （905KB）（533）

Save

In Fixed Word-Aligned (FWA) inverted index compression algorithms of Web search engines, due to the "greedy" block partition strategy of the inverted list and the interleaved storage of the codeword information, it is difficult for the algorithm to achieve the optimal compression performance. To solve the above problem, an FWA partition compression algorithm based on Directed Acyclic Graph (DAG) was proposed. Firstly, considering the small integer information in the inverted list brought by the clustering characteristics of Web pages, a novel FWA compression format with data area of 64-bit blocks was designed. In this compression format, the data area was divided into 16 storage modes suitable for continuous small integer compression through 4-bit selector area, and the selector area and data area in each block of the inverted list were stored separately, so as to ensure good batch decompression performance. Secondly, a DAG described Word-Aligned Partition (WAP) algorithm was proposed on the basis of the new compression format. In the algorithm, the inverted list block partitioning problem was regarded as a Single-Source Shortest-Path (SSSP) problem by DAG, and by considering the constraints of various storage modes of data area in FWA compression format, the structure and recursive definition of the SSSP problem were determined. Thirdly, the dynamic programming technique was used to solve the problem of SSSP and generate the pseudo-code and algorithm complexity of the optimal partition vector. The original storage modes of traditional FWA algorithms such as S9, S16 and S8b were partitioned and optimized based on DAG, and the computational complexities of the algorithms before and after optimization were compared and analyzed. Finally, the compression performance experiments were conducted with simulation integer sequence data and Text REtrieval Conference (TREC) GOV2 Web page index data. Experimental results show that, compared with the traditional FWA algorithms, the DAG based FWA partition algorithm can improve the compression ratio and decompression speed by batch decompression and partition optimization technology. At the same time, it can obtain a higher compression ratio than the traditional Frame Of Reference (FOR) algorithms for the compression of continuous small integer sequence.

Reference | Related Articles | Metrics

Select

Method of dynamically constructing spatial topic R-tree based on k-means++

ZOU Zhiwen, QIN Cheng

Journal of Computer Applications 2021, 41 (3): 733-737. DOI: 10.11772/j.issn.1001-9081.2020060851

Abstract （461）

PDF （769KB）（563）

Save

The existing R-tree spatial clustering technology usually randomly designates or calculates the Euclidean distance between spatial data to select the cluster centers, without considering the topic relevance between spatial data, so that the clustering result is influenced by the initial value of k, and the association between spatial data is only based on geographic location. Aiming at this situation, a method of dynamically constructing spatial Topic R-tree (TR-tree) based on k-means++ was proposed. Firstly, in the traditional k-means++ algorithm, k clusters were dynamically determined by the clustering measure function, and Latent Dirichlet Allocation (LDA) model was introduced into the clustering measure function to calculate the topic probability of each spatial data text, as a result, the topic relevance between spatial data was strengthened. Secondly, the cluster center with the highest probability was selected through the topic probabilities. Finally, the TR-tree was constructed, and the spatial data were dynamically allocated during the construction. Experimental results show that with a slight increase of the R-tree construction time, this method has the indexing efficiency and correlation between nodes significantly improved compared to the algorithm of constructing R-tree based only on geographic location clustering.

Reference | Related Articles | Metrics

Select

Comparative density peaks clustering algorithm with automatic determination of clustering center

GUO Jia, HAN Litao, SUN Xianlong, ZHOU Lijuan

Journal of Computer Applications 2021, 41 (3): 738-744. DOI: 10.11772/j.issn.1001-9081.2020071071

Abstract （668）

PDF （2809KB）（651）

Save

In order to solve the problem that the clustering centers cannot be determined automatically by Density Peaks Clustering (DPC) algorithm, and the clustering center points and the non-clustering center points are not obvious enough in the decision graph, Comparative density Peaks Clustering algorithm with Automatic determination of clustering center (ACPC) was designed. Firstly, the distance parameter was replaced by the distance comparison quantity, so that the potential clustering centers were more obvious in the decision graph. Then, the 2D interval estimation method was used to perform the automatic selection of clustering centers, so as to realize the automation of clustering process. Experimental results show that the ACPC algorithm has better clustering effect on four synthetic datasets; and the comparison of the Accuracy indicator on real datasets shows that on the dataset Iris, the clustering accuracy of ACPC can reach 94%, which is 27.3% higher than that of the traditional DPC algorithm, and the problem of selecting clustering centers interactively is solved by ACPC.

Reference | Related Articles | Metrics

Select

Survey on online hashing algorithm

GUO Yicun, CHEN Huahui

Journal of Computer Applications 2021, 41 (4): 1106-1112. DOI: 10.11772/j.issn.1001-9081.2020071047

Abstract （910）

PDF （1188KB）（1266）

Save

In the current large-scale data retrieval tasks, learning to hash methods can learn compact binary codes, which saves storage space and can quickly calculate the similarity in Hamming space. Therefore, for approximate nearest neighbor search, hashing methods are often used to improve the mechanism of fast nearest neighbor search. In most current hashing methods, the offline learning models are used for batch training, which cannot adapt to possible data changes appeared in the environment of large-scale streaming data, resulting in reduction of retrieval efficiency. Therefore, the adaptive hash functions were proposed and learnt in online hashing methods, which realize the continuous learning in the process of inputting data and make the methods can be applied to similarity retrieval in real-time. Firstly, the basic principles of learning to hash and the inherent requirements to realize online hashing were explained. Secondly, the different learning methods of online hashing were introduced from the perspectives such as the reading method, learning mode, and model update method of streaming data under online conditions. Thirdly, the online learning algorithms were further divided into six categories, that is, categories based on passive-aggressive algorithms, matrix factorization technology, unsupervised clustering, similarity supervision, mutual information measurement, codebook supervision respectively. And the advantages, disadvantages and characteristics of these algorithms were analyzed. Finally, the development directions of online hashing were summarized and discussed.

Reference | Related Articles | Metrics

Select

Improved wavelet clustering algorithm based on peak grid

LONG Chaoqi, JIANG Yu, XIE Yu

Journal of Computer Applications 2021, 41 (4): 1122-1127. DOI: 10.11772/j.issn.1001-9081.2020071042

Abstract （479）

PDF （1096KB）（694）

Save

Aiming at the difference between the clustering effects of wavelet clustering algorithm under different grid division scales, an improved method based on peak grid was proposed. The algorithm mainly aimed at improving the detection method of connected regions in wavelet clustering. First, the spatial grids after wavelet transform were sorted according to the grid values; then, the breadth-first-search method was used to traverse each spatial grid to detect the peak connected regions in the data after wavelet transform; finally, the connected regions were marked and mapped to the original data space to obtain the clustering result. Experimental results of 8 synthetic datasets(4 convex datasets and 4 non-convex datasets) and 2 real datasets in the UCI database showed that the improved algorithm had good performance at low grid division scales, and compared with the original wavelet clustering algorithm, this algorithm had the requirement for grid division scale reduced by 25% to 60%, and the clustering time reduced by 14% under the same clustering effect.

Reference | Related Articles | Metrics

Select

Reliability analysis models for replication-based storage systems with proactive fault tolerance

LI Jing, LUO Jinfei, LI Bingchao

Journal of Computer Applications 2021, 41 (4): 1113-1121. DOI: 10.11772/j.issn.1001-9081.2020071067

Abstract （362）

PDF （1396KB）（521）

Save

Proactive fault tolerance mechanism, which predicts disk failures and prompts the system to perform migration and backup for the data in danger in advance, can be used to enhance the storage system reliability. In view of the problem that the reliability of the replication-based storage systems with proactive fault tolerance cannot be evaluated by the existing research accurately, several state transition models were proposed for replication-based storage systems; then the models were implemented based on Monte Carlo simulation, so as to simulate the running of the replication-based storage systems with proactive fault tolerance; at last, the expected number of data-loss events during a period in the systems was counted. The Weibull distribution function was used to model the time distribution of device failure and failure repair events, and the impact of proactive fault tolerance mechanism, node failures, node failure repairs, disk failures and disk failure repairs on the system reliability were evaluated quantitatively. Experimental results showed that when the accuracy of the prediction model reached 50%, the reliability of the systems were able to be improved by 1-3 times, and compared with 2-way replication systems, 3-way replication systems were more sensitive to system parameters. By using the proposed models, system administrators can easily assess system reliability under different fault tolerance schemes and system parameters, and then build storage systems with high reliability and high availability.

Reference | Related Articles | Metrics

Select

PageRank-based talent mining algorithm based on Web of Science

LI Chong, WANG Yuchen, DU Weijing, HE Xiaotao, LIU Xuemin, ZHANG Shibo, LI Shuren

Journal of Computer Applications 2021, 41 (5): 1356-1360. DOI: 10.11772/j.issn.1001-9081.2020081206

Abstract （408）

PDF （775KB）（501）

Save

The high-level paper is one of the symbolic achievements of excellent scientific talents. Focusing on the "Web of Science (WOS)" hot research disciplines, on the basis of constructing the Neo4j semantic network graph of academic papers and mining active scientific research communities, the PageRank-based talent mining algorithm was used to realize the mining of outstanding scientific research talents in the scientific research communities. Firstly, the existing talent mining algorithms were studied and analyzed in detail. Secondly, combined with the WOS data, the PageRank-based talent mining algorithm was optimized and implemented by adding consideration factors such as the paper publication time factor, the author's order descending model, the influence of surrounding author nodes on this node, the number of citations of the paper. Finally, experiments and verifications were carried out based on the paper data of the communities of the hot discipline computer science in the past five years. The results show that community-based mining is more targeted, and can quickly find representative excellent and potential talents in various disciplines, and the improved algorithm is more effective and objective.

Reference | Related Articles | Metrics

Select

Patent clustering method based on functional effect

MA Jianhong, CAO Wenbin, LIU Yuangang, XIA Shuang

Journal of Computer Applications 2021, 41 (5): 1361-1366. DOI: 10.11772/j.issn.1001-9081.2020081203

Abstract （391）

PDF （916KB）（596）

Save

At present, patents are divided according to their domains, and cross-domain patent clustering can be realized based on the functional effect, which is of great significance in enterprise innovation design. Accurate extraction of patent functional effect and fast acquisition of optimal clustering results are the key tasks in it. Therefore, a Functional Effect Information-Joint (FEI-Joint) model combining Enhanced Language Representation with Informative Entities (ERNIE) and Convolutional Neural Network (CNN) was proposed to extract the functional effects of patent documents, and the Self-Organizing Map (SOM) algorithm was improved, so as to propose an Early Reject based Class Merge Self-Organizing Map (ERCM-SOM) to realize the patent clustering based on functional effect. FEI-Joint model was compared with Term-Frequency-Inverse-Document-Frequency (TF-IDF), Latent Dirichlet Allocation (LDA) and CNN in the clustering effect after feature extraction, and the results show that the F-measure value of the proposed model was obviously improved than those of other models. Compared with K-Means algorithm and SOM algorithm, ERCM-SOM algorithm has higher F-measure value while has significantly shorter time than that of SOM algorithm. Compared with the patent classification using International Patent Classification (IPC), the clustering method based on functional effect can achieve cross-domain patent clustering effect, which lays a foundation for designers to learn from design methods in other domains.

Reference | Related Articles | Metrics

Select

Hybrid recommendation model based on heterogeneous information network

LIN Yixing, TANG Hua

Journal of Computer Applications 2021, 41 (5): 1348-1355. DOI: 10.11772/j.issn.1001-9081.2020081340

Abstract （521）

PDF （1265KB）（653）

Save

The current personalized recommendation platform has the characteristics of a wide range of data sources and many data types. With the data sparsity of the platform as an important reason for affecting the performance of the recommendation system, there are many challenges faced by the recommendation system:how to mine structured data and unstructured data of the platform to discover more features, improve the accuracy of recommendations in data-sparse scenarios, alleviate the cold start problem, and make recommendations interpretable. Therefore, for the personalized scenario of recommending Items for Users, the Heterogeneous Information Network (HIN) was used to build the association relationships between objects in the recommendation platform, and the Meta-Graph was used to describe the association paths between objects and calculate the User-Item similarity matrices under different paths; the FunkSVD matrix decomposition algorithm was adopted to calculate the implicit features of Users and Items, and for the unstructured data with text as an example, the Convolutional Neural Network (CNN) technology was used to mine the text features of the data; after splicing the features obtained by the two methods, a Factorization Machine (FM) incorporating historical average scores of Users and Items was used to predict Users' scores for Items. In the experiment, based on the public dataset Yelp, the proposed hybrid recommendation model, the single recommendation model based on Meta-Graph, the FM Recommendation model (FMR) and the FunkSVD based recommendation model were established and trained. Experimental results show that the proposed hybrid recommendation model has good validity and interpretability, and compared with the comparison models, the recommendation accuracy of this model has been greatly improved.

Reference | Related Articles | Metrics

Select

Time series clustering based on new robust similarity measure

LI Guorong, YE Jimin, ZHEN Yuanting

Journal of Computer Applications 2021, 41 (5): 1343-1347. DOI: 10.11772/j.issn.1001-9081.2020071142

Abstract （450）

PDF （683KB）（436）

Save

For time series data with outliers, a Robust Generalized Cross-Correlation measure (RGCC) between time series based on robust estimation of correlation coefficient was proposed. First, a robust correlation coefficient was introduced to replace Pearson correlation coefficient to calculate the covariance matrix between time series data. Second, the determinant of the new covariance matrix was used to construct a similarity measure between two time series named RGCC. Finally, the distance matrix between the time series was calculated based on this measure, and the matrix was used as the input of the clustering algorithm to cluster the data. Time series clustering simulation experiments showed that for time series data with outliers, the clustering results based on RGCC were obviously closer to the real ones compared to the clustering results based on the original Generalized Cross-Correlation measure (GCC). It can be seen that the proposed new robust similarity measure is fully applicable to time series data with outliers.

Reference | Related Articles | Metrics

Select

Adaptive affinity propagation clustering algorithm based on universal gravitation

WANG Zhihe, CHANG Xiaoqing, DU Hui

Journal of Computer Applications 2021, 41 (5): 1337-1342. DOI: 10.11772/j.issn.1001-9081.2020071130

Abstract （477）

PDF （1267KB）（521）

Save

Focused on the problem that Affinity Propagation (AP) clustering algorithm is sensitive to parameter Preference, which is not suitable for sparse data, and has the incorrectly clustered sample points in the clustering results, an algorithm named Adaptive Affinity Propagation clustering based on universal gravitation (GA-AP) was proposed. Firstly, the gravitational search mechanism was introduced into the traditional AP algorithm in order to perform the global optimization to the sample points. Secondly, on the basis of global optimization, the correctly clustered and incorrectly clustered sample points in each cluster were found through the information entropy and Adaptive Boosting (AdaBoost) algorithm, the weights of the sample points were calculated. Each sample point was updated by the corresponding weight, so that the similarity, Preference value, attractiveness and membership degree were updated, and the re-clustering was performed. The above steps were continuously operated until the maximum number of iterations was reached. Through simulation experiments on nine datasets, it can be seen that compared to Affinity Propagation clustering based on Adaptive Attribute Weighting (AFW_AP) algorithm, AP algorithm, K-means clustering (K-means) algorithm and Fuzzy C-Means (FCM) algorithm, the proposed algorithm has the average values of Purity, F-measure and Accuracy (ACC) increased by 0.69, 71.74% and 98.5% respectively at most. Experimental results show that the proposed algorithm reduces the dependence on Preference and improves the clustering effect, especially the accuracy of clustering results for sparse datasets.

Reference | Related Articles | Metrics

Select

Bichromatic reverse k nearest neighbor query method based on distance-keyword similarity constraint

ZHANG Hao, ZHU Rui, SONG Fuyao, FANG Peng, XIA Xiufeng

Journal of Computer Applications 2021, 41 (6): 1686-1693. DOI: 10.11772/j.issn.1001-9081.2020091453

Abstract （428）

PDF （1025KB）（406）

Save

In order to solve the problem of low quality of results returned by spatial keyword bichromatic reverse k nearest neighbor query, a bichromatic reverse k nearest neighbor query method based on distance-keyword similarity constraint was proposed. Firstly, a threshold was set to filter out the low-quality users in the query results, so that the existence of users with relatively long spatial distance in the query results was avoided and the quality of the query results was ensured. Then, in order to support this query, an index of Keyword Multiresolution Grid rectangle-tree (KMG-tree) was proposed to manage the data. Finally, the Six-region-optimize algorithm based on Six-region algorithm was proposed to improve the query processing efficiency. The query efficiency of the Six-region-optimize algorithm was about 85.71% and 23.45% on average higher than those of the baseline and Six-region algorithms respectively. Experimental test and analysis were carried out based on real spatio-temporal data. The experimental results verify the effectiveness and high efficiency of the Six-region-optimize algorithm.

Reference | Related Articles | Metrics

Select

Extended isolation forest algorithm based on random subspace

XIE Yu, JIANG Yu, LONG Chaoqi

Journal of Computer Applications 2021, 41 (6): 1679-1685. DOI: 10.11772/j.issn.1001-9081.2020091436

Abstract （545）

PDF （1335KB）（570）

Save

Aiming at the problem of excessive time overhead of the Extended Isolation Forest (EIF) algorithm, a new algorithm named Extended Isolation Forest based on Random Subspace (RS-EIF) was proposed. Firstly, multiple random subspaces were determined in the original data space. Then, in each random subspace, the extended isolated tree was constructed by calculating the intercept vector and slope of each node, and multiple extended isolated trees were integrated into a subspace extended isolation forest. Finally, the average traversal depth of data point in the extended isolation forest was calculated to determine whether the data point was abnormal. Experimental results on 9 real datasets in Outliter Detection DataSet (ODDS) and 7 synthetic datasets with multivariate distribution show that, the RS-EIF algorithm is sensitive to local anomalies and reduces the time overhead by about 60% compared with the EIF algorithm; on the ODDS datasets with many samples, its recognition accuracy is 2 percentage points to 12 percentage points higher than those of the isolation Forest (iForest) algorithm, Lightweight On-line Detection of Anomalies (LODA) algorithm and COPula-based Outlier Detection (COPOD) algorithm. The RS-EIF algorithm has the higher recognition efficiency in the dataset with a large number of samples.

Reference | Related Articles | Metrics

Select

Parameter independent weighted local mean-based pseudo nearest neighbor classification algorithm

CAI Ruiguang, ZHANG Desheng, XIAO Yanting

Journal of Computer Applications 2021, 41 (6): 1694-1700. DOI: 10.11772/j.issn.1001-9081.2020091370

Abstract （383）

PDF （895KB）（568）

Save

Aiming at the problem that the Local Mean-based Pseudo Nearest Neighbor (LMPNN) algorithm is sensitive to the value of k and ignores the different influence of different attributes on the classification results, a Parameter Independent Weighted Local Mean-based Pseudo Nearest Neighbor classification (PIW-LMPNN) algorithm was proposed. Firstly, the Success-History based parameter Adaptation for Differential Evolution (SHADE) algorithm, the latest variant of differential evolution algorithm, was used to optimize the training set samples to obtain the best k value and a set of best weights related to the classes. Secondly, when calculating the distance between samples, different weights were assigned to different attributes of different classes, and the test set samples were classified. Finally, simulations were performed on 15 real datasets and the proposed algorithm was compared to other eight classification algorithms. The results show that the proposed algorithm has the classification accuracy and F1 value increased by about 28 percentage points and 23.1 percentage points respectively. At the same time, the comparision results of Wilcoxon signed-rank test, Friedman rank variance test and Hollander-Wolfe's pairwise processing show that the proposed improved algorithm outperforms the other eight classification algorithms in terms of classification accuracy and k value selection.

Reference | Related Articles | Metrics

Select

Influence maximization algorithm based on user interactive representation

ZHANG Meng, LI Weihua

Journal of Computer Applications 2021, 41 (7): 1964-1969. DOI: 10.11772/j.issn.1001-9081.2020081225

Abstract （390）

PDF （952KB）（384）

Save

The problem of influence maximization is to select a group of effective seed users in social networks, through which information can reach the largest scope of spread. Traditional researches on influence maximization rely on the specific network structures and diffusion models, however, the manually processed simplified networks and the diffusion models based on assumptions have great limitations on assessing the real influence of users. To solve this problem, an Influence Maximization algorithm based on User Interactive Representation (IMUIR) was proposed. First, the context pairs were constructed through random sampling according to users' interaction traces, and the vector representations of the users were obtained by the SkipGram model training. Then, the greedy strategy was used to select the best seed set according to the activity degrees of the source users and the interaction degrees between these users with other users. To verify the effectiveness of IMUIR, experiments were conducted to compare it with Random, Average Cascade (AC), Kcore and Imfector algorithms on two social networks with real interactive information. The results show that IMUIR selects the seed set with higher quality, produces a wider scope of influence spread, and performs stablely on the two datasets.

Reference | Related Articles | Metrics

Select

Deep network embedding method based on community optimization

LI Yafang, LIANG Ye, FENG Weiwei, ZU Baokai, KANG Yujian

Journal of Computer Applications 2021, 41 (7): 1956-1963. DOI: 10.11772/j.issn.1001-9081.2020081193

Abstract （531）

PDF （1616KB）（518）

Save

With the rapid development of technologies such as modern network communication and social media, the networked big data is difficult to be applied due to the lack of efficient and available node representation. Network representation learning is widely concerned by transforming high-dimensional sparse network data into low-dimensional, compact and easy-to-apply node representation. However, the existing network embedding methods obtain the low-dimensional feature vectors of nodes and then use them as the inputs for other applications (such as node classification, community discovery, link prediction and visualization) for further analysis, without building models for specific applications, which makes it difficult to achieve satisfactory results. For the specific application of network community discovery, a deep auto-encoder clustering model that combines community structure optimization for low-dimensional feature representation of nodes was proposed, namely Community-Aware Deep Network Embedding (CADNE). Firstly, based on the deep auto-encoder model, the node low-dimensional representation was learned by maintaining the topological characteristics of the local and global links of the network, and then the low-dimensional representation of the nodes was further optimized by using the network clustering structure. In this method, the low-dimensional representations of the nodes and the indicator vectors of the communities that the nodes belong to were learnt at the same time, so that the low-dimensional representation of the nodes can not only maintain the topological characteristics of the original network structure, but also maintain the clustering characteristics of the nodes. Comparing with the existing classical network embedding methods, the results show that CADNE achieves the best clustering results on Citeseer and Cora datasets, and improves the accuracy by up to 0.525 on 20NewsGroup. In classification task, CADNE performs the best on Blogcatalog and Citeseer datasets and the performance on Blogcatalog is improved by up to 0.512 with 20% training samples. In the visualization comparison, CADNE molel can get a low-dimensional representation of nodes with clearer class boundary, which verifies that the proposed method has better low-dimensional representation ability of nodes.

Reference | Related Articles | Metrics

Select

Ensemble classification model for distributed drifted data streams

YIN Chunyong, ZHANG Guojie

Journal of Computer Applications 2021, 41 (7): 1947-1955. DOI: 10.11772/j.issn.1001-9081.2020081277

Abstract （432）

PDF （1255KB）（379）

Save

Aiming at the problem of low classification accuracy in big data environment, an ensemble classification model for distributed data streams was proposed. Firstly, the microcluster mode was used to reduce the amount of data transmitted from local nodes to the central nodes, so as to reduce the communication cost. Secondly, the training samples of the global classifier were generated by using the sample reconstruction algorithm. Finally, an ensemble classification model for drift data streams was proposed, which adopted the weighted combination strategy of dynamic classifiers and steady classifiers, and the mixed labeling strategy was used to label the most representative instances to update the ensemble model. Experiments on two virtual datasets and two real datasets showed that the model suffered less fluctuation from concept drift compared with two distributed mining models DS-means and BDS-ensemble, and had higher accuracy than Online Active Learning Ensemble model (OALEnsemble), with the accuracy on four datasets improved by 1.58、0.97、0.77 and 1.91 percentage points respectively. Although the memory consumption of this model was slightly higher than those of BDS-ensemble and DS-means models, this model was able to improve the classification performance at a lower memory cost. Therefore, the model is suitable for the classification of big data with distributed and mobility characteristics, such as network monitoring and banking business system.

Reference | Related Articles | Metrics

Select

Impact and enhancement of similarity features on link prediction

CAI Biao, LI Ruicen, WU Yuanyuan

Journal of Computer Applications 2021, 41 (9): 2569-2577. DOI: 10.11772/j.issn.1001-9081.2020111744

Abstract （343）

PDF （4634KB）（412）

Save

Link prediction focuses on the design of prediction algorithms that can describe a given network mechanism more accurately to achieve the prediction result with higher accuracy. Based on an analysis of the existing research achievements, it is found that the similarity characteristics of a network has a great impact on the link prediction method used. In networks with low tag similarity between nodes, increasing the tag similarity is able to improve the prediction accuracy; in networks with high tag similarity between nodes, more attention should be paid to the contribution of structural information to link prediction to improve the prediction accuracy. Then, a tag-weighted similarity algorithm was proposed by weighting the tags, which was able to improve the accuracy of link prediction in networks with low similarity. Meanwhile, in networks with relatively high similarity, the structural information of the network was introduced into the node similarity calculation, and the accuracy of link prediction was improved through the preferential attachment mechanism. Experimental results on four real networks show that the proposed algorithm achieves the highest accuracy compared to the comparison algorithms Cosine Similarity between Tag Systems (CSTS), Preferential Attachment (PA), etc. According to the network similarity characteristics, using the proposed corresponding algorithm for link prediction can obtain more accurate prediction results.

Reference | Related Articles | Metrics

Select

Differential privacy high-dimensional data publishing method via clustering analysis

CHEN Hengheng, NI Zhiwei, ZHU Xuhui, JIN Yuanyuan, CHEN Qian

Journal of Computer Applications 2021, 41 (9): 2578-2585. DOI: 10.11772/j.issn.1001-9081.2020111786

Abstract （458）

PDF （1281KB）（459）

Save

Aiming at the problem that the existing differential privacy high-dimensional data publishing methods are difficult to take into account both the complex attribute correlation between data and computational cost, a differential privacy high-dimensional data publishing method based on clustering analysis technology, namely PrivBC, was proposed. Firstly, the attribute clustering method was designed based on the K-means++, the maximum information coefficient was introduced to quantify the correlation between the attributes, and the data attributes with high correlation were clustered. Secondly, for each data subset obtained by the clustering, the correlation matrix was calculated to reduce the candidate space of attribute pairs, and the Bayesian network satisfying differential privacy was constructed. Finally, each attribute was sampled according to the Bayesian networks, and a new private dataset was synthesized for publishing. Compared with PrivBayes method, PrivBC method had the misclassification rate and running time reduced by 12.6% and 30.2% averagely and respectively. Experimental results show that the proposed method can significantly improve the computational efficiency with ensuring the data availability, and provides a new idea for the private publishing of high-dimensional big data.

Reference | Related Articles | Metrics

Select

Parallel decompression algorithm for high-speed train monitoring data

WANG Zhoukai, ZHANG Jiong, MA Weigang, WANG Huaijun

Journal of Computer Applications 2021, 41 (9): 2586-2593. DOI: 10.11772/j.issn.1001-9081.2020111173

Abstract （363）

PDF （1272KB）（370）

Save

The real-time monitoring data generated by high-speed trains during running are usually processed by variable-length coding compression technology, which is convenient for transmission and storage. However, this method will complicate the internal structure of the compressed data, so that the corresponding data decompression process must follow the composition order of the compressed data, which is inefficient. In order to improve the decompression efficiency of high-speed train monitoring data, a parallel decompression algorithm for high-speed train monitoring data was proposed with the help of the speculation technology. Firstly, the structural characteristics of high-speed train monitoring data were studied, and the internal dependence that affects data division was analyzed. Secondly, the speculation technology was used to clean up internal dependence, and then, the data were divided into different parts tentatively. Thirdly, the division results were decompressed in a distributed computing environment in parallel. Finally, the parallel decompression results were combined together. Through this way, the decompression efficiency of high-speed train monitoring data was improved. Experimental results showed that on the computing cluster composed of 7 computing nodes, compared with the serial algorithm, the speedup of the proposed speculative parallel algorithm was about 3, showing a good performance of this algorithm. It can be seen that this algorithm can improve the monitoring data decompression efficiency significantly.

Reference | Related Articles | Metrics

Select

Diversity represented deep subspace clustering algorithm

Zhifeng MA, Junyang YU, Longge WANG

Journal of Computer Applications 2023, 43 (2): 407-412. DOI: 10.11772/j.issn.1001-9081.2021122126

Abstract （455）

HTML （16）

PDF （1851KB）（170）

Save

Focusing on the challenge task for mining complementary information in different levels of features in the deep subspace clustering problem， based on the deep autoencoder， by exploring complementary information between the low-level and high-level features obtained by the encoder， a Diversity Represented Deep Subspace Clustering （DRDSC） algorithm was proposed. Firstly， based on Hilbert-Schmidt Independence Criterion （HSIC）， a diversity representation measurement model was established for different levels of features. Secondly， a feature diversity representation module was introduced into the deep autoencoder network structure， which explored image features beneficial to enhance the clustering effect. Furthermore， the form of loss function was updated to effectively fuse the underlying subspaces of multi-level representation. Finally， several experiments were conducted on commonly used clustering datasets. Experimental results show that on the datasets Extended Yale B， ORL， COIL20 and Umist， the clustering error rates of DRDSC reach 1.23%， 10.50%， 1.74% and 17.71%， respectively， which are reduced by 10.41， 16.75， 13.12 and 12.92 percentage points， respectively compared with those of Efficient Dense Subspace Clustering （EDSC）， and are reduced by 1.44， 3.50， 3.68 and 9.17 percentage points， respectively compared with Deep Subspace Clustering （DSC）， which indicates that the proposed DRDSC algorithm has better clustering effect.

Table and Figures | Reference | Related Articles | Metrics

Select

Reviewer recommendation algorithm based on affinity and research direction coverage

Lei ZHONG, Yunsheng ZHOU, Dunhui YU, Haibo CUI

Journal of Computer Applications 2023, 43 (2): 430-436. DOI: 10.11772/j.issn.1001-9081.2021122127

Abstract （412）

HTML （14）

PDF （2659KB）（72）

Save

To deal with the problem that the existing reviewer recommendation algorithms assign reviewers only through affinity score and ignore the research direction matching between reviewers and papers to be reviewed， a reviewer recommendation algorithm based on Affinity and Research Direction Coverage （ARDC） was proposed. Firstly， the order of the paper’s selection of reviewers was determined according to the frequencies of the research directions appearing in the papers and the reviewer’s paper groups. Secondly， the reviewer’s comprehensive review score to the paper to be reviewed was calculated based on the affinity score between the reviewers and the paper to be reviewed and the research direction coverage score of the reviewers to the paper to be reviewed， and the pre-assigned review team for the paper was obtained on the basis of round-robin scheduling. Finally， the final recommendation of the review team was realized based on the conflict of interest conflict inspection and resolution. Experimental results show that compared with assignment based reviewer recommendation algorithms such as Fair matching via Iterative Relaxation （FairIR） and Fair and Accurate reviewer assignment in Peer Review （PR4A）， the proposed algorithm has the average research direction coverage score increased by 38% on average at the expense of a small amount of affinity score， so that the recommendation result is more accurate and reasonable.

Table and Figures | Reference | Related Articles | Metrics

Select

Partial periodic pattern incremental mining of time series data based on multi-scale

Yaling XUN, Linqing WANG, Jianghui CAI, Haifeng YANG

Journal of Computer Applications 2023, 43 (2): 391-397. DOI: 10.11772/j.issn.1001-9081.2021122190

Abstract （448）

HTML （9）

PDF （2226KB）（153）

Save

Aiming at the problems of high computational complexity and poor expansibility in the mining process of partial periodic patterns from dynamic time series data， a partial periodic pattern mining algorithm for dynamic time series data combined with multi-scale theory， named MSI-PPPGrowth （Multi-Scale Incremental Partial Periodic Frequent Pattern） was proposed. In MSI-PPPGrowth， the objective multi-scale characteristics of time series data， were made full use， and the multi-scale theory was introduced in the mining process of partial periodic patterns from time series data. Firstly， both the original data after scale division and incremental time series data were used as a finer-grained benchmark scale dataset for independent mining. Then， the correlation between different scales was used to realize scale transformation， so as to indirectly obtain global frequent patterns corresponding to the dynamically updated dataset. Therefore， the repeated scanning of the original dataset and the constant adjustment of the tree structure were avoided. In which， a new frequent missing count estimation model PJK-EstimateCount was designed based on Kriging method considering the periodicity of time series to effectively estimate the frequent missing item support count in scale transformation. Experimental results show that MSI-PPPGrowth has good scalability and real-time performance. Especially for dense datasets， MSI-PPPGrowth has significant performance advantages.

Table and Figures | Reference | Related Articles | Metrics

Select

Efficient complex event matching algorithm based on ordered event lists

Tao QIU, Jianli DING, Xiufeng XIA, Hongmei XI, Peiliang XIE, Qingyi ZHOU

Journal of Computer Applications 2023, 43 (2): 423-429. DOI: 10.11772/j.issn.1001-9081.2021122186

Abstract （437）

HTML （14）

PDF （2336KB）（112）

Save

Aiming at the problem of high matching cost in the existing complex event matching processing methods， a complex event matching algorithm ReCEP was proposed， which uses event buffers （ordered event lists） for recursive traversal. Different from the existing method that uses automaton to match on the event stream， this method decomposes the constraints in the complex event query mode into different types， and then recursively verifies the different constraints on the ordered list. Firstly， according to the query mode， the related event instances were cached according to the event type. Secondly， the query filtering operation was performed to the event instances on the ordered list， and an algorithm based on recursive traversal was given to determine the initial event instance and obtain candidate sequence. Finally， the attribute constraints of the candidate sequence were further verified. Experimental testing and analysis results based on simulated stock transaction data show that compared with the current mainstream matching methods SASE and Siddhi， ReCEP algorithm can effectively reduce the processing time of query matching， has overall performance better than both of the two methods， and has the query matching efficiency improved by more than 8.64%. It can be seen that the proposed complex event matching method can effectively improve the efficiency of complex event processing.

Table and Figures | Reference | Related Articles | Metrics

Select

Fast sanitization algorithm based on BCU-Tree and dictionary for high-utility mining

Chunyong YIN, Ying LI

Journal of Computer Applications 2023, 43 (2): 413-422. DOI: 10.11772/j.issn.1001-9081.2021122161

Abstract （398）

HTML （11）

PDF （2958KB）（121）

Save

Privacy Preserving Utility Mining （PPUM） has problems of long sanitization time， high computational complexity， and high side effect. To solve these problems， a fast sanitization algorithm based on BCU-Tree and Dictionary （BCUTD） for high-utility mining was proposed. In the algorithm， a new tree structure called BCU-Tree was presented to store sensitive item information， and based on the bitwise operator coding model， the tree construction time and search space were reduced. The dictionary table was used to store all nodes in the tree structure， and only the dictionary table needed to be accessed when the sensitive item was modified. Finally， the sanitization process was completed. In the experiments on four different datasets， BCUTD algorithm has better performance on sanitization time and high side effect than Hiding High Utility Item First （HHUIF）， Maximum Sensitive Utility-MAximum item Utility （MSU-MAU）， and Fast Perturbation Using Tree and Table structures （FPUTT）. Experimental results show that BCUTD algorithm can effectively speed up the sanitization process， reduce the side effect and computational complexity of the algorithm.

Table and Figures | Reference | Related Articles | Metrics

Select

Parameter calculation algorithm of structural graph clustering driven by instance clusters

Chuanyu ZONG, Chao XIAN, Xiufeng XIA

Journal of Computer Applications 2023, 43 (2): 398-406. DOI: 10.11772/j.issn.1001-9081.2022010082

Abstract （1753）

HTML （19）

PDF （2584KB）（88）

Save

Clustering results of the pSCAN （pruned Structural Clustering Algorithm for Network） algorithm are influenced by the density constraint parameter and the similarity threshold parameter. If the requirements cannot be satisfied by the clustering results obtained by the clustering parameters provided by the user， then the user’s own clustering requirements can be expressed through instance clusters. Aiming at the problem of instance clusters expressing clustering query requirements， an instance cluster-driven structural graph clustering parameter calculation algorithm PART and its improved algorithm ImPART were proposed. Firstly， the influences of two clustering parameters on the clustering results were analyzed， and correlation subgraph of instance cluster was extracted. Secondly， the feasible interval of the density constraint parameter was obtained by analyzing the correlation subgraph， and the nodes in the instance cluster were divided into core nodes and non-core nodes according to the current density constraint parameter and the structural similarity between nodes. Finally， according to the node division result， the optimal similarity threshold parameter corresponding to the current density constraint parameter was calculated， and the obtained parameters were verified and optimized on the relevant subgraph until the clustering parameters that satisfy the requirements of the instance cluster were obtained. Experimental results on real datasets show that a set of effective parameters can be returned for user instance clusters by using the proposed algorithm， and the proposed improved algorithm ImPART is more than 20% faster than the basic algorithm PART， and can return the optimal clustering parameters that satisfy the requirements of instance clusters quickly and effectively for the user.

Table and Figures | Reference | Related Articles | Metrics

Select

Range query algorithm for large scale moving objects in distributed environment

MA Yongqiang, CHEN Xiaomeng, YU Ziqiang

Journal of Computer Applications 2023, 43 (1): 111-121. DOI: 10.11772/j.issn.1001-9081.2021101853

Abstract （330）

HTML （11）

PDF （3320KB）（69）

Save

Continuous range queries over moving objects is essential to many location-based services. Aiming at this issue， a distributed search method was proposed for processing concurrent range queries over large scale moving objects. Firstly， formed by a Global Grid Index （GGI） and a local elastic quadtree， a Distributed Dynamic Index （DDI） structure was proposed. Then， a Distributed Search Algorithm （DSA） was proposed based on DDI structure. At the first time， an incremental strategy of updating the query results as objects and query points continuously changed their locations was introduced by DSA. After that， in the incremental update process， a shared computing optimization strategy for multiple concurrent queries was introduced， to incrementally search the range query results of the moving object according to the existing calculation results. Finally， three moving object datasets with different spatial distributions were simulated on the basis of the German road network， and NS （Naive Search）， GI （Grid Index） and Distributed Hybrid Index （DHI） were compared with the proposed algorithm. The results show that compared with DHI， the comparison algorithm with the best performance， DSA decreases the initial query time by 22.7%， while drops the incremental query time by 15.2%， verifying that DSA is superior to the comparison algorithms.

Reference | Related Articles | Metrics

Select

Local community detection algorithm based on Monte-Carlo iterative solving strategy

LI Zhanli, LI Ying, LUO Xiangyu, LUO Yingxiao

Journal of Computer Applications 2023, 43 (1): 104-110. DOI: 10.11772/j.issn.1001-9081.2021111942

Abstract （338）

HTML （11）

PDF （1690KB）（111）

Save

Aiming at the problems of premature convergence and low recall caused by using greedy strategy for community expansion in the existing local community detection algorithms， a local community detection algorithm based on Monte-Carlo iterative solving strategy was proposed. Firstly， in the community expansion stage of each iteration， the selection probabilities were given to all adjacent candidate nodes according to the contribution ratio of each node to the community tightness gain， and one node was randomly selected to join the community according to these probabilities. Then， in order to avoid random selection causing the expansion direction to deviate from the target community， it was determined whether the node elimination mechanism was triggered in this round of iteration according to the changes in community quality. If it was triggered， the similarity sum of each node joining the community and other nodes in the community was calculated， the elimination probabilities were assigned according to the reciprocal of the similarity sum， a node was randomly eliminated according to these probabilities. Finally， whether to continue the iteration was judged on the basis of whether the community size increased in a given number of recent iteration rounds. Experimental results show that， on three real network datasets， compared to Local Tightness Expansion （LTE） algorithm， Clauset algorithm， Common Neighbors with Weighted Neighbor Nodes （CNWNN） algorithm and Fuzzy Similarity Relation （FSR） algorithm， the proposed algorithm has the F-score value of local community detection results increased by 32.75 percentage points， 17.31 percentage points， 20.66 percentage points and 25.51 percentage points respectively， and can effectively avoid the influence of the location of the query node in the community on the local community detection results.

Reference | Related Articles | Metrics

Project Articles