Clustering results of the pSCAN (pruned Structural Clustering Algorithm for Network) algorithm are influenced by two parameters: the density constraint and the similarity threshold. When the clustering results obtained with user-supplied parameters fail to meet the user's requirements, the user can instead express the clustering requirements through instance clusters. Aiming at the problem of expressing clustering query requirements through instance clusters, an instance cluster-driven structural graph clustering parameter calculation algorithm, PART, and its improved version, ImPART, were proposed. Firstly, the influences of the two clustering parameters on the clustering results were analyzed, and the correlation subgraph of the instance cluster was extracted. Secondly, the feasible interval of the density constraint parameter was obtained by analyzing the correlation subgraph, and the nodes in the instance cluster were divided into core nodes and non-core nodes according to the current density constraint parameter and the structural similarity between nodes. Finally, according to the node division result, the optimal similarity threshold corresponding to the current density constraint parameter was calculated, and the obtained parameters were validated and optimized on the correlation subgraph until clustering parameters satisfying the requirements of the instance cluster were obtained. Experimental results on real datasets show that the proposed algorithms return a set of effective parameters for a user-given instance cluster, and that ImPART is more than 20% faster than the basic PART algorithm while returning the optimal clustering parameters quickly and effectively.
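As a minimal sketch of the structural similarity and core-node test that pSCAN-style clustering builds on (the graph representation and helper names here are illustrative, not the paper's):

import math

def structural_similarity(adj, u, v):
    """Cosine-style similarity over the closed neighborhoods N[u], N[v]."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def is_core(adj, u, eps, mu):
    """u is a core node if at least mu of N[u] are eps-similar to u."""
    return sum(structural_similarity(adj, u, v) >= eps
               for v in adj[u] | {u}) >= mu

# Toy graph as adjacency sets: a triangle with one pendant node.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(is_core(adj, 1, eps=0.6, mu=3))  # True inside the dense triangle

Raising eps or mu shrinks the set of core nodes, which is exactly why a feasible parameter interval can be derived from the instance cluster.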
Short-term traffic flow prediction depends not only on historical data but also on the traffic of adjacent areas. Since the trend and spatial correlation of traffic flow are ignored by traditional Time Series Decomposition (TSD) models, a time series processing model combining Time Series Decomposition and Spatio-Temporal features (TSD-ST) was proposed. Firstly, the trend component and the periodic component were obtained by using Empirical Mode Decomposition (EMD) and the Discrete Fourier Transform (DFT), the Spatio-Temporal (ST) correlation of the fluctuation component was mined by the Mutual Information (MI) algorithm, and the state vector was reconstructed on this basis. Then, the fluctuation component was predicted from the state vector by a Long Short-Term Memory (LSTM) network. Finally, the final predicted value was obtained by combining the prediction results of the three components. The validity of the model was verified on real data from Interstate I-90 in Washington State, USA. Experimental results show that the Root Mean Square Error (RMSE) of the proposed TSD-ST-LSTM model is reduced by 16.5%, 34.0% and 36.6% compared with that of Support Vector Regression (SVR), Gradient Boosting Regression Tree (GBRT) and LSTM, respectively. It can be seen that the proposed model is effective in improving prediction accuracy.
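A hedged sketch of the decomposition step, assuming the PyEMD package (EMD-signal) for EMD and numpy's FFT for the periodic component; the number of retained frequencies and all names are illustrative, not the paper's configuration:

import numpy as np
from PyEMD import EMD  # assumed stand-in for the paper's EMD step

def decompose(series, n_periodic=3):
    """Split a flow series into trend, periodic and fluctuation parts."""
    emd = EMD()
    emd.emd(series)
    _, residue = emd.get_imfs_and_residue()
    trend = residue                        # EMD residue ~ long-term trend
    detrended = series - trend
    spec = np.fft.rfft(detrended)
    keep = np.argsort(np.abs(spec))[-n_periodic:]   # strongest frequencies
    mask = np.zeros_like(spec)
    mask[keep] = spec[keep]
    periodic = np.fft.irfft(mask, n=len(series))    # periodic component
    fluctuation = detrended - periodic     # later fed to the LSTM
    return trend, periodic, fluctuation

t = np.linspace(0, 8 * np.pi, 512)
flow = 0.05 * t + np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(512)
trend, periodic, fluct = decompose(flow)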
The identification of key nodes in complex networks plays an important role in optimizing network structure and propagating information effectively. Local structural Entropy (LE) identifies key nodes by using the influence of the local network on the whole network instead of the influence of individual nodes on the whole network. However, LE does not consider highly clustered networks or nodes that form loops with their neighbors, which leads to some limitations. To address these limitations, firstly, an improved LE based node importance evaluation method, PLE (Penalized Local structural Entropy), was proposed, in which the Clustering Coefficient (CC) was introduced into LE as a penalty term to appropriately penalize highly clustered nodes in the network. Secondly, since PLE over-penalizes nodes in triadic closure structures, an improved version of PLE, PLEA (Penalized Local structural Entropy Advancement), was proposed, in which a control coefficient was placed in front of the penalty term to control the penalty strength. Selective attack experiments on five real networks of different sizes were conducted. Experimental results show that, on the Western US power grid and the US Airlines network, the identification accuracy of PLEA is improved by 26.3% and 3.2% compared with LE, by 380% and 5.43% compared with the K-Shell (KS) method, and by 14.4% and 24% compared with the DCL (Degree and Clustering coefficient and Location) method, respectively. The key nodes identified by PLEA cause more damage to the network, verifying the rationality of introducing the CC as a penalty term as well as the effectiveness and superiority of PLEA. By integrating the number of neighbors and the local network structure of nodes while remaining simple to compute, PLEA is more effective in describing the reliability and invulnerability of large-scale networks.
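An illustrative networkx sketch of a penalized local structural entropy score; the exact PLEA formula is defined in the paper, so the combination LE * (1 - lam * C) below is only an assumed form showing how the clustering coefficient can act as a penalty term with a control coefficient lam:

import math
import networkx as nx

def local_structural_entropy(G, i):
    """Degree-distribution entropy over the closed neighborhood of i."""
    nodes = list(G.neighbors(i)) + [i]
    total = sum(G.degree(j) for j in nodes)
    probs = [G.degree(j) / total for j in nodes]
    return -sum(p * math.log(p) for p in probs if p > 0)

def plea_score(G, i, lam=0.5):
    """Penalize highly clustered (triadic) neighborhoods; assumed form."""
    return local_structural_entropy(G, i) * (1 - lam * nx.clustering(G, i))

G = nx.karate_club_graph()
ranking = sorted(G.nodes, key=lambda i: plea_score(G, i), reverse=True)
print(ranking[:5])  # candidate key nodes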
Most existing multi-view clustering algorithms assume a linear relationship between multi-view data points and fail to preserve the locality of the original feature space during learning. At the same time, merging subspaces in Euclidean space is too rigid to align the learned subspace representations. To solve these problems, a multi-view clustering algorithm via subspace merging on the Grassmann manifold was proposed. Firstly, the kernel trick and the learning of the local manifold structure were combined to obtain the subspace representations of different views. Then, these subspace representations were merged on the Grassmann manifold to obtain a consensus affinity matrix. Finally, spectral clustering was performed on the consensus affinity matrix to obtain the final clustering result, and the Alternating Direction Method of Multipliers (ADMM) was used to optimize the proposed model. Compared with the Kernel Multi-view Low-Rank Sparse Subspace Clustering (KMLRSSC) algorithm, the proposed algorithm improves the clustering accuracy by 20.83, 9.47 and 7.33 percentage points on the MSRCV1, Prokaryotic and Notting-Hill datasets, respectively. Experimental results verify the effectiveness and good performance of the proposed multi-view clustering algorithm via subspace merging on the Grassmann manifold.
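A hedged sketch of the merging idea: each view's subspace is represented by its projection matrix, and the projections are averaged (the extrinsic, chordal mean on the Grassmann manifold) to form a consensus affinity that is then clustered spectrally. The toy affinities and the choice of k are assumptions; the paper's full ADMM formulation is not reproduced:

import numpy as np
from sklearn.cluster import SpectralClustering

def view_subspace(affinity, k):
    """Top-k eigenvectors of a view's affinity: a point on Gr(n, k)."""
    _, vecs = np.linalg.eigh(affinity)
    return vecs[:, -k:]                    # n x k orthonormal basis

def merge_on_grassmann(affinities, k):
    """Average the projection matrices U U^T across views (chordal mean)."""
    proj = [view_subspace(A, k) @ view_subspace(A, k).T for A in affinities]
    return sum(proj) / len(proj)           # consensus affinity

rng = np.random.default_rng(0)
views = [np.abs(rng.standard_normal((60, 60))) for _ in range(3)]
views = [(A + A.T) / 2 for A in views]     # symmetrize the toy affinities
W = np.abs(merge_on_grassmann(views, k=4))
labels = SpectralClustering(n_clusters=4, affinity='precomputed',
                            random_state=0).fit_predict(W)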
In automated industrial scenarios, the amount of time series log data generated by large numbers of industrial devices has exploded, and the demand for access to time series data in business scenarios has further increased. Although HBase, a distributed column-family database, can store industrial time series big data, existing strategies cannot satisfy the specific access requirements of industrial time series data well, because the correlation between data and the access behavior characteristics of specific business scenarios are not considered. To address this problem, a distributed storage performance optimization strategy for massive industrial time series data was proposed on the basis of the distributed storage system HBase, exploiting the correlation between data and access behavior characteristics in industrial scenarios. Aiming at the load skew caused by the characteristics of industrial time series data, a load balancing optimization strategy based on hot/cold data partitioning and access behavior classification was proposed: the data were classified into cold and hot ones by a Logistic Regression (LR) model, and the hot data were distributed across different nodes. In addition, to further reduce cross-node communication overhead in the storage cluster and improve the query efficiency of the high-dimensional index of industrial time series data, a strategy of placing the index and its main data in the same Region was proposed: by designing the index RowKey fields and splicing rules, each index entry was stored in the same Region as its corresponding main data. Experimental results on real industrial time series data show that the load distribution skew is reduced by 28.5% and the query efficiency is improved by 27.7% after introducing the optimization strategy, demonstrating that the proposed strategy can effectively mine the access patterns of specific time series data, distribute load reasonably, reduce data access overhead, and meet the access requirements of specific time series big data.
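A hedged illustration of the RowKey splicing idea: the index row reuses the salt/prefix of its main-data row, so HBase range partitioning places both in the same Region. The field names, widths and salting rule are assumptions of this sketch, not the paper's exact layout:

import zlib

def main_rowkey(device_id: str, ts_ms: int) -> bytes:
    salt = f"{zlib.crc32(device_id.encode()) % 16:02d}"  # load-balancing salt
    return f"{salt}|{device_id}|{ts_ms:013d}".encode()

def index_rowkey(device_id: str, metric: str, value: float, ts_ms: int) -> bytes:
    salt = f"{zlib.crc32(device_id.encode()) % 16:02d}"  # same salt => same Region
    return f"{salt}|{device_id}|IDX|{metric}|{value:012.3f}|{ts_ms:013d}".encode()

print(main_rowkey("pump-07", 1700000000000))
print(index_rowkey("pump-07", "temperature", 81.5, 1700000000000))

Because both keys share the leading salt and device prefix, an index lookup and the follow-up read of the main row stay within one Region and avoid a cross-node hop.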
Focusing on anomaly detection for Unmanned Aerial Vehicle (UAV) flight data in the field of UAV airborne health monitoring, firstly, the characteristics of UAV flight data, the common types of flight data anomalies, and the corresponding requirements on anomaly detection algorithms were presented. Then, existing research on UAV flight data anomaly detection algorithms was reviewed, and the algorithms were classified into three categories: prior-knowledge based algorithms for qualitative anomaly detection, model-based algorithms for quantitative anomaly detection, and data-driven anomaly detection algorithms. The application scenarios, advantages and disadvantages of these algorithms were also analyzed. Finally, the current problems and challenges of UAV anomaly detection algorithms were summarized, and key development directions of the field were outlined, providing reference ideas for future research.
Aiming at the problems of high computational complexity and poor extensibility in mining partial periodic patterns from dynamic time series data, a partial periodic pattern mining algorithm for dynamic time series data combined with multi-scale theory, named MSI-PPPGrowth (Multi-Scale Incremental Partial Periodic Frequent Pattern), was proposed. In MSI-PPPGrowth, full use was made of the objective multi-scale characteristics of time series data, and multi-scale theory was introduced into the mining of partial periodic patterns from time series data. Firstly, both the original data after scale division and the incremental time series data were used as finer-grained benchmark scale datasets for independent mining. Then, the correlation between different scales was used to realize scale transformation, so as to indirectly obtain the global frequent patterns corresponding to the dynamically updated dataset, thereby avoiding repeated scanning of the original dataset and constant adjustment of the tree structure. Moreover, a new support count estimation model, PJK-EstimateCount, was designed based on the Kriging method and the periodicity of the time series, to effectively estimate the support counts of frequent items missing during scale transformation. Experimental results show that MSI-PPPGrowth has good scalability and real-time performance, with particularly significant performance advantages on dense datasets.
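A minimal sketch of the scale-transformation idea: per-scale support counts of a pattern are combined into a coarser-scale count, and counts missing at some scales are estimated from the observed ones. Plain linear interpolation stands in here for the paper's Kriging-based PJK-EstimateCount model:

import numpy as np

def global_support(scale_counts):
    """scale_counts: per-benchmark-scale counts of one pattern, None if missing."""
    idx = np.arange(len(scale_counts))
    known = [i for i, c in enumerate(scale_counts) if c is not None]
    vals = [scale_counts[i] for i in known]
    # Estimate missing counts from neighbouring scales (Kriging stand-in).
    filled = np.interp(idx, known, vals)
    return filled.sum()

# Pattern observed in 4 of 6 fine-scale segments; 2 counts are estimated.
print(global_support([12, None, 9, 11, None, 10]))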
Traditional recommendation algorithms pay too much attention to recommendation precision, which leads to a high recommendation rate for popular items while unpopular items receive little attention for a long time: the classic long-tail problem. In response to this problem, a Multi-objective Dimension Optimization recommendation Model (MDOM), the Adaptive Group Reranking recommendation Model (AGRM), was proposed, built on a two-dimensional weighted similarity based on Euclidean distance and incorporating adaptive group reranking. Firstly, a two-dimensional weighted similarity measure was constructed using Euclidean distance, the replacement ratio was set dynamically according to individuals' historical behavior records, and the long-tail recommendation problem was solved by a multi-objective optimization algorithm integrated with the group. Secondly, two concise objective functions were designed, taking popularity and long-tail attention into account to reduce the complexity of the objective functions. Thirdly, based on the two-dimensional weighted similarity measure, a subset of users was selected as the "best recommended user group", and the Pareto optimal solution was calculated. Experimental results on the MovieLens 1M and Yahoo datasets show that AGRM achieves the best coverage, with average improvements of 4.11 and 25.38 percentage points respectively over the Item-based Collaborative Filtering (ItemCF) algorithm, and of 8.38 and 33.19 percentage points respectively over the Deep Variational Autoencoder with Shallow Parallel Path for Top-N Recommendation (VASP) model. On the Yahoo dataset, the average popularity of AGRM's recommendations is the lowest, indicating that AGRM can recommend more long-tail items.
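An illustrative two-dimensional weighted similarity in the spirit of AGRM, combining a rating-vector distance and a popularity distance with a weight alpha; the exact dimensions and weighting used in the paper may differ:

import numpy as np

def weighted_similarity(u, v, pop_u, pop_v, alpha=0.7):
    """Similarity from Euclidean distances over two feature dimensions."""
    d_rating = np.linalg.norm(u - v)       # rating-vector distance
    d_pop = abs(pop_u - pop_v)             # popularity-profile distance
    dist = alpha * d_rating + (1 - alpha) * d_pop
    return 1.0 / (1.0 + dist)              # map distance to (0, 1]

u = np.array([4.0, 3.5, 5.0])
v = np.array([4.5, 3.0, 4.0])
print(weighted_similarity(u, v, pop_u=0.2, pop_v=0.6))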
Existing algorithms for generating materialized view asynchronous incremental maintenance tasks under Hybrid Transaction/Analytical Processing (HTAP) are mainly designed for batches of records and cannot generate such tasks for a single record, which increases disk IO overhead and degrades the performance of asynchronous incremental maintenance of materialized views under HTAP. Therefore, a single-record oriented method for generating materialized view asynchronous incremental maintenance tasks under HTAP was proposed. Firstly, a benefit model for single-record maintenance task generation under HTAP was established. Then, a task generation algorithm was designed on the basis of Q-learning. Experimental results show that the proposed algorithm realizes materialized view asynchronous incremental maintenance task generation under HTAP for a single record, and reduces the average IOPS (Input/output Operations Per Second) by at least 8.49 times, and the average CPU utilization (2-core) and average CPU utilization (4-core) by at least 1.85 and 0.97 percentage points respectively.
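A hedged Q-learning sketch of the task-generation decision: given a coarse state (pending single-record changes, system load), the agent either emits a maintenance task now or defers. The states, actions and reward shaping here are illustrative; the paper defines its own benefit model:

import random
from collections import defaultdict

ACTIONS = ("defer", "generate_task")
Q = defaultdict(float)

def choose(state, eps=0.1):
    """Epsilon-greedy action selection over the two maintenance actions."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard Q-learning temporal-difference update."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Toy episode: reward trades maintenance IO cost against view staleness.
state = (3, "low")                    # (pending changes, system load)
a = choose(state)
reward = -1.0 if a == "generate_task" else -0.3 * state[0]
update(state, a, reward, (0, "low") if a == "generate_task" else (4, "low"))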
Focusing on the challenge of mining complementary information across different levels of features in deep subspace clustering, a Diversity Represented Deep Subspace Clustering (DRDSC) algorithm was proposed on the basis of a deep autoencoder, by exploring the complementary information between the low-level and high-level features produced by the encoder. Firstly, based on the Hilbert-Schmidt Independence Criterion (HSIC), a diversity representation measurement model was established for features at different levels. Secondly, a feature diversity representation module was introduced into the deep autoencoder network structure to explore image features that enhance the clustering effect. Furthermore, the loss function was updated to effectively fuse the underlying subspaces of the multi-level representations. Finally, experiments were conducted on commonly used clustering datasets. Experimental results show that on the Extended Yale B, ORL, COIL20 and Umist datasets, the clustering error rates of DRDSC reach 1.23%, 10.50%, 1.74% and 17.71%, respectively, which are 10.41, 16.75, 13.12 and 12.92 percentage points lower than those of Efficient Dense Subspace Clustering (EDSC), and 1.44, 3.50, 3.68 and 9.17 percentage points lower than those of Deep Subspace Clustering (DSC), indicating that the proposed DRDSC algorithm has a better clustering effect.
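A minimal numpy sketch of the biased HSIC estimator with linear kernels, which can serve as a dependence/diversity measure between two feature levels; the kernel choice and normalization are illustrative:

import numpy as np

def hsic(X, Y):
    """Biased HSIC estimate between feature matrices X, Y of shape (n, d)."""
    n = X.shape[0]
    K = X @ X.T                          # linear kernel on low-level features
    L = Y @ Y.T                          # linear kernel on high-level features
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
low = rng.standard_normal((32, 8))                 # low-level encoder features
high = np.tanh(low @ rng.standard_normal((8, 8)))  # dependent high-level features
print(hsic(low, high), hsic(low, rng.standard_normal((32, 8))))

The dependent pair yields a markedly larger HSIC value than the independent pair, which is what lets HSIC quantify (and, when minimized, encourage diversity between) feature levels.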
Imbalanced classification is one of the research hotspots in machine learning, where oversampling rebalances a dataset by increasing minority samples through repeated extraction or artificial synthesis. However, most existing oversampling methods are based on the original data distribution and can hardly reveal further distribution characteristics of the dataset. To address this problem, firstly, an improved semi-supervised clustering algorithm was proposed to mine the data distribution characteristics. Secondly, based on the semi-supervised clustering results, highly confident unlabeled data (pseudo-labeled samples) were selected from minority-class clusters and added to the original training set. In this way, besides rebalancing the dataset, the distribution characteristics obtained by semi-supervised clustering could be used to assist imbalanced classification. Finally, the results of semi-supervised clustering and classification were fused to predict the final labels, further improving the imbalanced classification performance. With G-mean and Area Under Curve (AUC) as evaluation indicators, the proposed algorithm was compared with seven oversampling-/undersampling-based imbalanced classification algorithms, such as TU (Trainable Undersampling) and CDSMOTE (Class Decomposition Synthetic Minority Oversampling TEchnique), on 10 public datasets. Experimental results show that, compared with TU and CDSMOTE, the proposed algorithm increases the average AUC by 6.7% and 3.9% respectively, and improves the average G-mean by 7.6% and 2.1% respectively; it also achieves the highest average results on both evaluation indicators among all the compared algorithms. It can be seen that the proposed algorithm can effectively improve imbalanced classification performance.
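A hedged sketch of the pseudo-labelling step: unlabeled points close to the centre of a minority-dominated cluster are added to the training set with the minority label. Plain KMeans stands in for the paper's improved semi-supervised clustering, and the confidence rule (closest q fraction) is an assumption:

import numpy as np
from sklearn.cluster import KMeans

def pseudo_label_minority(X_lab, y_lab, X_unlab, minority=1, k=4, q=0.3):
    """Return indices of unlabeled points to add with the minority label."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
        np.vstack([X_lab, X_unlab]))
    lab_c, unlab_c = km.labels_[:len(X_lab)], km.labels_[len(X_lab):]
    picked = []
    for c in range(k):
        members = y_lab[lab_c == c]
        if len(members) and (members == minority).mean() > 0.5:
            d = np.linalg.norm(X_unlab - km.cluster_centers_[c], axis=1)
            cand = np.where(unlab_c == c)[0]
            cand = cand[np.argsort(d[cand])][:max(1, int(q * len(cand)))]
            picked.extend(cand.tolist())   # closest points = most confident
    return np.array(picked, dtype=int)

rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (8, 2))])
y_lab = np.array([0] * 40 + [1] * 8)
print(pseudo_label_minority(X_lab, y_lab, rng.normal(4, 1, (20, 2))))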
Aiming at the high matching cost of existing complex event matching methods, a complex event matching algorithm, ReCEP, was proposed, which performs recursive traversal over event buffers (ordered event lists). Unlike existing methods that match with automata on the event stream, ReCEP decomposes the constraints in the complex event query pattern into different types and then recursively verifies them on the ordered lists. Firstly, according to the query pattern, the related event instances were cached by event type. Secondly, a query filtering operation was performed on the event instances in the ordered lists, and a recursive-traversal based algorithm was given to determine the initial event instance and obtain candidate sequences. Finally, the attribute constraints of the candidate sequences were further verified. Experimental results on simulated stock transaction data show that, compared with the mainstream matching methods SASE and Siddhi, the ReCEP algorithm effectively reduces query matching time, outperforms both methods overall, and improves query matching efficiency by more than 8.64%. It can be seen that the proposed complex event matching method can effectively improve the efficiency of complex event processing.
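A simplified sketch of recursive matching over an ordered event buffer: for a SEQ(A, B, C)-style pattern, a partial match is extended with the next required event type and the search recurses. ReCEP's attribute predicates and pruning rules are reduced to the type constraint here:

def match_seq(buffer, pattern, start=0, partial=None, out=None):
    """buffer: list of (type, attrs) sorted by time; pattern: list of types."""
    partial = partial or []
    out = out if out is not None else []
    if len(partial) == len(pattern):
        out.append(list(partial))          # one complete candidate sequence
        return out
    want = pattern[len(partial)]
    for i in range(start, len(buffer)):
        etype, attrs = buffer[i]
        if etype == want:                  # type constraint (predicates omitted)
            partial.append((etype, attrs))
            match_seq(buffer, pattern, i + 1, partial, out)
            partial.pop()
    return out

events = [("A", {"p": 10}), ("B", {"p": 11}), ("A", {"p": 9}), ("C", {"p": 12})]
print(match_seq(events, ["A", "B", "C"]))  # the single valid A->B->C sequence

The candidate sequences returned here would then go through the attribute-constraint verification stage described above.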
Social recommendation aims to improve the recommendation effect of traditional recommendation algorithms by integrating social relations. Current social recommendation algorithms based on Network Embedding (NE) face two problems: first, they do not consider the inconsistency between objects when constructing the network, and they often rely on positive objects, which are difficult to obtain and subject to many constraints; second, they cannot use the number of ratings to eliminate overfitting during training. Therefore, an Adaptive Social Recommendation algorithm based on Negative Similarity (ASRNS) was proposed. Firstly, homogeneous networks with positive correlations were constructed through consistency analysis. Then, embedding vectors were obtained by combining weighted random walk with the Skip-Gram algorithm. Next, similarities were calculated, and the Matrix Factorization (MF) algorithm was constrained from the perspective of negative similarity. Finally, the number of ratings was mapped to an ideal rating range through an adaptive mechanism, and different penalties were imposed on the bias terms of the algorithm. Experiments were conducted on the FilmTrust and CiaoDVD datasets. The results show that, compared with algorithms such as Collaborative User Network Embedding (CUNE) and Consistent neighbor aggregation for Recommendation (ConsisRec), ASRNS reduces the Root Mean Square Error (RMSE) by at least 2.60% and 5.53% respectively, and the Mean Absolute Error (MAE) by at least 1.47% and 2.46% respectively. It can be seen that ASRNS not only reduces rating prediction error effectively, but also alleviates overfitting in the training process significantly, and is robust for objects with different numbers of ratings.
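A hedged sketch of one SGD step for matrix factorization with a negative-similarity constraint: besides fitting the rating, the user factor is penalized for aligning with the factors of negatively similar users. The penalty form and the weight lam are assumptions of this sketch, not ASRNS's exact objective:

import numpy as np

def sgd_step(P, Q, u, i, r, neg, lr=0.01, reg=0.02, lam=0.05):
    """One SGD update; neg is a list of (user v, similarity s < 0) pairs."""
    err = r - P[u] @ Q[i]
    grad_u = -err * Q[i] + reg * P[u]
    for v, s in neg:                       # penalize alignment with
        grad_u += lam * (-s) * P[v]        # negatively similar users
    grad_i = -err * P[u] + reg * Q[i]
    P[u] -= lr * grad_u
    Q[i] -= lr * grad_i
    return err ** 2                        # squared rating error

rng = np.random.default_rng(0)
P = rng.standard_normal((50, 8)) * 0.1     # user factors
Q = rng.standard_normal((40, 8)) * 0.1     # item factors
print(sgd_step(P, Q, u=3, i=7, r=4.0, neg=[(5, -0.4), (9, -0.2)]))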
In order to measure the similarity of multi-relational nodes and mine community structure from multiple node relationships, a community mining algorithm based on multiple node relationships, LSL-GN, was proposed. Firstly, based on node similarity and node reachability, a similarity measurement index for multi-relational nodes, LHN-ISL, was described to reconstruct a low-density model of the target network, and the community division was completed by combining it with the GN (Girvan-Newman) algorithm. LSL-GN was compared with several classical community mining algorithms in terms of Modularity (Q value), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). The results show that LSL-GN achieves the best results on all three indexes, indicating that its community division quality is better. The "User-Application" mobile roaming network model was divided by LSL-GN into community structures based on basic applications such as Ctrip, Amap and Didi Travel, and these community division results can provide strategic reference information for designing personalized package services.
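A brief sketch of the final division step: once the low-density model has been rebuilt with the LHN-ISL similarity (not reproduced here), communities can be extracted with the classic Girvan-Newman procedure, e.g. via networkx, keeping the partition with the highest modularity:

import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()                 # stand-in for the rebuilt network
best_part, best_q = None, -1.0
for communities in itertools.islice(girvan_newman(G), 10):
    part = [set(c) for c in communities]   # one level of the GN dendrogram
    q = modularity(G, part)
    if q > best_q:
        best_part, best_q = part, q
print(best_q, [sorted(c) for c in best_part])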