Project Articles

    Data science and technology

    Default Latest Most Read
    Please wait a minute...
    For Selected: Toggle Thumbnails
    Point-of-interest recommendation algorithm combing dynamic and static preferences
    YANG Li, WANG Shihui, ZHU Bo
    Journal of Computer Applications    2021, 41 (2): 398-406.   DOI: 10.11772/j.issn.1001-9081.2020050677
    Abstract210)      PDF (1355KB)(417)       Save
    Since most existing Point-Of-Interest (POI) recommendation algorithms ignore the complexity of the modeling of the fusion of user dynamic and static preferences, a POI recommendation algorithm called CLSR (Combing Long Short Recommendation) was proposed that combined complex dynamic user preferences and general static user preferences. Firstly, in the process of modeling complex dynamic preferences, a hybrid neural network was designed based on the user's check-in behaviors and the skip behaviors in check-in behaviors to achieve the modeling of complex dynamic interests of the user. Secondly, in the process of general static preference modeling, a high-level attention network was used to learn the complex interactions between the user and POIs. Thirdly, a multi-layer neural network was used to further learn and express the above dynamic preferences and static preferences. Finally, a unified POI recommendation framework was used to integrate the preferences. Experimental results on real datasets show that, compared with FPMC-LR (Factorizing Personalized Markov Chain and Localized Region), PRME (Personalized Ranking Metric Embedding), Rank-GeoFM (Ranking based Geographical Factorization Method) and TMCA (Temporal and Multi-level Context Attention), CLSR has the performance greatly improved, and compared to the optimal TMCA among the comparison methods, the proposed algorithm has the precision, recall and normalized Discounted Cumulative Gain (nDCG) increased by 5.8%, 5.1%, and 7.2% on Foursquare dataset, and 7.3%, 10.2%, and 6.3% on Gowalla dataset. It can be seen that CLSR algorithm can effectively improve the results of POI recommendation.
    Reference | Related Articles | Metrics
    Patent text classification based on ALBERT and bidirectional gated recurrent unit
    WEN Chaodong, ZENG Cheng, REN Junwei, ZHANG Yan
    Journal of Computer Applications    2021, 41 (2): 407-412.   DOI: 10.11772/j.issn.1001-9081.2020050730
    Abstract353)      PDF (979KB)(619)       Save
    With the rapid increase in the number of patent applications, the demand for automatic classification of patent text is increasing. Most of the existing patent text classification algorithms utilize methods such as Word2vec and Global Vectors (GloVe) to obtain the word vector representation of the text, while a lot of word position information is abandoned and the complete semantics of the text cannot be expressed. In order to solve these problems, a multilevel patent text classification model named ALBERT-BiGRU was proposed by combining ALBERT (A Lite BERT) and BiGRU (Bidirectional Gated Recurrent Unit). In this model, dynamic word vector pre-trained by ALBERT was used to replace the static word vector trained by traditional methods like Word2vec, so as to improve the representation ability of the word vector. Then, the BiGRU neural network model was used for training, which preserved the semantic association between long-distance words in the patent text to the greatest extent. In the effective verification on the patent text dataset published by State Information Center, compared with Word2vec-BiGRU and GloVe-BiGRU, the accuracy of ALBERT-BiGRU was increased by 9.1 percentage points and 10.9 percentage points respectively at the department level of patent text, and was increased by 9.5 percentage points and 11.2 percentage points respectively at the big class level. Experimental results show that ALBERT-BiGRU can effectively improve the classification effect of patent texts of different levels.
    Reference | Related Articles | Metrics
    Fixed word-aligned partition compression algorithm of inverted list based on directed acyclic graph
    JIANG Kun, LIU Zheng, ZHU Lei, LI Xiaoxing
    Journal of Computer Applications    2021, 41 (3): 727-732.   DOI: 10.11772/j.issn.1001-9081.2020060874
    Abstract243)      PDF (905KB)(335)       Save
    In Fixed Word-Aligned (FWA) inverted index compression algorithms of Web search engines, due to the "greedy" block partition strategy of the inverted list and the interleaved storage of the codeword information, it is difficult for the algorithm to achieve the optimal compression performance. To solve the above problem, an FWA partition compression algorithm based on Directed Acyclic Graph (DAG) was proposed. Firstly, considering the small integer information in the inverted list brought by the clustering characteristics of Web pages, a novel FWA compression format with data area of 64-bit blocks was designed. In this compression format, the data area was divided into 16 storage modes suitable for continuous small integer compression through 4-bit selector area, and the selector area and data area in each block of the inverted list were stored separately, so as to ensure good batch decompression performance. Secondly, a DAG described Word-Aligned Partition (WAP) algorithm was proposed on the basis of the new compression format. In the algorithm, the inverted list block partitioning problem was regarded as a Single-Source Shortest-Path (SSSP) problem by DAG, and by considering the constraints of various storage modes of data area in FWA compression format, the structure and recursive definition of the SSSP problem were determined. Thirdly, the dynamic programming technique was used to solve the problem of SSSP and generate the pseudo-code and algorithm complexity of the optimal partition vector. The original storage modes of traditional FWA algorithms such as S9, S16 and S8b were partitioned and optimized based on DAG, and the computational complexities of the algorithms before and after optimization were compared and analyzed. Finally, the compression performance experiments were conducted with simulation integer sequence data and Text REtrieval Conference (TREC) GOV2 Web page index data. Experimental results show that, compared with the traditional FWA algorithms, the DAG based FWA partition algorithm can improve the compression ratio and decompression speed by batch decompression and partition optimization technology. At the same time, it can obtain a higher compression ratio than the traditional Frame Of Reference (FOR) algorithms for the compression of continuous small integer sequence.
    Reference | Related Articles | Metrics
    Method of dynamically constructing spatial topic R-tree based on k-means++
    ZOU Zhiwen, QIN Cheng
    Journal of Computer Applications    2021, 41 (3): 733-737.   DOI: 10.11772/j.issn.1001-9081.2020060851
    Abstract164)      PDF (769KB)(277)       Save
    The existing R-tree spatial clustering technology usually randomly designates or calculates the Euclidean distance between spatial data to select the cluster centers, without considering the topic relevance between spatial data, so that the clustering result is influenced by the initial value of k, and the association between spatial data is only based on geographic location. Aiming at this situation, a method of dynamically constructing spatial Topic R-tree (TR-tree) based on k-means++ was proposed. Firstly, in the traditional k-means++ algorithm, k clusters were dynamically determined by the clustering measure function, and Latent Dirichlet Allocation (LDA) model was introduced into the clustering measure function to calculate the topic probability of each spatial data text, as a result, the topic relevance between spatial data was strengthened. Secondly, the cluster center with the highest probability was selected through the topic probabilities. Finally, the TR-tree was constructed, and the spatial data were dynamically allocated during the construction. Experimental results show that with a slight increase of the R-tree construction time, this method has the indexing efficiency and correlation between nodes significantly improved compared to the algorithm of constructing R-tree based only on geographic location clustering.
    Reference | Related Articles | Metrics
    Comparative density peaks clustering algorithm with automatic determination of clustering center
    GUO Jia, HAN Litao, SUN Xianlong, ZHOU Lijuan
    Journal of Computer Applications    2021, 41 (3): 738-744.   DOI: 10.11772/j.issn.1001-9081.2020071071
    Abstract259)      PDF (2809KB)(411)       Save
    In order to solve the problem that the clustering centers cannot be determined automatically by Density Peaks Clustering (DPC) algorithm, and the clustering center points and the non-clustering center points are not obvious enough in the decision graph, Comparative density Peaks Clustering algorithm with Automatic determination of clustering center (ACPC) was designed. Firstly, the distance parameter was replaced by the distance comparison quantity, so that the potential clustering centers were more obvious in the decision graph. Then, the 2D interval estimation method was used to perform the automatic selection of clustering centers, so as to realize the automation of clustering process. Experimental results show that the ACPC algorithm has better clustering effect on four synthetic datasets; and the comparison of the Accuracy indicator on real datasets shows that on the dataset Iris, the clustering accuracy of ACPC can reach 94%, which is 27.3% higher than that of the traditional DPC algorithm, and the problem of selecting clustering centers interactively is solved by ACPC.
    Reference | Related Articles | Metrics
    Survey on online hashing algorithm
    GUO Yicun, CHEN Huahui
    Journal of Computer Applications    2021, 41 (4): 1106-1112.   DOI: 10.11772/j.issn.1001-9081.2020071047
    Abstract384)      PDF (1188KB)(933)       Save
    In the current large-scale data retrieval tasks, learning to hash methods can learn compact binary codes, which saves storage space and can quickly calculate the similarity in Hamming space. Therefore, for approximate nearest neighbor search, hashing methods are often used to improve the mechanism of fast nearest neighbor search. In most current hashing methods, the offline learning models are used for batch training, which cannot adapt to possible data changes appeared in the environment of large-scale streaming data, resulting in reduction of retrieval efficiency. Therefore, the adaptive hash functions were proposed and learnt in online hashing methods, which realize the continuous learning in the process of inputting data and make the methods can be applied to similarity retrieval in real-time. Firstly, the basic principles of learning to hash and the inherent requirements to realize online hashing were explained. Secondly, the different learning methods of online hashing were introduced from the perspectives such as the reading method, learning mode, and model update method of streaming data under online conditions. Thirdly, the online learning algorithms were further divided into six categories, that is, categories based on passive-aggressive algorithms, matrix factorization technology, unsupervised clustering, similarity supervision, mutual information measurement, codebook supervision respectively. And the advantages, disadvantages and characteristics of these algorithms were analyzed. Finally, the development directions of online hashing were summarized and discussed.
    Reference | Related Articles | Metrics
    Improved wavelet clustering algorithm based on peak grid
    LONG Chaoqi, JIANG Yu, XIE Yu
    Journal of Computer Applications    2021, 41 (4): 1122-1127.   DOI: 10.11772/j.issn.1001-9081.2020071042
    Abstract132)      PDF (1096KB)(477)       Save
    Aiming at the difference between the clustering effects of wavelet clustering algorithm under different grid division scales, an improved method based on peak grid was proposed. The algorithm mainly aimed at improving the detection method of connected regions in wavelet clustering. First, the spatial grids after wavelet transform were sorted according to the grid values; then, the breadth-first-search method was used to traverse each spatial grid to detect the peak connected regions in the data after wavelet transform; finally, the connected regions were marked and mapped to the original data space to obtain the clustering result. Experimental results of 8 synthetic datasets(4 convex datasets and 4 non-convex datasets) and 2 real datasets in the UCI database showed that the improved algorithm had good performance at low grid division scales, and compared with the original wavelet clustering algorithm, this algorithm had the requirement for grid division scale reduced by 25% to 60%, and the clustering time reduced by 14% under the same clustering effect.
    Reference | Related Articles | Metrics
    Reliability analysis models for replication-based storage systems with proactive fault tolerance
    LI Jing, LUO Jinfei, LI Bingchao
    Journal of Computer Applications    2021, 41 (4): 1113-1121.   DOI: 10.11772/j.issn.1001-9081.2020071067
    Abstract137)      PDF (1396KB)(349)       Save
    Proactive fault tolerance mechanism, which predicts disk failures and prompts the system to perform migration and backup for the data in danger in advance, can be used to enhance the storage system reliability. In view of the problem that the reliability of the replication-based storage systems with proactive fault tolerance cannot be evaluated by the existing research accurately, several state transition models were proposed for replication-based storage systems; then the models were implemented based on Monte Carlo simulation, so as to simulate the running of the replication-based storage systems with proactive fault tolerance; at last, the expected number of data-loss events during a period in the systems was counted. The Weibull distribution function was used to model the time distribution of device failure and failure repair events, and the impact of proactive fault tolerance mechanism, node failures, node failure repairs, disk failures and disk failure repairs on the system reliability were evaluated quantitatively. Experimental results showed that when the accuracy of the prediction model reached 50%, the reliability of the systems were able to be improved by 1-3 times, and compared with 2-way replication systems, 3-way replication systems were more sensitive to system parameters. By using the proposed models, system administrators can easily assess system reliability under different fault tolerance schemes and system parameters, and then build storage systems with high reliability and high availability.
    Reference | Related Articles | Metrics
    PageRank-based talent mining algorithm based on Web of Science
    LI Chong, WANG Yuchen, DU Weijing, HE Xiaotao, LIU Xuemin, ZHANG Shibo, LI Shuren
    Journal of Computer Applications    2021, 41 (5): 1356-1360.   DOI: 10.11772/j.issn.1001-9081.2020081206
    Abstract127)      PDF (775KB)(337)       Save
    The high-level paper is one of the symbolic achievements of excellent scientific talents. Focusing on the "Web of Science (WOS)" hot research disciplines, on the basis of constructing the Neo4j semantic network graph of academic papers and mining active scientific research communities, the PageRank-based talent mining algorithm was used to realize the mining of outstanding scientific research talents in the scientific research communities. Firstly, the existing talent mining algorithms were studied and analyzed in detail. Secondly, combined with the WOS data, the PageRank-based talent mining algorithm was optimized and implemented by adding consideration factors such as the paper publication time factor, the author's order descending model, the influence of surrounding author nodes on this node, the number of citations of the paper. Finally, experiments and verifications were carried out based on the paper data of the communities of the hot discipline computer science in the past five years. The results show that community-based mining is more targeted, and can quickly find representative excellent and potential talents in various disciplines, and the improved algorithm is more effective and objective.
    Reference | Related Articles | Metrics
    Patent clustering method based on functional effect
    MA Jianhong, CAO Wenbin, LIU Yuangang, XIA Shuang
    Journal of Computer Applications    2021, 41 (5): 1361-1366.   DOI: 10.11772/j.issn.1001-9081.2020081203
    Abstract125)      PDF (916KB)(314)       Save
    At present, patents are divided according to their domains, and cross-domain patent clustering can be realized based on the functional effect, which is of great significance in enterprise innovation design. Accurate extraction of patent functional effect and fast acquisition of optimal clustering results are the key tasks in it. Therefore, a Functional Effect Information-Joint (FEI-Joint) model combining Enhanced Language Representation with Informative Entities (ERNIE) and Convolutional Neural Network (CNN) was proposed to extract the functional effects of patent documents, and the Self-Organizing Map (SOM) algorithm was improved, so as to propose an Early Reject based Class Merge Self-Organizing Map (ERCM-SOM) to realize the patent clustering based on functional effect. FEI-Joint model was compared with Term-Frequency-Inverse-Document-Frequency (TF-IDF), Latent Dirichlet Allocation (LDA) and CNN in the clustering effect after feature extraction, and the results show that the F-measure value of the proposed model was obviously improved than those of other models. Compared with K-Means algorithm and SOM algorithm, ERCM-SOM algorithm has higher F-measure value while has significantly shorter time than that of SOM algorithm. Compared with the patent classification using International Patent Classification (IPC), the clustering method based on functional effect can achieve cross-domain patent clustering effect, which lays a foundation for designers to learn from design methods in other domains.
    Reference | Related Articles | Metrics
    Hybrid recommendation model based on heterogeneous information network
    LIN Yixing, TANG Hua
    Journal of Computer Applications    2021, 41 (5): 1348-1355.   DOI: 10.11772/j.issn.1001-9081.2020081340
    Abstract196)      PDF (1265KB)(403)       Save
    The current personalized recommendation platform has the characteristics of a wide range of data sources and many data types. With the data sparsity of the platform as an important reason for affecting the performance of the recommendation system, there are many challenges faced by the recommendation system:how to mine structured data and unstructured data of the platform to discover more features, improve the accuracy of recommendations in data-sparse scenarios, alleviate the cold start problem, and make recommendations interpretable. Therefore, for the personalized scenario of recommending Items for Users, the Heterogeneous Information Network (HIN) was used to build the association relationships between objects in the recommendation platform, and the Meta-Graph was used to describe the association paths between objects and calculate the User-Item similarity matrices under different paths; the FunkSVD matrix decomposition algorithm was adopted to calculate the implicit features of Users and Items, and for the unstructured data with text as an example, the Convolutional Neural Network (CNN) technology was used to mine the text features of the data; after splicing the features obtained by the two methods, a Factorization Machine (FM) incorporating historical average scores of Users and Items was used to predict Users' scores for Items. In the experiment, based on the public dataset Yelp, the proposed hybrid recommendation model, the single recommendation model based on Meta-Graph, the FM Recommendation model (FMR) and the FunkSVD based recommendation model were established and trained. Experimental results show that the proposed hybrid recommendation model has good validity and interpretability, and compared with the comparison models, the recommendation accuracy of this model has been greatly improved.
    Reference | Related Articles | Metrics
    Time series clustering based on new robust similarity measure
    LI Guorong, YE Jimin, ZHEN Yuanting
    Journal of Computer Applications    2021, 41 (5): 1343-1347.   DOI: 10.11772/j.issn.1001-9081.2020071142
    Abstract185)      PDF (683KB)(227)       Save
    For time series data with outliers, a Robust Generalized Cross-Correlation measure (RGCC) between time series based on robust estimation of correlation coefficient was proposed. First, a robust correlation coefficient was introduced to replace Pearson correlation coefficient to calculate the covariance matrix between time series data. Second, the determinant of the new covariance matrix was used to construct a similarity measure between two time series named RGCC. Finally, the distance matrix between the time series was calculated based on this measure, and the matrix was used as the input of the clustering algorithm to cluster the data. Time series clustering simulation experiments showed that for time series data with outliers, the clustering results based on RGCC were obviously closer to the real ones compared to the clustering results based on the original Generalized Cross-Correlation measure (GCC). It can be seen that the proposed new robust similarity measure is fully applicable to time series data with outliers.
    Reference | Related Articles | Metrics
    Adaptive affinity propagation clustering algorithm based on universal gravitation
    WANG Zhihe, CHANG Xiaoqing, DU Hui
    Journal of Computer Applications    2021, 41 (5): 1337-1342.   DOI: 10.11772/j.issn.1001-9081.2020071130
    Abstract168)      PDF (1267KB)(254)       Save
    Focused on the problem that Affinity Propagation (AP) clustering algorithm is sensitive to parameter Preference, which is not suitable for sparse data, and has the incorrectly clustered sample points in the clustering results, an algorithm named Adaptive Affinity Propagation clustering based on universal gravitation (GA-AP) was proposed. Firstly, the gravitational search mechanism was introduced into the traditional AP algorithm in order to perform the global optimization to the sample points. Secondly, on the basis of global optimization, the correctly clustered and incorrectly clustered sample points in each cluster were found through the information entropy and Adaptive Boosting (AdaBoost) algorithm, the weights of the sample points were calculated. Each sample point was updated by the corresponding weight, so that the similarity, Preference value, attractiveness and membership degree were updated, and the re-clustering was performed. The above steps were continuously operated until the maximum number of iterations was reached. Through simulation experiments on nine datasets, it can be seen that compared to Affinity Propagation clustering based on Adaptive Attribute Weighting (AFW_AP) algorithm, AP algorithm, K-means clustering (K-means) algorithm and Fuzzy C-Means (FCM) algorithm, the proposed algorithm has the average values of Purity, F-measure and Accuracy (ACC) increased by 0.69, 71.74% and 98.5% respectively at most. Experimental results show that the proposed algorithm reduces the dependence on Preference and improves the clustering effect, especially the accuracy of clustering results for sparse datasets.
    Reference | Related Articles | Metrics
    Bichromatic reverse k nearest neighbor query method based on distance-keyword similarity constraint
    ZHANG Hao, ZHU Rui, SONG Fuyao, FANG Peng, XIA Xiufeng
    Journal of Computer Applications    2021, 41 (6): 1686-1693.   DOI: 10.11772/j.issn.1001-9081.2020091453
    Abstract137)      PDF (1025KB)(196)       Save
    In order to solve the problem of low quality of results returned by spatial keyword bichromatic reverse k nearest neighbor query, a bichromatic reverse k nearest neighbor query method based on distance-keyword similarity constraint was proposed. Firstly, a threshold was set to filter out the low-quality users in the query results, so that the existence of users with relatively long spatial distance in the query results was avoided and the quality of the query results was ensured. Then, in order to support this query, an index of Keyword Multiresolution Grid rectangle-tree (KMG-tree) was proposed to manage the data. Finally, the Six-region-optimize algorithm based on Six-region algorithm was proposed to improve the query processing efficiency. The query efficiency of the Six-region-optimize algorithm was about 85.71% and 23.45% on average higher than those of the baseline and Six-region algorithms respectively. Experimental test and analysis were carried out based on real spatio-temporal data. The experimental results verify the effectiveness and high efficiency of the Six-region-optimize algorithm.
    Reference | Related Articles | Metrics
    Extended isolation forest algorithm based on random subspace
    XIE Yu, JIANG Yu, LONG Chaoqi
    Journal of Computer Applications    2021, 41 (6): 1679-1685.   DOI: 10.11772/j.issn.1001-9081.2020091436
    Abstract193)      PDF (1335KB)(362)       Save
    Aiming at the problem of excessive time overhead of the Extended Isolation Forest (EIF) algorithm, a new algorithm named Extended Isolation Forest based on Random Subspace (RS-EIF) was proposed. Firstly, multiple random subspaces were determined in the original data space. Then, in each random subspace, the extended isolated tree was constructed by calculating the intercept vector and slope of each node, and multiple extended isolated trees were integrated into a subspace extended isolation forest. Finally, the average traversal depth of data point in the extended isolation forest was calculated to determine whether the data point was abnormal. Experimental results on 9 real datasets in Outliter Detection DataSet (ODDS) and 7 synthetic datasets with multivariate distribution show that, the RS-EIF algorithm is sensitive to local anomalies and reduces the time overhead by about 60% compared with the EIF algorithm; on the ODDS datasets with many samples, its recognition accuracy is 2 percentage points to 12 percentage points higher than those of the isolation Forest (iForest) algorithm, Lightweight On-line Detection of Anomalies (LODA) algorithm and COPula-based Outlier Detection (COPOD) algorithm. The RS-EIF algorithm has the higher recognition efficiency in the dataset with a large number of samples.
    Reference | Related Articles | Metrics
    Parameter independent weighted local mean-based pseudo nearest neighbor classification algorithm
    CAI Ruiguang, ZHANG Desheng, XIAO Yanting
    Journal of Computer Applications    2021, 41 (6): 1694-1700.   DOI: 10.11772/j.issn.1001-9081.2020091370
    Abstract118)      PDF (895KB)(226)       Save
    Aiming at the problem that the Local Mean-based Pseudo Nearest Neighbor (LMPNN) algorithm is sensitive to the value of k and ignores the different influence of different attributes on the classification results, a Parameter Independent Weighted Local Mean-based Pseudo Nearest Neighbor classification (PIW-LMPNN) algorithm was proposed. Firstly, the Success-History based parameter Adaptation for Differential Evolution (SHADE) algorithm, the latest variant of differential evolution algorithm, was used to optimize the training set samples to obtain the best k value and a set of best weights related to the classes. Secondly, when calculating the distance between samples, different weights were assigned to different attributes of different classes, and the test set samples were classified. Finally, simulations were performed on 15 real datasets and the proposed algorithm was compared to other eight classification algorithms. The results show that the proposed algorithm has the classification accuracy and F1 value increased by about 28 percentage points and 23.1 percentage points respectively. At the same time, the comparision results of Wilcoxon signed-rank test, Friedman rank variance test and Hollander-Wolfe's pairwise processing show that the proposed improved algorithm outperforms the other eight classification algorithms in terms of classification accuracy and k value selection.
    Reference | Related Articles | Metrics
    Influence maximization algorithm based on user interactive representation
    ZHANG Meng, LI Weihua
    Journal of Computer Applications    2021, 41 (7): 1964-1969.   DOI: 10.11772/j.issn.1001-9081.2020081225
    Abstract116)      PDF (952KB)(180)       Save
    The problem of influence maximization is to select a group of effective seed users in social networks, through which information can reach the largest scope of spread. Traditional researches on influence maximization rely on the specific network structures and diffusion models, however, the manually processed simplified networks and the diffusion models based on assumptions have great limitations on assessing the real influence of users. To solve this problem, an Influence Maximization algorithm based on User Interactive Representation (IMUIR) was proposed. First, the context pairs were constructed through random sampling according to users' interaction traces, and the vector representations of the users were obtained by the SkipGram model training. Then, the greedy strategy was used to select the best seed set according to the activity degrees of the source users and the interaction degrees between these users with other users. To verify the effectiveness of IMUIR, experiments were conducted to compare it with Random, Average Cascade (AC), Kcore and Imfector algorithms on two social networks with real interactive information. The results show that IMUIR selects the seed set with higher quality, produces a wider scope of influence spread, and performs stablely on the two datasets.
    Reference | Related Articles | Metrics
    Deep network embedding method based on community optimization
    LI Yafang, LIANG Ye, FENG Weiwei, ZU Baokai, KANG Yujian
    Journal of Computer Applications    2021, 41 (7): 1956-1963.   DOI: 10.11772/j.issn.1001-9081.2020081193
    Abstract232)      PDF (1616KB)(318)       Save
    With the rapid development of technologies such as modern network communication and social media, the networked big data is difficult to be applied due to the lack of efficient and available node representation. Network representation learning is widely concerned by transforming high-dimensional sparse network data into low-dimensional, compact and easy-to-apply node representation. However, the existing network embedding methods obtain the low-dimensional feature vectors of nodes and then use them as the inputs for other applications (such as node classification, community discovery, link prediction and visualization) for further analysis, without building models for specific applications, which makes it difficult to achieve satisfactory results. For the specific application of network community discovery, a deep auto-encoder clustering model that combines community structure optimization for low-dimensional feature representation of nodes was proposed, namely Community-Aware Deep Network Embedding (CADNE). Firstly, based on the deep auto-encoder model, the node low-dimensional representation was learned by maintaining the topological characteristics of the local and global links of the network, and then the low-dimensional representation of the nodes was further optimized by using the network clustering structure. In this method, the low-dimensional representations of the nodes and the indicator vectors of the communities that the nodes belong to were learnt at the same time, so that the low-dimensional representation of the nodes can not only maintain the topological characteristics of the original network structure, but also maintain the clustering characteristics of the nodes. Comparing with the existing classical network embedding methods, the results show that CADNE achieves the best clustering results on Citeseer and Cora datasets, and improves the accuracy by up to 0.525 on 20NewsGroup. In classification task, CADNE performs the best on Blogcatalog and Citeseer datasets and the performance on Blogcatalog is improved by up to 0.512 with 20% training samples. In the visualization comparison, CADNE molel can get a low-dimensional representation of nodes with clearer class boundary, which verifies that the proposed method has better low-dimensional representation ability of nodes.
    Reference | Related Articles | Metrics
    Ensemble classification model for distributed drifted data streams
    YIN Chunyong, ZHANG Guojie
    Journal of Computer Applications    2021, 41 (7): 1947-1955.   DOI: 10.11772/j.issn.1001-9081.2020081277
    Abstract172)      PDF (1255KB)(179)       Save
    Aiming at the problem of low classification accuracy in big data environment, an ensemble classification model for distributed data streams was proposed. Firstly, the microcluster mode was used to reduce the amount of data transmitted from local nodes to the central nodes, so as to reduce the communication cost. Secondly, the training samples of the global classifier were generated by using the sample reconstruction algorithm. Finally, an ensemble classification model for drift data streams was proposed, which adopted the weighted combination strategy of dynamic classifiers and steady classifiers, and the mixed labeling strategy was used to label the most representative instances to update the ensemble model. Experiments on two virtual datasets and two real datasets showed that the model suffered less fluctuation from concept drift compared with two distributed mining models DS-means and BDS-ensemble, and had higher accuracy than Online Active Learning Ensemble model (OALEnsemble), with the accuracy on four datasets improved by 1.58、0.97、0.77 and 1.91 percentage points respectively. Although the memory consumption of this model was slightly higher than those of BDS-ensemble and DS-means models, this model was able to improve the classification performance at a lower memory cost. Therefore, the model is suitable for the classification of big data with distributed and mobility characteristics, such as network monitoring and banking business system.
    Reference | Related Articles | Metrics
    Impact and enhancement of similarity features on link prediction
    CAI Biao, LI Ruicen, WU Yuanyuan
    Journal of Computer Applications    2021, 41 (9): 2569-2577.   DOI: 10.11772/j.issn.1001-9081.2020111744
    Abstract116)      PDF (4634KB)(144)       Save
    Link prediction focuses on the design of prediction algorithms that can describe a given network mechanism more accurately to achieve the prediction result with higher accuracy. Based on an analysis of the existing research achievements, it is found that the similarity characteristics of a network has a great impact on the link prediction method used. In networks with low tag similarity between nodes, increasing the tag similarity is able to improve the prediction accuracy; in networks with high tag similarity between nodes, more attention should be paid to the contribution of structural information to link prediction to improve the prediction accuracy. Then, a tag-weighted similarity algorithm was proposed by weighting the tags, which was able to improve the accuracy of link prediction in networks with low similarity. Meanwhile, in networks with relatively high similarity, the structural information of the network was introduced into the node similarity calculation, and the accuracy of link prediction was improved through the preferential attachment mechanism. Experimental results on four real networks show that the proposed algorithm achieves the highest accuracy compared to the comparison algorithms Cosine Similarity between Tag Systems (CSTS), Preferential Attachment (PA), etc. According to the network similarity characteristics, using the proposed corresponding algorithm for link prediction can obtain more accurate prediction results.
    Reference | Related Articles | Metrics
    Differential privacy high-dimensional data publishing method via clustering analysis
    CHEN Hengheng, NI Zhiwei, ZHU Xuhui, JIN Yuanyuan, CHEN Qian
    Journal of Computer Applications    2021, 41 (9): 2578-2585.   DOI: 10.11772/j.issn.1001-9081.2020111786
    Abstract123)      PDF (1281KB)(131)       Save
    Aiming at the problem that the existing differential privacy high-dimensional data publishing methods are difficult to take into account both the complex attribute correlation between data and computational cost, a differential privacy high-dimensional data publishing method based on clustering analysis technology, namely PrivBC, was proposed. Firstly, the attribute clustering method was designed based on the K-means++, the maximum information coefficient was introduced to quantify the correlation between the attributes, and the data attributes with high correlation were clustered. Secondly, for each data subset obtained by the clustering, the correlation matrix was calculated to reduce the candidate space of attribute pairs, and the Bayesian network satisfying differential privacy was constructed. Finally, each attribute was sampled according to the Bayesian networks, and a new private dataset was synthesized for publishing. Compared with PrivBayes method, PrivBC method had the misclassification rate and running time reduced by 12.6% and 30.2% averagely and respectively. Experimental results show that the proposed method can significantly improve the computational efficiency with ensuring the data availability, and provides a new idea for the private publishing of high-dimensional big data.
    Reference | Related Articles | Metrics
    Parallel decompression algorithm for high-speed train monitoring data
    WANG Zhoukai, ZHANG Jiong, MA Weigang, WANG Huaijun
    Journal of Computer Applications    2021, 41 (9): 2586-2593.   DOI: 10.11772/j.issn.1001-9081.2020111173
    Abstract116)      PDF (1272KB)(153)       Save
    The real-time monitoring data generated by high-speed trains during running are usually processed by variable-length coding compression technology, which is convenient for transmission and storage. However, this method will complicate the internal structure of the compressed data, so that the corresponding data decompression process must follow the composition order of the compressed data, which is inefficient. In order to improve the decompression efficiency of high-speed train monitoring data, a parallel decompression algorithm for high-speed train monitoring data was proposed with the help of the speculation technology. Firstly, the structural characteristics of high-speed train monitoring data were studied, and the internal dependence that affects data division was analyzed. Secondly, the speculation technology was used to clean up internal dependence, and then, the data were divided into different parts tentatively. Thirdly, the division results were decompressed in a distributed computing environment in parallel. Finally, the parallel decompression results were combined together. Through this way, the decompression efficiency of high-speed train monitoring data was improved. Experimental results showed that on the computing cluster composed of 7 computing nodes, compared with the serial algorithm, the speedup of the proposed speculative parallel algorithm was about 3, showing a good performance of this algorithm. It can be seen that this algorithm can improve the monitoring data decompression efficiency significantly.
    Reference | Related Articles | Metrics
2022 Vol.42 No.11

Current Issue
Archive
Honorary Editor-in-Chief: ZHANG Jingzhong
Editor-in-Chief: XU Zongben
Associate Editor: SHEN Hengtao XIA Zhaohui
Domestic Post Distribution Code: 62-110
Foreign Distribution Code: M4616
Address:
No. 9, 4th Section of South Renmin Road, Chengdu 610041, China
Tel: 028-85224283-803
  028-85222239-803
Website: www.joca.cn
E-mail: bjb@joca.cn
WeChat
Join CCF