Clustering results of the pSCAN (pruned Structural Clustering Algorithm for Network) algorithm are influenced by two parameters: the density constraint and the similarity threshold. When the clustering results obtained with user-supplied parameters fail to meet the user's requirements, the user can instead express the clustering requirements through instance clusters. Aiming at the problem of expressing clustering query requirements through instance clusters, an instance cluster-driven structural graph clustering parameter calculation algorithm, PART, and its improved version, ImPART, were proposed. Firstly, the influences of the two clustering parameters on the clustering results were analyzed, and the correlation subgraph of the instance cluster was extracted. Secondly, the feasible interval of the density constraint parameter was obtained by analyzing the correlation subgraph, and the nodes in the instance cluster were divided into core nodes and non-core nodes according to the current density constraint parameter and the structural similarity between nodes. Finally, according to the node division result, the optimal similarity threshold corresponding to the current density constraint parameter was calculated, and the obtained parameters were validated and optimized on the correlation subgraph until clustering parameters satisfying the requirements of the instance cluster were obtained. Experimental results on real datasets show that the proposed algorithms return a set of effective parameters for a user-given instance cluster, and that ImPART is more than 20% faster than the basic PART algorithm while returning the optimal clustering parameters quickly and effectively.
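As a minimal sketch of the structural similarity and core-node test that pSCAN-style clustering builds on (the graph representation and helper names here are illustrative, not the paper's):

import math

def structural_similarity(adj, u, v):
    """Cosine-style similarity over the closed neighborhoods N[u], N[v]."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def is_core(adj, u, eps, mu):
    """u is a core node if at least mu of N[u] are eps-similar to u."""
    return sum(structural_similarity(adj, u, v) >= eps
               for v in adj[u] | {u}) >= mu

# Toy graph as adjacency sets: a triangle with one pendant node.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(is_core(adj, 1, eps=0.6, mu=3))  # True inside the dense triangle

Raising eps or mu shrinks the set of core nodes, which is exactly why a feasible parameter interval can be derived from the instance cluster.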
Short-term traffic flow prediction depends not only on historical data but also on the traffic of adjacent areas. Since the trend and spatial correlation of traffic flow are ignored by traditional Time Series Decomposition (TSD) models, a time series processing model combining Time Series Decomposition and Spatio-Temporal features (TSD-ST) was proposed. Firstly, the trend component and the periodic component were obtained by using Empirical Mode Decomposition (EMD) and the Discrete Fourier Transform (DFT), the Spatio-Temporal (ST) correlation of the fluctuation component was mined by the Mutual Information (MI) algorithm, and the state vector was reconstructed on this basis. Then, the fluctuation component was predicted from the state vector by a Long Short-Term Memory (LSTM) network. Finally, the final predicted value was obtained by combining the prediction results of the three components. The validity of the model was verified on real data from Interstate I-90 in Washington State, USA. Experimental results show that the Root Mean Square Error (RMSE) of the proposed TSD-ST-LSTM model is reduced by 16.5%, 34.0% and 36.6% compared with that of Support Vector Regression (SVR), Gradient Boosting Regression Tree (GBRT) and LSTM, respectively. It can be seen that the proposed model is effective in improving prediction accuracy.
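A hedged sketch of the decomposition step, assuming the PyEMD package (EMD-signal) for EMD and numpy's FFT for the periodic component; the number of retained frequencies and all names are illustrative, not the paper's configuration:

import numpy as np
from PyEMD import EMD  # assumed stand-in for the paper's EMD step

def decompose(series, n_periodic=3):
    """Split a flow series into trend, periodic and fluctuation parts."""
    emd = EMD()
    emd.emd(series)
    _, residue = emd.get_imfs_and_residue()
    trend = residue                        # EMD residue ~ long-term trend
    detrended = series - trend
    spec = np.fft.rfft(detrended)
    keep = np.argsort(np.abs(spec))[-n_periodic:]   # strongest frequencies
    mask = np.zeros_like(spec)
    mask[keep] = spec[keep]
    periodic = np.fft.irfft(mask, n=len(series))    # periodic component
    fluctuation = detrended - periodic     # later fed to the LSTM
    return trend, periodic, fluctuation

t = np.linspace(0, 8 * np.pi, 512)
flow = 0.05 * t + np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(512)
trend, periodic, fluct = decompose(flow)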
The identification of key nodes in complex networks plays an important role in optimizing network structure and propagating information effectively. Local structural Entropy (LE) identifies key nodes by using the influence of the local network on the whole network instead of the influence of individual nodes on the whole network. However, LE does not consider highly clustered networks or nodes that form loops with their neighbors, which leads to some limitations. To address these limitations, firstly, an improved LE based node importance evaluation method, PLE (Penalized Local structural Entropy), was proposed, in which the Clustering Coefficient (CC) was introduced into LE as a penalty term to appropriately penalize highly clustered nodes in the network. Secondly, since PLE over-penalizes nodes in triadic closure structures, an improved version of PLE, PLEA (Penalized Local structural Entropy Advancement), was proposed, in which a control coefficient was placed in front of the penalty term to control the penalty strength. Selective attack experiments on five real networks of different sizes were conducted. Experimental results show that, on the Western US power grid and the US Airlines network, the identification accuracy of PLEA is improved by 26.3% and 3.2% compared with LE, by 380% and 5.43% compared with the K-Shell (KS) method, and by 14.4% and 24% compared with the DCL (Degree and Clustering coefficient and Location) method, respectively. The key nodes identified by PLEA cause more damage to the network, verifying the rationality of introducing the CC as a penalty term as well as the effectiveness and superiority of PLEA. By integrating the number of neighbors and the local network structure of nodes while remaining simple to compute, PLEA is more effective in describing the reliability and invulnerability of large-scale networks.
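An illustrative networkx sketch of a penalized local structural entropy score; the exact PLEA formula is defined in the paper, so the combination LE * (1 - lam * C) below is only an assumed form showing how the clustering coefficient can act as a penalty term with a control coefficient lam:

import math
import networkx as nx

def local_structural_entropy(G, i):
    """Degree-distribution entropy over the closed neighborhood of i."""
    nodes = list(G.neighbors(i)) + [i]
    total = sum(G.degree(j) for j in nodes)
    probs = [G.degree(j) / total for j in nodes]
    return -sum(p * math.log(p) for p in probs if p > 0)

def plea_score(G, i, lam=0.5):
    """Penalize highly clustered (triadic) neighborhoods; assumed form."""
    return local_structural_entropy(G, i) * (1 - lam * nx.clustering(G, i))

G = nx.karate_club_graph()
ranking = sorted(G.nodes, key=lambda i: plea_score(G, i), reverse=True)
print(ranking[:5])  # candidate key nodes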
Most existing multi-view clustering algorithms assume a linear relationship between multi-view data points and fail to preserve the locality of the original feature space during learning. At the same time, merging subspaces in Euclidean space is too rigid to align the learned subspace representations. To solve these problems, a multi-view clustering algorithm via subspace merging on the Grassmann manifold was proposed. Firstly, the kernel trick and the learning of the local manifold structure were combined to obtain the subspace representations of different views. Then, these subspace representations were merged on the Grassmann manifold to obtain a consensus affinity matrix. Finally, spectral clustering was performed on the consensus affinity matrix to obtain the final clustering result, and the Alternating Direction Method of Multipliers (ADMM) was used to optimize the proposed model. Compared with the Kernel Multi-view Low-Rank Sparse Subspace Clustering (KMLRSSC) algorithm, the proposed algorithm improves the clustering accuracy by 20.83, 9.47 and 7.33 percentage points on the MSRCV1, Prokaryotic and Notting-Hill datasets, respectively. Experimental results verify the effectiveness and good performance of the proposed multi-view clustering algorithm via subspace merging on the Grassmann manifold.
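A hedged sketch of the merging idea: each view's subspace is represented by its projection matrix, and the projections are averaged (the extrinsic, chordal mean on the Grassmann manifold) to form a consensus affinity that is then clustered spectrally. The toy affinities and the choice of k are assumptions; the paper's full ADMM formulation is not reproduced:

import numpy as np
from sklearn.cluster import SpectralClustering

def view_subspace(affinity, k):
    """Top-k eigenvectors of a view's affinity: a point on Gr(n, k)."""
    _, vecs = np.linalg.eigh(affinity)
    return vecs[:, -k:]                    # n x k orthonormal basis

def merge_on_grassmann(affinities, k):
    """Average the projection matrices U U^T across views (chordal mean)."""
    proj = [view_subspace(A, k) @ view_subspace(A, k).T for A in affinities]
    return sum(proj) / len(proj)           # consensus affinity

rng = np.random.default_rng(0)
views = [np.abs(rng.standard_normal((60, 60))) for _ in range(3)]
views = [(A + A.T) / 2 for A in views]     # symmetrize the toy affinities
W = np.abs(merge_on_grassmann(views, k=4))
labels = SpectralClustering(n_clusters=4, affinity='precomputed',
                            random_state=0).fit_predict(W)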
In automated industrial scenarios, the amount of time series log data generated by large numbers of industrial devices has exploded, and the demand for access to time series data in business scenarios has further increased. Although HBase, a distributed column-family database, can store industrial time series big data, existing strategies cannot satisfy the specific access requirements of industrial time series data well, because the correlation between data and the access behavior characteristics of specific business scenarios are not considered. To address this problem, a distributed storage performance optimization strategy for massive industrial time series data was proposed on the basis of the distributed storage system HBase, exploiting the correlation between data and access behavior characteristics in industrial scenarios. Aiming at the load skew caused by the characteristics of industrial time series data, a load balancing optimization strategy based on hot/cold data partitioning and access behavior classification was proposed: the data were classified into cold and hot ones by a Logistic Regression (LR) model, and the hot data were distributed across different nodes. In addition, to further reduce cross-node communication overhead in the storage cluster and improve the query efficiency of the high-dimensional index of industrial time series data, a strategy of placing the index and its main data in the same Region was proposed: by designing the index RowKey fields and splicing rules, each index entry was stored in the same Region as its corresponding main data. Experimental results on real industrial time series data show that the load distribution skew is reduced by 28.5% and the query efficiency is improved by 27.7% after introducing the optimization strategy, demonstrating that the proposed strategy can effectively mine the access patterns of specific time series data, distribute load reasonably, reduce data access overhead, and meet the access requirements of specific time series big data.
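A hedged illustration of the RowKey splicing idea: the index row reuses the salt/prefix of its main-data row, so HBase range partitioning places both in the same Region. The field names, widths and salting rule are assumptions of this sketch, not the paper's exact layout:

import zlib

def main_rowkey(device_id: str, ts_ms: int) -> bytes:
    salt = f"{zlib.crc32(device_id.encode()) % 16:02d}"  # load-balancing salt
    return f"{salt}|{device_id}|{ts_ms:013d}".encode()

def index_rowkey(device_id: str, metric: str, value: float, ts_ms: int) -> bytes:
    salt = f"{zlib.crc32(device_id.encode()) % 16:02d}"  # same salt => same Region
    return f"{salt}|{device_id}|IDX|{metric}|{value:012.3f}|{ts_ms:013d}".encode()

print(main_rowkey("pump-07", 1700000000000))
print(index_rowkey("pump-07", "temperature", 81.5, 1700000000000))

Because both keys share the leading salt and device prefix, an index lookup and the follow-up read of the main row stay within one Region and avoid a cross-node hop.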
Focusing on anomaly detection for Unmanned Aerial Vehicle (UAV) flight data in the field of UAV airborne health monitoring, firstly, the characteristics of UAV flight data, the common types of flight data anomalies, and the corresponding requirements on anomaly detection algorithms were presented. Then, existing research on UAV flight data anomaly detection algorithms was reviewed, and the algorithms were classified into three categories: prior-knowledge based algorithms for qualitative anomaly detection, model-based algorithms for quantitative anomaly detection, and data-driven anomaly detection algorithms. The application scenarios, advantages and disadvantages of these algorithms were also analyzed. Finally, the current problems and challenges of UAV anomaly detection algorithms were summarized, and key development directions of the field were outlined, providing reference ideas for future research.
Aiming at the problems of high computational complexity and poor extensibility in mining partial periodic patterns from dynamic time series data, a partial periodic pattern mining algorithm for dynamic time series data combined with multi-scale theory, named MSI-PPPGrowth (Multi-Scale Incremental Partial Periodic Frequent Pattern), was proposed. In MSI-PPPGrowth, full use was made of the objective multi-scale characteristics of time series data, and multi-scale theory was introduced into the mining of partial periodic patterns from time series data. Firstly, both the original data after scale division and the incremental time series data were used as finer-grained benchmark scale datasets for independent mining. Then, the correlation between different scales was used to realize scale transformation, so as to indirectly obtain the global frequent patterns corresponding to the dynamically updated dataset, thereby avoiding repeated scanning of the original dataset and constant adjustment of the tree structure. Moreover, a new support count estimation model, PJK-EstimateCount, was designed based on the Kriging method and the periodicity of the time series, to effectively estimate the support counts of frequent items missing during scale transformation. Experimental results show that MSI-PPPGrowth has good scalability and real-time performance, with particularly significant performance advantages on dense datasets.
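A minimal sketch of the scale-transformation idea: per-scale support counts of a pattern are combined into a coarser-scale count, and counts missing at some scales are estimated from the observed ones. Plain linear interpolation stands in here for the paper's Kriging-based PJK-EstimateCount model:

import numpy as np

def global_support(scale_counts):
    """scale_counts: per-benchmark-scale counts of one pattern, None if missing."""
    idx = np.arange(len(scale_counts))
    known = [i for i, c in enumerate(scale_counts) if c is not None]
    vals = [scale_counts[i] for i in known]
    # Estimate missing counts from neighbouring scales (Kriging stand-in).
    filled = np.interp(idx, known, vals)
    return filled.sum()

# Pattern observed in 4 of 6 fine-scale segments; 2 counts are estimated.
print(global_support([12, None, 9, 11, None, 10]))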
Traditional recommendation algorithms pay too much attention to recommendation precision, which leads to a high recommendation rate for popular items while unpopular items receive little attention for a long time: the classic long-tail problem. In response to this problem, a Multi-objective Dimension Optimization recommendation Model (MDOM), the Adaptive Group Reranking recommendation Model (AGRM), was proposed, built on a two-dimensional weighted similarity based on Euclidean distance and incorporating adaptive group reranking. Firstly, a two-dimensional weighted similarity measure was constructed using Euclidean distance, the replacement ratio was set dynamically according to individuals' historical behavior records, and the long-tail recommendation problem was solved by a multi-objective optimization algorithm integrated with the group. Secondly, two concise objective functions were designed, taking popularity and long-tail attention into account to reduce the complexity of the objective functions. Thirdly, based on the two-dimensional weighted similarity measure, a subset of users was selected as the "best recommended user group", and the Pareto optimal solution was calculated. Experimental results on the MovieLens 1M and Yahoo datasets show that AGRM achieves the best coverage, with average improvements of 4.11 and 25.38 percentage points respectively over the Item-based Collaborative Filtering (ItemCF) algorithm, and of 8.38 and 33.19 percentage points respectively over the Deep Variational Autoencoder with Shallow Parallel Path for Top-N Recommendation (VASP) model. On the Yahoo dataset, the average popularity of AGRM's recommendations is the lowest, indicating that AGRM can recommend more long-tail items.
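An illustrative two-dimensional weighted similarity in the spirit of AGRM, combining a rating-vector distance and a popularity distance with a weight alpha; the exact dimensions and weighting used in the paper may differ:

import numpy as np

def weighted_similarity(u, v, pop_u, pop_v, alpha=0.7):
    """Similarity from Euclidean distances over two feature dimensions."""
    d_rating = np.linalg.norm(u - v)       # rating-vector distance
    d_pop = abs(pop_u - pop_v)             # popularity-profile distance
    dist = alpha * d_rating + (1 - alpha) * d_pop
    return 1.0 / (1.0 + dist)              # map distance to (0, 1]

u = np.array([4.0, 3.5, 5.0])
v = np.array([4.5, 3.0, 4.0])
print(weighted_similarity(u, v, pop_u=0.2, pop_v=0.6))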
Existing algorithms for generating materialized view asynchronous incremental maintenance tasks under Hybrid Transaction/Analytical Processing (HTAP) are mainly designed for batches of records and cannot generate such tasks for a single record, which increases disk IO overhead and degrades the performance of asynchronous incremental maintenance of materialized views under HTAP. Therefore, a single-record oriented method for generating materialized view asynchronous incremental maintenance tasks under HTAP was proposed. Firstly, a benefit model for single-record maintenance task generation under HTAP was established. Then, a task generation algorithm was designed on the basis of Q-learning. Experimental results show that the proposed algorithm realizes materialized view asynchronous incremental maintenance task generation under HTAP for a single record, and reduces the average IOPS (Input/output Operations Per Second) by at least 8.49 times, and the average CPU utilization (2-core) and average CPU utilization (4-core) by at least 1.85 and 0.97 percentage points respectively.
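A hedged Q-learning sketch of the task-generation decision: given a coarse state (pending single-record changes, system load), the agent either emits a maintenance task now or defers. The states, actions and reward shaping here are illustrative; the paper defines its own benefit model:

import random
from collections import defaultdict

ACTIONS = ("defer", "generate_task")
Q = defaultdict(float)

def choose(state, eps=0.1):
    """Epsilon-greedy action selection over the two maintenance actions."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard Q-learning temporal-difference update."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Toy episode: reward trades maintenance IO cost against view staleness.
state = (3, "low")                    # (pending changes, system load)
a = choose(state)
reward = -1.0 if a == "generate_task" else -0.3 * state[0]
update(state, a, reward, (0, "low") if a == "generate_task" else (4, "low"))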
Focusing on the challenge of mining complementary information across different levels of features in deep subspace clustering, a Diversity Represented Deep Subspace Clustering (DRDSC) algorithm was proposed on the basis of a deep autoencoder, by exploring the complementary information between the low-level and high-level features produced by the encoder. Firstly, based on the Hilbert-Schmidt Independence Criterion (HSIC), a diversity representation measurement model was established for features at different levels. Secondly, a feature diversity representation module was introduced into the deep autoencoder network structure to explore image features that enhance the clustering effect. Furthermore, the loss function was updated to effectively fuse the underlying subspaces of the multi-level representations. Finally, experiments were conducted on commonly used clustering datasets. Experimental results show that on the Extended Yale B, ORL, COIL20 and Umist datasets, the clustering error rates of DRDSC reach 1.23%, 10.50%, 1.74% and 17.71%, respectively, which are 10.41, 16.75, 13.12 and 12.92 percentage points lower than those of Efficient Dense Subspace Clustering (EDSC), and 1.44, 3.50, 3.68 and 9.17 percentage points lower than those of Deep Subspace Clustering (DSC), indicating that the proposed DRDSC algorithm has a better clustering effect.
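A minimal numpy sketch of the biased HSIC estimator with linear kernels, which can serve as a dependence/diversity measure between two feature levels; the kernel choice and normalization are illustrative:

import numpy as np

def hsic(X, Y):
    """Biased HSIC estimate between feature matrices X, Y of shape (n, d)."""
    n = X.shape[0]
    K = X @ X.T                          # linear kernel on low-level features
    L = Y @ Y.T                          # linear kernel on high-level features
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
low = rng.standard_normal((32, 8))                 # low-level encoder features
high = np.tanh(low @ rng.standard_normal((8, 8)))  # dependent high-level features
print(hsic(low, high), hsic(low, rng.standard_normal((32, 8))))

The dependent pair yields a markedly larger HSIC value than the independent pair, which is what lets HSIC quantify (and, when minimized, encourage diversity between) feature levels.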
Imbalanced classification is one of the research hotspots in machine learning, where oversampling rebalances a dataset by increasing minority samples through repeated extraction or artificial synthesis. However, most existing oversampling methods are based on the original data distribution and can hardly reveal further distribution characteristics of the dataset. To address this problem, firstly, an improved semi-supervised clustering algorithm was proposed to mine the data distribution characteristics. Secondly, based on the semi-supervised clustering results, highly confident unlabeled data (pseudo-labeled samples) were selected from minority-class clusters and added to the original training set. In this way, besides rebalancing the dataset, the distribution characteristics obtained by semi-supervised clustering could be used to assist imbalanced classification. Finally, the results of semi-supervised clustering and classification were fused to predict the final labels, further improving the imbalanced classification performance. With G-mean and Area Under Curve (AUC) as evaluation indicators, the proposed algorithm was compared with seven oversampling-/undersampling-based imbalanced classification algorithms, such as TU (Trainable Undersampling) and CDSMOTE (Class Decomposition Synthetic Minority Oversampling TEchnique), on 10 public datasets. Experimental results show that, compared with TU and CDSMOTE, the proposed algorithm increases the average AUC by 6.7% and 3.9% respectively, and improves the average G-mean by 7.6% and 2.1% respectively; it also achieves the highest average results on both evaluation indicators among all the compared algorithms. It can be seen that the proposed algorithm can effectively improve imbalanced classification performance.
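A hedged sketch of the pseudo-labelling step: unlabeled points close to the centre of a minority-dominated cluster are added to the training set with the minority label. Plain KMeans stands in for the paper's improved semi-supervised clustering, and the confidence rule (closest q fraction) is an assumption:

import numpy as np
from sklearn.cluster import KMeans

def pseudo_label_minority(X_lab, y_lab, X_unlab, minority=1, k=4, q=0.3):
    """Return indices of unlabeled points to add with the minority label."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
        np.vstack([X_lab, X_unlab]))
    lab_c, unlab_c = km.labels_[:len(X_lab)], km.labels_[len(X_lab):]
    picked = []
    for c in range(k):
        members = y_lab[lab_c == c]
        if len(members) and (members == minority).mean() > 0.5:
            d = np.linalg.norm(X_unlab - km.cluster_centers_[c], axis=1)
            cand = np.where(unlab_c == c)[0]
            cand = cand[np.argsort(d[cand])][:max(1, int(q * len(cand)))]
            picked.extend(cand.tolist())   # closest points = most confident
    return np.array(picked, dtype=int)

rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (8, 2))])
y_lab = np.array([0] * 40 + [1] * 8)
print(pseudo_label_minority(X_lab, y_lab, rng.normal(4, 1, (20, 2))))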
Aiming at the high matching cost of existing complex event matching methods, a complex event matching algorithm, ReCEP, was proposed, which performs recursive traversal over event buffers (ordered event lists). Unlike existing methods that match with automata on the event stream, ReCEP decomposes the constraints in the complex event query pattern into different types and then recursively verifies them on the ordered lists. Firstly, according to the query pattern, the related event instances were cached by event type. Secondly, a query filtering operation was performed on the event instances in the ordered lists, and a recursive-traversal based algorithm was given to determine the initial event instance and obtain candidate sequences. Finally, the attribute constraints of the candidate sequences were further verified. Experimental results on simulated stock transaction data show that, compared with the mainstream matching methods SASE and Siddhi, the ReCEP algorithm effectively reduces query matching time, outperforms both methods overall, and improves query matching efficiency by more than 8.64%. It can be seen that the proposed complex event matching method can effectively improve the efficiency of complex event processing.
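A simplified sketch of recursive matching over an ordered event buffer: for a SEQ(A, B, C)-style pattern, a partial match is extended with the next required event type and the search recurses. ReCEP's attribute predicates and pruning rules are reduced to the type constraint here:

def match_seq(buffer, pattern, start=0, partial=None, out=None):
    """buffer: list of (type, attrs) sorted by time; pattern: list of types."""
    partial = partial or []
    out = out if out is not None else []
    if len(partial) == len(pattern):
        out.append(list(partial))          # one complete candidate sequence
        return out
    want = pattern[len(partial)]
    for i in range(start, len(buffer)):
        etype, attrs = buffer[i]
        if etype == want:                  # type constraint (predicates omitted)
            partial.append((etype, attrs))
            match_seq(buffer, pattern, i + 1, partial, out)
            partial.pop()
    return out

events = [("A", {"p": 10}), ("B", {"p": 11}), ("A", {"p": 9}), ("C", {"p": 12})]
print(match_seq(events, ["A", "B", "C"]))  # the single valid A->B->C sequence

The candidate sequences returned here would then go through the attribute-constraint verification stage described above.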
Social recommendation aims to improve the recommendation effect of traditional recommendation algorithms by integrating social relations. Current social recommendation algorithms based on Network Embedding (NE) face two problems: first, they do not consider the inconsistency between objects when constructing the network, and they often rely on positive objects, which are difficult to obtain and subject to many constraints; second, they cannot use the number of ratings to eliminate overfitting during training. Therefore, an Adaptive Social Recommendation algorithm based on Negative Similarity (ASRNS) was proposed. Firstly, homogeneous networks with positive correlations were constructed through consistency analysis. Then, embedding vectors were obtained by combining weighted random walk with the Skip-Gram algorithm. Next, similarities were calculated, and the Matrix Factorization (MF) algorithm was constrained from the perspective of negative similarity. Finally, the number of ratings was mapped to an ideal rating range through an adaptive mechanism, and different penalties were imposed on the bias terms of the algorithm. Experiments were conducted on the FilmTrust and CiaoDVD datasets. The results show that, compared with algorithms such as Collaborative User Network Embedding (CUNE) and Consistent neighbor aggregation for Recommendation (ConsisRec), ASRNS reduces the Root Mean Square Error (RMSE) by at least 2.60% and 5.53% respectively, and the Mean Absolute Error (MAE) by at least 1.47% and 2.46% respectively. It can be seen that ASRNS not only reduces rating prediction error effectively, but also alleviates overfitting in the training process significantly, and is robust for objects with different numbers of ratings.
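A hedged sketch of one SGD step for matrix factorization with a negative-similarity constraint: besides fitting the rating, the user factor is penalized for aligning with the factors of negatively similar users. The penalty form and the weight lam are assumptions of this sketch, not ASRNS's exact objective:

import numpy as np

def sgd_step(P, Q, u, i, r, neg, lr=0.01, reg=0.02, lam=0.05):
    """One SGD update; neg is a list of (user v, similarity s < 0) pairs."""
    err = r - P[u] @ Q[i]
    grad_u = -err * Q[i] + reg * P[u]
    for v, s in neg:                       # penalize alignment with
        grad_u += lam * (-s) * P[v]        # negatively similar users
    grad_i = -err * P[u] + reg * Q[i]
    P[u] -= lr * grad_u
    Q[i] -= lr * grad_i
    return err ** 2                        # squared rating error

rng = np.random.default_rng(0)
P = rng.standard_normal((50, 8)) * 0.1     # user factors
Q = rng.standard_normal((40, 8)) * 0.1     # item factors
print(sgd_step(P, Q, u=3, i=7, r=4.0, neg=[(5, -0.4), (9, -0.2)]))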
In order to measure the similarity of multi-relational nodes and mine community structure from multiple node relationships, a community mining algorithm based on multiple node relationships, LSL-GN, was proposed. Firstly, based on node similarity and node reachability, a similarity measurement index for multi-relational nodes, LHN-ISL, was described to reconstruct a low-density model of the target network, and the community division was completed by combining it with the GN (Girvan-Newman) algorithm. LSL-GN was compared with several classical community mining algorithms in terms of Modularity (Q value), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). The results show that LSL-GN achieves the best results on all three indexes, indicating that its community division quality is better. The "User-Application" mobile roaming network model was divided by LSL-GN into community structures based on basic applications such as Ctrip, Amap and Didi Travel, and these community division results can provide strategic reference information for designing personalized package services.
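A brief sketch of the final division step: once the low-density model has been rebuilt with the LHN-ISL similarity (not reproduced here), communities can be extracted with the classic Girvan-Newman procedure, e.g. via networkx, keeping the partition with the highest modularity:

import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()                 # stand-in for the rebuilt network
best_part, best_q = None, -1.0
for communities in itertools.islice(girvan_newman(G), 10):
    part = [set(c) for c in communities]   # one level of the GN dendrogram
    q = modularity(G, part)
    if q > best_q:
        best_part, best_q = part, q
print(best_q, [sorted(c) for c in best_part])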