The 37th CCF National Database Conference (NDBC 2020)

Select

Two-stage file compaction framework by log-structured merge-tree for time series data

ZHANG Lingzhe, HUANG Xiangdong, QIAO Jialin, GOU Wangminhao, WANG Jianmin

Journal of Computer Applications 2021, 41 (3): 618-622. DOI: 10.11772/j.issn.1001-9081.2020122053

Abstract （487）

PDF （793KB）（896）

Save

When the Log-Structured Merge-tree (LSM-tree) in the time series database is under high write load or resource constraints, file compaction not in time will cause a large accumulation of LSM C ₀ layer data, resulting in an increase in the latency of ad hoc queries of recently written data. To address this problem, a two-stage LSM compaction framework was proposed that realizes low-latency query of newly written time series data on the basis of maintaining efficient query for large blocks of data. Firstly, the file compaction process was divided into two stages:quickly merging of a small number of out-of-order files, merging of a large number of small files, then multiple file compaction strategies were provided in each stage, finally the two-stage compaction resource allocation was performed according to the query load of the system. By implementing the test of the traditional LSM compaction strategy and the two-stage LSM compaction framework on the time series database Apache IoTDB, the results showed that compared with the traditional LSM, the two-stage file compaction module was able to greatly reduce the number of ad hoc query reads while improving the flexibility of the strategy, and made the historical data analysis and query performance improved by about 20%. Experimental results show that the two-stage LSM compaction framework can increase the ad hoc query efficiency of recently written data, and can improve the performance of historical data analysis and query as well as the flexibility of compaction strategy.

Reference | Related Articles | Metrics

Select

Implementation method of lightweight distributed index based on log structured merge-tree

CUI Shuangshuang, WANG Hongzhi

Journal of Computer Applications 2021, 41 (3): 630-635. DOI: 10.11772/j.issn.1001-9081.2020091543

Abstract （456）

PDF （896KB）（695）

Save

To solve the problem that the existing distributed database based on Log Structured Merge-Tree (LSM-Tree) only supports efficient primary key query and cannot allow users to quickly apply it in their own clusters, a light-weight distributed index implementation method based on LSM-Tree, called SIBL (Secondary Index Based LSM-Tree), was proposed. Firstly, the query efficiency of the non-primary key attributes was improved by indexing the primary key attribute columns. Then, a distributed index construction algorithm and an index interval division algorithm based on equidistant sampling were proposed to ensure the even distribution of indexes in the system. And the query algorithm of the traditional index was optimized, and the index file was regarded as a special data file and stored in the system in a distributed manner, ensuring the load balance and scalability of the system. Finally, experiments of the proposed method with Huawei's secondary index scheme HIndex were carried out on the HBase database to compare performances such as time and space overhead of index construction, index query performance and system load balance, verifying that the proposed method improves query performance by 50 to 200 times.

Reference | Related Articles | Metrics

Select

Ensemble classification algorithm for imbalanced time series

CAO Yang, YAN Qiuyan, WU Xin

Journal of Computer Applications 2021, 41 (3): 651-656. DOI: 10.11772/j.issn.1001-9081.2020091493

Abstract （397）

PDF （925KB）（512）

Save

Aiming at the problem that the existing ensemble classification methods have poor learning ability for unbalanced time series data, the idea of optimizing component algorithm performance and integration strategy was adopted, and based on the heterogeneous ensemble method Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE), an ensemble classification algorithm IMHIVE-COTE (Imbalanced Hierarchical Vote Collective of Transformation-based Ensembles) for unbalanced time series was proposed. The algorithm mainly contains two improvements:first, a new unbalanced classification component SBST-HESCA (SMOM ( K-NN-based Synthetic Minority Oversampling algorithm for Multiclass imbalance problems) & Boosting into ST-HESCA (Shapelet Transformation-Heterogeneous Ensembles of Standard Classification Algorithm) algorithm) was added, the idea of boosting combined with resampling was introduced, and the sample weights were updated through cross-validation prediction results, so as to make the re-sampling process of the dataset more conducive to improving the classification quality of minority samples; second, the HIVE-COTE calculation framework was improved by combining the SBST-HESCA component, and the weight of the component algorithm was optimized, so that the unbalanced time series classification algorithm had higher voting weight to the classification result, as a result, the overall classification quality of the ensemble algorithm was further improved. The experimental part verified and analyzed the performance of IMHIVE-COTE:compared with the comparison methods, IMHIVE-COTE had the highest overall classification evaluation, and the best, the best and third overall classification evaluation on three unbalanced classification indexes. It is proved that IMHIVE-COTE's ability to solve the problem of unbalanced time series classification is significantly better.

Reference | Related Articles | Metrics

Select

Database star-join optimization for multicore CPU and GPU platforms

LIU Zhuan, HAN Ruichen, ZHANG Yansong, CHEN Yueguo, ZHANG Yu

Journal of Computer Applications 2021, 41 (3): 611-617. DOI: 10.11772/j.issn.1001-9081.2020091430

Abstract （595）

PDF （1026KB）（832）

Save

Focusing on the high execution cost of star-join between the fact table and multiple dimension tables in On-line Analytical Processing (OLAP), a star-join optimization technique was proposed for advanced multicore CPU (Central Processing Unit) and GPU (Graphics Processing Unit). Firstly, the vector index based vectorized star-join algorithm on CPU and GPU platforms was proposed for the intermediate materialization cost problem in star-join in multicore CPU and GPU platforms. Secondly, the star-join operation based on vector granularity was presented according to the vector division for CPU cache size and GPU shared memory size, so as to optimize the vector index materialization cost in star-join. Finally, the compressed vector index based star-join algorithm was proposed to compress the fixed-length vector index to the variable-length binary vector index, so as to improve the storage access efficiency of the vector index in cache under low selection rate. Experimental results show that the vectorized star-join algorithm achieves more than 40% performance improvement compared to the traditional row-wise or column-wise star-join algorithms on multicore CPU platform, and the vectorized star-join algorithm achieves more than 15% performance improvement compared to the conventional star-join algorithms on GPU platform; in the comparison with the mainstream main-memory databases and GPU databases, the optimized star-join algorithm achieves 130% performance improvement compared to the optimal main-memory database Hyper, and achieves 80% performance improvement compared to the optimal GPU database OmniSci. It can be seen that the vector index based star-join optimization technique effectively improves the multiple table join performance, and compared with the traditional optimization techniques, the vector index based vectorized processing improves the data storage access efficiency in small cache, and the compressed vector further improves the vector index access efficiency in cache.

Reference | Related Articles | Metrics

Select

NVM-LH: non-volatile memory-friendly linear hash index

TANG Chen, HUANG Guorui, JIN Peiquan

Journal of Computer Applications 2021, 41 (3): 623-629. DOI: 10.11772/j.issn.1001-9081.2020091451

Abstract （387）

PDF （1035KB）（614）

Save

Non-Volatile Memory (NVM) attracts people's attention because of its large capacity, persistence, bit addressability and low read latency. However, it also has some disadvantages, such as limited writes and asymmetric reading and writing speed. When the traditional linear hash index is implemented directly on NVM, it will lead to a great number of random write operations. To solve this problem, a new NVM-friendly linear hash index called NVM-LH (NVM-oriented Linear Hashing) was proposed. The cache friendliness was achieved by NVM-LH through the cache line alignment during storing data. And a log-free data consistency guaranteeing strategy was presented in NVM-LH. In addition, the split and delete operations were optimized in NVM-LH to minimize the NVM write operations. Experimental results show that NVM-LH outperforms the state-of-the-art NVM-aware hash index CCEH (Cacheline-Conscious Extendible Hashing) in terms of space utilization (30% higher) and NVM write number (about 15% lower), showing better NVM-friendliness.

Reference | Related Articles | Metrics

Select

Dynamic workload matching method for automatic parameter tuning

SHEN Chen, TAI Lingxiang, PENG Yuwei

Journal of Computer Applications 2021, 41 (3): 657-661. DOI: 10.11772/j.issn.1001-9081.2020091424

Abstract （355）

PDF （867KB）（512）

Save

A dynamic workload description method and a dynamic workload matching algorithm were proposed to improve the accuracy of workload matching in automatic parameter tuning systems, such as OtterTune, with static workload description. First, a dynamic workload description method was proposed to describe the workload changes more accurately. Then, for the problems such as irregular sequences in workload matching and that the Euclidean distance algorithm is no longer applicable, a dynamic workload matching algorithm using data alignment was proposed based on the Dynamic Time Warping (DTW) algorithm. Finally, the proposed methods were applied to OtterTune to form a dynamic workload-based tuning tool D-OtterTune (Dynamic workload based OtterTune), and several experiments were conducted on it. Experimental results showed that, compared with the original method OtterTune, with the stable improvement in the workload matching accuracy for automatic parameter tuning brought by the dynamic workload matching method, D-OtterTune had the accuracy increased by 3%. It can be seen that D-OtterTune can have a significant impact on the overall business performance in data-intensive applications.

Reference | Related Articles | Metrics

Select

AAC-Hunter: efficient algorithm for discovering aggregation algebraic constraints in relational databases

ZHANG Xiaowei, JIANG Dawei, CHEN Ke, CHEN Gang

Journal of Computer Applications 2021, 41 (3): 636-642. DOI: 10.11772/j.issn.1001-9081.2020091473

Abstract （323）

PDF （1077KB）（581）

Save

In order to better maintain the data integrity and help auditors find anomalous reimbursement records in relational databases, the algorithm AAC-Hunter (Aggregation Algebraic Constraints Hunter), which discovered Aggregation Algebraic Constraints (AACs) automatically, was proposed. An AAC is a fuzzy constraint defined between the aggregation results of two columns in the database and acts on most but not all records. Firstly, joining, grouping and algebraic expressions were enumerated to generate candidate AACs. Secondly, the value range sets of these candidate AACs were calculated. Finally, the AAC results were output. However, this method was not able to face the performance challenges caused by massive data, so that a set of heuristic rules were applied to decrease the size of candidate constraint space and the optimization strategies based on intermediate results reuse and trivial candidate AACs elimination were employed to speed up the value range set calculation for candidate AACs. Experimental results on TPC-H and European Soccer datasets show that AAC-Hunter reduces the constraint discovery space by 95.68% and 99.94% respectively, and shortens running time by 96.58% and 92.51% respectively, compared with the baseline algorithm without heuristic rules or optimization strategies. As the effectiveness of AAC-Hunter is verified, it can be seen that AAC-Hunter can improve the efficiency and capability of auditing application.

Reference | Related Articles | Metrics

Select

Remote sensing time-series images classification algorithm with abnormal data

REN Yuanyuan, WANG Chuanjian

Journal of Computer Applications 2021, 41 (3): 662-668. DOI: 10.11772/j.issn.1001-9081.2020091425

Abstract （366）

PDF （1226KB）（902）

Save

Concerning the problem of convolutional neural network having poor classification performance to time-series remote sensing images with abnormal data, an end-to-end network based on the integration of multi-mode and multi-single-mode architecture was introduced. Firstly, multi-scale features of the multi-dimensional time-series were extracted by the multivariate time-series model and the univariate time-series model. Then, the spatio-temporal sequence feature construction was completed by automatic coding based on the pixel spatial coordinate information. Finally, the classification was implemented by fully connected layer and the softmax function. In the case of data anomaly (data loss and data distortion), the proposed algorithm was compared with commonly used time-series remote sensing image classification algorithms such as 1D Convolutional Neural Network (1D-CNN), Multi-Channels Deep Neural Network (MCDNN), Time Series Convolutional Neural Networks (TSCNN) and Long Short-Term Memory (LSTM) network. Experimental results showed that the proposed network using the end-to-end multi-mode and multi-single-mode architecture fusion had the highest classification accuracy in the case of data anomaly, and the F1 value reached 93.40%.

Reference | Related Articles | Metrics

Select

Personalized privacy protection for spatio-temporal data

LIU Xiangyu, XIA Guoping, XIA Xiufeng, ZONG Chuanyu, ZHU Rui, LI Jiajia

Journal of Computer Applications 2021, 41 (3): 643-650. DOI: 10.11772/j.issn.1001-9081.2020091463

Abstract （442）

PDF （1280KB）（833）

Save

Due to the popularity of smart mobile terminals, sensitive information such as personal location privacy, check-in data privacy and trajectory privacy in the collected spatio-temporal data are easy to be leaked. In the current researches, protection technologies are proposed for the above privacy leakages respectively, and there is not a personalized spatio-temporal data privacy protection method to prevent the above privacy leakages for users. Therefore, a personalized privacy protection model for spatio-temporal data named ( p, q, ε)-anonymity and a Personalized Privacy Protection for Spatio-Temporal Data (PPP _ST) algorithm based on this model were proposed to protect the users' privacy data with personalized settings (location privacy, check-in data privacy and trajectory privacy). The heuristic rules were designed to generalize the spatio-temporal data to ensure the availability of the published data and realize the high availability of spatio-temporal data. In the comparison experiments, the data availability rate of PPP _ST algorithm is about 4.66% and 15.45% higher than those of Information Data Used through K-anonymity (IDU-K) and Personalized Clique Cloak (PCC) algorithms on average respectively. At the same time, the generalized location search technology was designed to improve the execution efficiency of the algorithm. Experiments and analysis were conducted based on real spatio-temporal data. Experimental results show that PPP _ST algorithm can effectively protect the privacy of personalized spatio-temporal data.

Reference | Related Articles | Metrics

Project Articles