Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (7): 2078-2087.DOI: 10.11772/j.issn.1001-9081.2021050743
• Data science and technology •
Yiyang GUO, Jiong YU, Xusheng DU, Shaozhi YANG, Ming CAO
Received: 2021-05-10
Revised: 2021-09-08
Accepted: 2021-09-15
Online: 2021-09-08
Published: 2022-07-10
Contact: Jiong YU
About author: GUO Yiyang, born in 1996 in Tengzhou, Shandong, M.S. candidate. His research interests include machine learning and data mining.
Yiyang GUO, Jiong YU, Xusheng DU, Shaozhi YANG, Ming CAO. Outlier detection algorithm based on autoencoder and ensemble learning[J]. Journal of Computer Applications, 2022, 42(7): 2078-2087.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021050743
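Symbol | Description | Symbol | Description |
---|---|---|---|
$Z_{(m+n) \times d}$ | Dataset containing $(m+n)$ data points with $d$-dimensional features | $D_c(x_i)$ | Outlier value of $x_i$ on the $c$-th base detector |
$X_{n \times d}$ | Training set containing $n$ data points with $d$-dimensional features | $\text{outlier\_matrix}_{n \times \text{num\_detector}}$ | Outlier value matrix |
$Y_{m \times d}$ | Test set containing $m$ data points with $d$-dimensional features | $\text{Label}_{n \times 1}$ | Label outlier value matrix |
$x_i$ | The $i$-th data point in $X_{n \times d}$ | $\Omega_j$ | Local region of test point $y_j$ |
$y_j$ | The $j$-th data point in $Y_{m \times d}$ | $\text{num\_local}$ | Total number of data points in $\Omega_j$ |
$r$ | Ratio of the number of nodes in an autoencoder layer to the number of nodes in the previous layer | $q_i$ | The $i$-th data point in $\Omega_j$ |
$k$ | Number of nearest neighbors of a data object | $O_{\text{num\_local} \times \text{num\_detector}}$ | Matrix whose elements are the outlier values in $\text{outlier\_matrix}_{n \times \text{num\_detector}}$ of all data points in $\Omega_j$ |
$\text{num\_detector}$ | Number of base detectors | $Q_{\text{num\_local} \times 1}$ | Matrix whose elements are the label outlier values in $\text{Label}_{n \times 1}$ of all data points in $\Omega_j$ |
$D_{1 \times \text{num\_detector}}$ | Set of base detectors | $\delta_{c,i}$ | Detection capability of the $c$-th base detector in $D_{1 \times \text{num\_detector}}$ for $q_i$ |
$D_c$ | The $c$-th base detector in $D_{1 \times \text{num\_detector}}$ | | |
Tab. 1 Explanation of symbols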
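One way to read the symbols Ω_j, num_local, O and Q together is as a neighborhood lookup followed by row selection. The sketch below assumes that Ω_j is simply the k nearest training points to y_j under Euclidean distance; this page does not state the distance measure or whether num_local equals k, and the helper names `local_region` and `slice_region` are hypothetical.

```python
import math

def local_region(y_j, X, k):
    """Indices of the k training points nearest to y_j.

    Assumption: Euclidean distance and a plain k-nearest-neighbor region;
    the source page does not spell out how the region is built.
    """
    ranked = sorted((math.dist(y_j, x), i) for i, x in enumerate(X))
    return [i for _, i in ranked[:k]]

def slice_region(region, outlier_matrix, label):
    """Pick the rows of outlier_matrix (giving O) and Label (giving Q) for the region."""
    O = [outlier_matrix[i] for i in region]
    Q = [label[i] for i in region]
    return O, Q

# Toy data: four training points, two base detectors.
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
outlier_matrix = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.9, 0.8]]
label = [0.2, 0.2, 0.3, 0.9]

region = local_region([0.1, 0.1], X, k=3)
O, Q = slice_region(region, outlier_matrix, label)
print(region, O, Q)
```

Here O has num_local rows and num_detector columns, and Q has one entry per region member, matching the shapes given in the table.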
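Data | Base detector $D_1$ | Base detector $D_2$ | $\cdots$ | Base detector $D_{\text{num\_detector}}$ |
---|---|---|---|---|
$x_1$ | $D_1(x_1)$ | $D_2(x_1)$ | $\cdots$ | $D_{\text{num\_detector}}(x_1)$ |
$x_2$ | $D_1(x_2)$ | $D_2(x_2)$ | $\cdots$ | $D_{\text{num\_detector}}(x_2)$ |
$\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ |
$x_n$ | $D_1(x_n)$ | $D_2(x_n)$ | $\cdots$ | $D_{\text{num\_detector}}(x_n)$ |
Tab. 2 Outlier value matrix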
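The layout of this matrix (one row per training point, one column per base detector) can be sketched as follows. The two scoring functions are hypothetical stand-ins chosen only to keep the example short; the paper's base detectors are autoencoder-based and are not reproduced here.

```python
def build_outlier_matrix(X, detectors):
    """Return a matrix whose entry [i][c] is D_c(x_i)."""
    return [[detector(x) for detector in detectors] for x in X]

# Hypothetical stand-in detectors (not the paper's autoencoders).
def distance_from_origin(x):
    return sum(v * v for v in x) ** 0.5

def max_abs_feature(x):
    return max(abs(v) for v in x)

X = [[0.0, 0.0], [3.0, 4.0], [-1.0, 0.0]]
outlier_matrix = build_outlier_matrix(X, [distance_from_origin, max_abs_feature])
print(outlier_matrix)
```

Any callable that maps a data point to a score could be placed in the detector list, which is what makes the n × num_detector shape independent of the detector type.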
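Data $x_i$ | Label outlier value $Label(x_i) = \max\{D_1(x_i), D_2(x_i), \cdots, D_{\text{num\_detector}}(x_i)\}$ |
---|---|
$x_1$ | $Label(x_1)$ |
$x_2$ | $Label(x_2)$ |
$\vdots$ | $\vdots$ |
$x_n$ | $Label(x_n)$ |
Tab. 3 Outlier label value matrix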
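The label outlier value is a row-wise maximum over the outlier value matrix, which a few lines can illustrate. The function name `label_outlier_values` and the toy numbers are assumptions made for this sketch.

```python
def label_outlier_values(outlier_matrix):
    """Label(x_i) = max over all base detectors of D_c(x_i), one value per row."""
    return [max(row) for row in outlier_matrix]

# Toy outlier value matrix: three data points, three base detectors.
outlier_matrix = [
    [0.10, 0.30, 0.20],
    [0.70, 0.65, 0.90],
    [0.05, 0.05, 0.01],
]
label = label_outlier_values(outlier_matrix)
print(label)
```

Taking a maximum across detectors is only meaningful when their scores are on comparable scales; whether and how the scores are normalized is not stated on this page.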
Dataset | Number of samples | Dimensions | Number of outliers | Outlier ratio/% |
---|---|---|---|---|
Cardio | 1 831 | 21 | 176 | 9.60 |
Ionosphere | 351 | 33 | 126 | 36.00 |
Mnist | 7 603 | 100 | 700 | 9.20 |
Pendigits | 6 870 | 16 | 156 | 2.27 |
Satimage-2 | 5 803 | 36 | 71 | 1.20 |
Tab. 4 ODDS public datasets used in experiments
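As a quick arithmetic check, the outlier ratio column can be recomputed as outliers divided by samples. The sketch below copies the counts from the table; the listed percentages appear to be rounded, and the largest gap (Ionosphere) is about 0.1 percentage points.

```python
# (samples, outliers, listed ratio in %) copied from Tab. 4
datasets = {
    "Cardio": (1831, 176, 9.60),
    "Ionosphere": (351, 126, 36.00),
    "Mnist": (7603, 700, 9.20),
    "Pendigits": (6870, 156, 2.27),
    "Satimage-2": (5803, 71, 1.20),
}

def outlier_ratio(samples, outliers):
    """Outlier ratio in percent."""
    return 100.0 * outliers / samples

gaps = {}
for name, (samples, outliers, listed) in datasets.items():
    computed = outlier_ratio(samples, outliers)
    gaps[name] = abs(computed - listed)
    print(f"{name}: computed {computed:.2f}%, listed {listed:.2f}%")
```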
True label | Detected as positive | Detected as negative |
---|---|---|
Positive | TP | FN |
Negative | FP | TN |
Tab. 5 Confusion matrix of classification results
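The four cells of the confusion matrix can be counted directly from true and predicted labels. The sketch below assumes outliers are encoded as 1 (positive) and inliers as 0 (negative); precision and recall are the standard definitions built on these counts and are added here for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN), treating label 1 as the positive (outlier) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)

# Standard definitions; both assume at least one predicted and one true positive.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fn, fp, tn, precision, recall)
```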
Metric | Dataset | EAOD | HBOS | IForest | LOF | PCA | AE | FB |
---|---|---|---|---|---|---|---|---|
AUC score | Cardio | 0.965 5 | 0.814 9 | 0.946 7 | 0.657 1 | 0.940 9 | 0.884 7 | 0.896 6 |
AUC score | Ionosphere | 0.893 6 | 0.589 2 | 0.829 7 | 0.864 7 | 0.755 2 | 0.790 9 | 0.791 5 |
AUC score | Mnist | 0.914 3 | 0.583 2 | 0.802 8 | 0.728 0 | 0.849 1 | 0.825 3 | 0.824 5 |
AUC score | Pendigits | 0.950 3 | 0.916 0 | 0.941 2 | 0.501 0 | 0.850 6 | 0.865 6 | 0.895 8 |
AUC score | Satimage-2 | 0.989 6 | 0.947 7 | 0.984 1 | 0.465 3 | 0.957 3 | 0.887 1 | 0.901 3 |
AP score | Cardio | 0.694 3 | 0.425 6 | 0.565 2 | 0.217 7 | 0.584 1 | 0.602 6 | 0.613 9 |
AP score | Ionosphere | 0.795 7 | 0.380 8 | 0.758 5 | 0.793 9 | 0.680 9 | 0.714 2 | 0.671 6 |
AP score | Mnist | 0.472 5 | 0.115 6 | 0.259 6 | 0.279 3 | 0.356 3 | 0.358 6 | 0.294 6 |
AP score | Pendigits | 0.271 9 | 0.213 1 | 0.238 0 | 0.065 9 | 0.186 7 | 0.189 2 | 0.198 2 |
AP score | Satimage-2 | 0.825 6 | 0.695 6 | 0.810 2 | 0.031 5 | 0.772 3 | 0.765 5 | 0.739 3 |
Tab. 6 AUC scores and AP scores comparison of various algorithms
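For readers unfamiliar with the two metrics, the following is a minimal sketch of how AUC and AP can be computed from outlier scores and ground-truth labels. It uses the common textbook definitions (pairwise ranking for AUC, mean precision at each true outlier for AP) and assumes label 1 marks an outlier; it is not the authors' evaluation code, and the AP routine does not treat tied scores specially.

```python
def auc_score(y_true, scores):
    """Probability that a random outlier is scored above a random inlier (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each true outlier."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(y_true, scores)
ap = average_precision(y_true, scores)
print(auc, ap)
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for a toy example but would normally be replaced by a rank-based computation on datasets of the sizes used in the experiments.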