Outlier detection algorithm based on autoencoder and ensemble learning

doi:10.11772/j.issn.1001-9081.2021050743

Journal of Computer Applications, 2022, Vol. 42, Issue (7): 2078-2087. DOI: 10.11772/j.issn.1001-9081.2021050743

Yiyang GUO¹, Jiong YU¹,², Xusheng DU¹, Shaozhi YANG¹, Ming CAO³

1. College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China
2. School of Software, Xinjiang University, Urumqi, Xinjiang 830091, China
3. College of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100, China

Received: 2021-05-10 Revised: 2021-09-08 Accepted: 2021-09-15 Online: 2021-09-08 Published: 2022-07-10
Contact: Jiong YU
About the authors: GUO Yiyang, born in 1996 in Tengzhou, Shandong, M.S. candidate. His research interests include machine learning and data mining.
DU Xusheng, born in 1995 in Qingyang, Gansu, Ph.D. candidate, CCF member. His research interests include machine learning and data mining.
YANG Shaozhi, born in 1995 in Fengyang, Anhui, M.S. candidate. His research interests include machine learning and data mining.
CAO Ming, born in 1996 in Heze, Shandong, M.S. candidate. Her research interests include machine learning and data mining.
Supported by: National Natural Science Foundation of China (61862060)

Abstract

Autoencoder-based outlier detection algorithms tend to overfit on small- and medium-sized datasets, and traditional ensemble-learning-based outlier detection algorithms do not optimize or select their base detectors, which lowers detection accuracy. To address these problems, an Ensemble learning and Autoencoder-based Outlier Detection (EAOD) algorithm is proposed. Firstly, different base detectors are generated by randomly changing the connection structure of the autoencoder, and they are used to obtain the outlier values and the outlier label values of the data objects. Secondly, the Euclidean distances between data objects are computed with the nearest neighbor algorithm, and a local region is constructed around each object. Finally, based on the similarity between the outlier values and the outlier label values, the base detectors with strong detection ability in that region are selected and combined, and the combined outlier value of the object is taken as the final outlier value determined by EAOD. In the experiments, compared with the AutoEncoder (AE) algorithm, the proposed algorithm increases the Area Under the receiver operating characteristic Curve (AUC) and Average Precision (AP) scores on the Cardio dataset by 8.08 and 9.17 percentage points respectively; compared with the Feature Bagging (FB) ensemble learning algorithm, it reduces the running time on the Mnist dataset by 21.33%. The experimental results show that the proposed algorithm achieves good detection performance and real-time performance under unsupervised learning.

Key words: outlier detection, ensemble learning, AutoEncoder (AE), base detector, unsupervised learning
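The selection-and-combination stage described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `pearson` and `eaod_combine`, the use of Pearson correlation as the similarity measure, the `top` parameter, and averaging as the combination rule are all assumptions, since the abstract does not specify them. The base detector scores are taken as given matrices.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else 0.0

def eaod_combine(train_X, test_X, outlier_matrix, test_scores, k=5, top=3):
    """Hypothetical sketch of local-region detector selection.

    train_X:        (n, d) training data
    test_X:         (m, d) test data
    outlier_matrix: (n, num_detector) training scores of each base detector
    test_scores:    (m, num_detector) test scores of each base detector
    """
    # Outlier label value: maximum score over all base detectors.
    label = outlier_matrix.max(axis=1)
    final = np.empty(len(test_X))
    for j, y in enumerate(test_X):
        # Local region: the k training points nearest to y (Euclidean).
        dist = np.linalg.norm(train_X - y, axis=1)
        region = np.argsort(dist)[:k]
        O = outlier_matrix[region]   # (k, num_detector)
        Q = label[region]            # (k,)
        # Assumed similarity: correlation between each detector's local
        # scores and the local outlier label values.
        sim = np.array([pearson(O[:, c], Q) for c in range(O.shape[1])])
        best = np.argsort(sim)[::-1][:top]
        # Assumed combination rule: average the selected detectors.
        final[j] = test_scores[j, best].mean()
    return final
```

With `top` equal to the number of base detectors the sketch degenerates to plain score averaging, which makes the effect of the local selection step easy to compare against.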

CLC number: TP311.1

Cite this article: Yiyang GUO, Jiong YU, Xusheng DU, Shaozhi YANG, Ming CAO. Outlier detection algorithm based on autoencoder and ensemble learning[J]. Journal of Computer Applications, 2022, 42(7): 2078-2087.

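Figures and tables (14)

Tab. 1 Explanation of symbols

Symbol	Description	Symbol	Description
Z_{(m+n)×d}	Dataset containing (m+n) data objects with d-dimensional features	D_c(x_i)	Outlier value of x_i on the c-th base detector
X_{n×d}	Training set containing n data objects with d-dimensional features	outlier_matrix_{n×num_detector}	Outlier value matrix
Y_{m×d}	Test set containing m data objects with d-dimensional features	Label_{n×1}	Outlier label value matrix
x_i	The i-th data object in X_{n×d}	Ω_j	Local region of test point y_j
y_j	The j-th data object in Y_{m×d}	num_local	Total number of data objects in Ω_j
r	Ratio of the number of nodes in an autoencoder layer to the number of nodes in the preceding layer	q_i	The i-th data object in Ω_j
k	Number of nearest neighbors of a data object	O_{num_local×num_detector}	Outlier values in outlier_matrix_{n×num_detector} of all data objects in Ω_j
num_detector	Number of base detectors	Q_{num_local×1}	Outlier label values in Label_{n×1} of all data objects in Ω_j
D_{1×num_detector}	Set of base detectors	δ_{c,i}	Detection ability of the c-th base detector in D_{1×num_detector} on q_i
D_c	The c-th base detector in D_{1×num_detector}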

Fig. 1 Framework and flow for EAOD algorithm

Fig. 2 Autoencoder model

Fig. 3 Randomly connected autoencoder model
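One way to picture a randomly connected autoencoder is as a fully connected autoencoder whose weight matrices are multiplied by random binary masks. The sketch below assumes that scheme; it is not taken from the paper. The helpers `layer_sizes`, `random_masks` and `reconstruction_scores`, the `keep_prob` parameter, the tanh activation, and the use of squared reconstruction error as the outlier value are all illustrative assumptions, and training is omitted. Only the ratio `r` between consecutive layer widths comes from the symbol table.

```python
import numpy as np

def layer_sizes(d, r, depth):
    """Encoder widths shrink by ratio r per layer; the decoder mirrors them."""
    enc = [d]
    for _ in range(depth):
        enc.append(max(1, int(enc[-1] * r)))
    return enc + enc[-2::-1]

def random_masks(sizes, keep_prob, rng):
    """One binary mask per weight matrix; a 0 entry removes a connection."""
    return [(rng.random((a, b)) < keep_prob).astype(float)
            for a, b in zip(sizes[:-1], sizes[1:])]

def reconstruction_scores(X, weights, masks):
    """Forward pass through the masked network, then squared error per row."""
    h = X
    for W, M in zip(weights, masks):
        h = np.tanh(h @ (W * M))
    return np.sum((X - h) ** 2, axis=1)

rng = np.random.default_rng(0)
sizes = layer_sizes(d=21, r=0.5, depth=2)   # [21, 10, 5, 10, 21]
weights = [rng.normal(scale=0.1, size=(a, b))
           for a, b in zip(sizes[:-1], sizes[1:])]
masks = random_masks(sizes, keep_prob=0.8, rng=rng)
X = rng.normal(size=(4, 21))                # hypothetical data
scores = reconstruction_scores(X, weights, masks)
```

Drawing a fresh set of masks per base detector yields structurally different networks from the same layer layout, which is the source of diversity an ensemble needs.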

Tab. 2 Outlier value matrix

Data	D₁	D₂	$⋯$	D_{num_detector}
x₁	D₁(x₁)	D₂(x₁)	$⋯$	D_{num_detector}(x₁)
x₂	D₁(x₂)	D₂(x₂)	$⋯$	D_{num_detector}(x₂)
$⋮$	$⋮$	$⋮$	$⋮$	$⋮$
x_n	D₁(x_n)	D₂(x_n)	$⋯$	D_{num_detector}(x_n)

Tab. 3 Outlier label value matrix

Data x_i	Outlier label value Label(x_i) = max{D₁(x_i), D₂(x_i), ⋯, D_{num_detector}(x_i)}
x₁	Label(x₁)
x₂	Label(x₂)
$⋮$	$⋮$
x_n	Label(x_n)
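The relation between the outlier value matrix and the outlier label value matrix is a row-wise maximum. A small worked example, with made-up scores for four data objects and three base detectors (the numbers are hypothetical and only illustrate the formula):

```python
import numpy as np

# Hypothetical scores: rows are x_1..x_4, columns are D_1..D_3.
outlier_matrix = np.array([
    [0.10, 0.20, 0.15],
    [0.80, 0.60, 0.90],
    [0.30, 0.35, 0.25],
    [0.05, 0.02, 0.04],
])

# Label(x_i) = max{D_1(x_i), ..., D_num_detector(x_i)}, kept as an (n, 1) column.
label = outlier_matrix.max(axis=1, keepdims=True)
print(label.ravel())
```

Here x_2 receives the largest label value because at least one detector scores it highly, even though the detectors disagree on the exact value.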

Fig. 4 Local region of data objects

Tab. 4 ODDS public datasets used in experiments

Dataset	Samples	Dimensions	Outliers	Outlier ratio/%
Cardio	1831	21	176	9.60
Ionosphere	351	33	126	36.00
Mnist	7603	100	700	9.20
Pendigits	6870	16	156	2.27
Satimage-2	5803	36	71	1.20

Tab. 5 Confusion matrix of classification results

Positive example	Negative example
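The confusion-matrix counts and the two ranking metrics used in the comparisons (AUC and AP) can be computed from labels and outlier scores as sketched below. This is a plain NumPy illustration under the assumption that outliers are the positive class (label 1); the function names and the toy data are hypothetical, and the paper's own evaluation code is not shown on this page.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) with outliers (label 1) as the positive class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fn, fp, tn

def auc_score(y_true, scores):
    """Probability that a random outlier outscores a random inlier (ties count 1/2)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each true outlier."""
    order = np.argsort(-scores, kind="stable")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / hits.sum())

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (scores >= 0.3).astype(int)   # hypothetical threshold
print(confusion_counts(y_true, y_pred), auc_score(y_true, scores),
      average_precision(y_true, scores))
```

Unlike the confusion matrix, AUC and AP need no threshold, which suits unsupervised detectors that only output a ranking of outlier values.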

Tab. 6 AUC scores and AP scores comparison of various algorithms

Metric	Dataset	EAOD	HBOS	IForest	LOF	PCA	AE	FB
AUC	Cardio	0.9655	0.8149	0.9467	0.6571	0.9409	0.8847	0.8966
AUC	Ionosphere	0.8936	0.5892	0.8297	0.8647	0.7552	0.7909	0.7915
AUC	Mnist	0.9143	0.5832	0.8028	0.7280	0.8491	0.8253	0.8245
AUC	Pendigits	0.9503	0.9160	0.9412	0.5010	0.8506	0.8656	0.8958
AUC	Satimage-2	0.9896	0.9477	0.9841	0.4653	0.9573	0.8871	0.9013
AP	Cardio	0.6943	0.4256	0.5652	0.2177	0.5841	0.6026	0.6139
AP	Ionosphere	0.7957	0.3808	0.7585	0.7939	0.6809	0.7142	0.6716
AP	Mnist	0.4725	0.1156	0.2596	0.2793	0.3563	0.3586	0.2946
AP	Pendigits	0.2719	0.2131	0.2380	0.0659	0.1867	0.1892	0.1982
AP	Satimage-2	0.8256	0.6956	0.8102	0.0315	0.7723	0.7655	0.7393
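The improvements quoted in the abstract can be checked against the Cardio rows of Tab. 6. The short calculation below copies the EAOD and AE values from the table; the variable names are illustrative only.

```python
# AUC and AP on Cardio, copied from Tab. 6.
auc = {"EAOD": 0.9655, "AE": 0.8847}
ap = {"EAOD": 0.6943, "AE": 0.6026}

# Differences expressed in percentage points.
auc_gain_pp = round((auc["EAOD"] - auc["AE"]) * 100, 2)
ap_gain_pp = round((ap["EAOD"] - ap["AE"]) * 100, 2)
print(auc_gain_pp, ap_gain_pp)
```

Both differences agree with the 8.08 and 9.17 percentage points stated in the abstract.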

Fig. 5 Time consumption comparison of various algorithms on 5 datasets

Fig. 6 Analysis of change in the number of base detectors

Fig. 7 Analysis of changes in the number of layers of base detectors

Fig. 8 Analysis of change in the number of iterations of base detectors
