Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (7): 2078-2087. DOI: 10.11772/j.issn.1001-9081.2021050743

• Data science and technology •

Outlier detection algorithm based on autoencoder and ensemble learning

Yiyang GUO1, Jiong YU1,2, Xusheng DU1, Shaozhi YANG1, Ming CAO3

  1. College of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China
  2. School of Software, Xinjiang University, Urumqi Xinjiang 830091, China
  3. College of Information Science and Engineering, Ocean University of China, Qingdao Shandong 266100, China
  • Received: 2021-05-10 Revised: 2021-09-08 Accepted: 2021-09-15 Online: 2021-09-08 Published: 2022-07-10
  • Contact: Jiong YU
  • About author: GUO Yiyang, born in 1996, M. S. candidate. His research interests include machine learning and data mining.
    DU Xusheng, born in 1995, Ph. D. candidate. His research interests include machine learning and data mining.
    YANG Shaozhi, born in 1995, M. S. candidate. His research interests include machine learning and data mining.
    CAO Ming, born in 1996, M. S. candidate. Her research interests include machine learning and data mining.
  • Supported by:
    National Natural Science Foundation of China(61862060)

基于自编码器与集成学习的离群点检测算法

郭一阳1, 于炯1,2, 杜旭升1, 杨少智1, 曹铭3

  1. 新疆大学 信息科学与工程学院, 乌鲁木齐 830046
  2. 新疆大学 软件学院, 乌鲁木齐 830091
  3. 中国海洋大学 信息科学与工程学院, 山东 青岛 266100
  • 通讯作者: 于炯
  • 作者简介:郭一阳(1996—),男,山东滕州人,硕士研究生,主要研究方向:机器学习、数据挖掘
    杜旭升(1995—),男,甘肃庆阳人,博士研究生,CCF会员,主要研究方向:机器学习、数据挖掘
    杨少智(1995—),男,安徽凤阳人,硕士研究生,主要研究方向:机器学习、数据挖掘
    曹铭(1996—),女,山东菏泽人,硕士研究生,主要研究方向:机器学习、数据挖掘。
  • 基金资助:
    国家自然科学基金资助项目(61862060)

Abstract:

Autoencoder-based outlier detection algorithms are prone to overfitting on small- and medium-sized datasets, and traditional outlier detection algorithms based on ensemble learning neither optimize nor select their base detectors, which results in low detection accuracy. To address these problems, an Ensemble learning and Autoencoder-based Outlier Detection (EAOD) algorithm was proposed. Firstly, different base detectors were generated by randomly changing the connection structure of the autoencoder, and these detectors were used to obtain the outlier values and outlier label values of the data objects. Secondly, a local region around each object was constructed from the Euclidean distances between data objects, which were calculated by the nearest neighbor algorithm. Finally, based on the similarity between the outlier values and the outlier label values, the base detectors with strong detection ability in that region were selected and combined, and the combined outlier value of the object was taken as the final outlier value judged by the EAOD algorithm. In the experiments, compared with the AutoEncoder (AE) algorithm, the proposed algorithm increased the Area Under receiver operating characteristic Curve (AUC) and Average Precision (AP) scores by 8.08 percentage points and 9.17 percentage points respectively on the Cardio dataset; compared with the Feature Bagging (FB) ensemble learning algorithm, the proposed algorithm reduced the detection time cost by 21.33% on the Mnist dataset. Experimental results show that the proposed algorithm has good detection performance and real-time performance under unsupervised learning.
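
The second and third steps of the abstract (building a local region, then selecting and combining base detectors within it) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the outlier label values or the similarity are computed, so the sketch assumes that the label value is the mean of the base detector scores and that similarity is the Pearson correlation. The function names `local_region` and `combine_locally` and the parameters `k` and `top` are hypothetical. Training the randomly connected autoencoders is omitted; their outlier values are taken as the input `scores`.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def local_region(X, i, k):
    """Indices of the k nearest neighbours of X[i] by Euclidean distance."""
    others = (j for j in range(len(X)) if j != i)
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

def combine_locally(X, scores, k=3, top=1):
    """scores[d][j] is the outlier value that base detector d gives object j.

    Assumption: the outlier label value of an object is the mean score
    over all base detectors (the abstract does not define it).
    """
    n = len(X)
    pseudo = [sum(s[j] for s in scores) / len(scores) for j in range(n)]
    final = []
    for i in range(n):
        region = local_region(X, i, k)
        target = [pseudo[j] for j in region]
        # Assumption: similarity is the Pearson correlation, inside the local
        # region, between a detector's scores and the label values.
        sims = [pearson([s[j] for j in region], target) for s in scores]
        best = sorted(range(len(scores)), key=lambda d: sims[d], reverse=True)[:top]
        # Combine the selected detectors by averaging their scores for object i.
        final.append(sum(scores[d][i] for d in best) / len(best))
    return final
```

Calling `combine_locally(X, scores, k=3, top=2)` with a list of points `X` and one score list per base detector returns one combined outlier value per object. The scores are assumed to be on a comparable scale across detectors, since they are averaged directly.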

Key words: outlier detection, ensemble learning, AutoEncoder (AE), base detector, unsupervised learning

摘要:

针对基于自编码器的离群点检测算法在中小规模数据集上易过拟合以及传统的基于集成学习的离群点检测算法未对基检测器进行优化选择而导致的检测精度低的问题,提出了一种基于自编码器与集成学习的离群点检测(EAOD)算法。首先,随机改变自编码器的连接结构来生成不同的基检测器,以获取数据对象的离群值和标签离群值;然后,通过最近邻算法计算数据对象之间的欧氏距离,并在对象周围构建局部区域;最后,根据离群值与标签离群值之间的相似度,选择在该区域内检测能力强的基检测器进行组合,组合后的对象离群值作为EAOD算法最终判定的离群值。在实验中,所提算法与自编码器(AE)算法相比,在Cardio数据集上,接受者操作特征曲线下方的面积(AUC)和平均精度(AP)分值分别提高了8.08个百分点和9.17个百分点;所提算法与特征装袋(FB)集成学习算法相比,在Mnist数据集上,运行时间成本降低了21.33%。实验结果表明,在无监督学习下所提算法具有良好的检测性能和检测实时性。

关键词: 离群点检测, 集成学习, 自编码器, 基检测器, 无监督学习

CLC Number: