Journal of Computer Applications


Outlier detection algorithm based on autoencoder and ensemble learning

  

  • Received: 2021-05-10  Revised: 2021-09-08  Online: 2021-09-15  Published: 2021-09-15

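GUO Yiyang1, YU Jiong1,2, DU Xusheng1, YANG Shaozhi1, CAO Ming3

  1. School of Information Science and Engineering, Xinjiang University, Urumqi 830046
  2. School of Software, Xinjiang University, Urumqi 830091
  3. School of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100
  • Corresponding author: YU Jiong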

Abstract: Autoencoder-based outlier detection algorithms are prone to overfitting on small and medium-sized datasets, and traditional ensemble-learning-based outlier detection algorithms do not optimize the selection of base detectors, which results in low detection accuracy. To address these problems, an Ensemble and Autoencoder-based Outlier Detection (EAOD) algorithm, which uses autoencoders as the base detectors of an ensemble, was proposed. Firstly, the connection structure of the autoencoder was randomly changed to generate different base detectors, from which the outlier scores and outlier degree marker values of the data objects were obtained. Secondly, a local region around each object was constructed according to the Euclidean distances between data objects calculated by the nearest neighbor algorithm. Finally, based on the similarity between the outlier scores and the outlier degree marker values, the base detectors with strong detection ability in that region were selected and combined, and the combined outlier score of the object was taken as the final outlier score determined by EAOD. Compared with the AutoEncoder (AE) algorithm, the proposed algorithm increased the AUC and AP values by 8.08 and 9.17 percentage points respectively on the Cardio dataset; compared with the Feature Bagging (FB) ensemble learning algorithm, it reduced the running time by 21.33% on the Mnist dataset. The experimental results show that the algorithm achieves good detection performance and real-time performance under unsupervised learning.

Key words: outlier detection, ensemble learning, autoencoder, base detector, unsupervised learning
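The three steps summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the abstract leaves several details open, so the following are assumptions made only for illustration: the function names (`train_masked_autoencoder`, `eaod_like_scores`) are hypothetical; fixed random binary masks on a one-hidden-layer autoencoder stand in for the "randomly changed connection structure"; the outlier degree marker value is taken to be the mean of the standardized base-detector scores; similarity is measured by Pearson correlation; the selected detectors are combined by averaging; and all hyperparameters are arbitrary.

```python
import numpy as np


def train_masked_autoencoder(X, hidden, keep_prob, rng, epochs=300, lr=0.01):
    """Fit a one-hidden-layer tanh autoencoder with randomly removed connections."""
    n, d = X.shape
    # Assumption: fixed random binary masks model the randomly changed
    # connection structure; masked weights stay at zero during training.
    mask1 = rng.random((d, hidden)) < keep_prob
    mask2 = rng.random((hidden, d)) < keep_prob
    W1 = rng.normal(0.0, 0.1, (d, hidden)) * mask1
    W2 = rng.normal(0.0, 0.1, (hidden, d)) * mask2
    for _ in range(epochs):
        H = np.tanh(X @ W1)                    # encoder output
        E = H @ W2 - X                         # reconstruction residual
        grad_W2 = H.T @ E / n
        grad_Z = (E @ W2.T) * (1.0 - H ** 2)   # backprop through tanh
        grad_W1 = X.T @ grad_Z / n
        W1 -= lr * grad_W1 * mask1
        W2 -= lr * grad_W2 * mask2
    return W1, W2


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else 0.0


def eaod_like_scores(X, n_detectors=10, k=10, n_select=3, seed=0):
    """Return one outlier score per row of X (higher means more outlying)."""
    rng = np.random.default_rng(seed)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    n, d = X.shape

    # Step 1: base detectors with random connection structures.
    S = np.empty((n, n_detectors))
    for j in range(n_detectors):
        W1, W2 = train_masked_autoencoder(
            X, hidden=max(1, d // 2), keep_prob=rng.uniform(0.5, 1.0), rng=rng
        )
        residual = np.tanh(X @ W1) @ W2 - X
        S[:, j] = np.sqrt((residual ** 2).sum(axis=1))  # reconstruction error
    S = (S - S.mean(axis=0)) / (S.std(axis=0) + 1e-12)
    marker = S.mean(axis=1)  # assumed form of the outlier degree marker value

    # Step 2: local region = k nearest neighbours by Euclidean distance.
    # The full n x n distance matrix is only practical for small datasets.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(D, np.inf)  # exclude the object itself
    neighbours = np.argsort(D, axis=1)[:, :k]

    # Step 3: per object, keep the detectors whose scores in the local
    # region agree best with the marker values, then average them.
    final = np.empty(n)
    for i in range(n):
        region = neighbours[i]
        sims = np.array(
            [pearson(S[region, j], marker[region]) for j in range(n_detectors)]
        )
        best = np.argsort(sims)[-n_select:]
        final[i] = S[i, best].mean()
    return final
```

Calling `eaod_like_scores(X)` on an array of shape `(n, d)` returns `n` scores. Because detector selection is done per object inside its own neighbourhood, different objects may end up combining different subsets of the base detectors.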


CLC Number: