Journal of Computer Applications, 2024, Vol. 44, Issue (5): 1423-1427. DOI: 10.11772/j.issn.1001-9081.2023050697

• 2023 CCF Conference on Artificial Intelligence (CCFAI 2023)

Missing value imputation algorithm using dual discriminator based on conditional generative adversarial imputation network

Jia SU, Hong YU

  1. Chongqing Key Laboratory of Computational Intelligence (Chongqing University of Posts and Telecommunications), Chongqing 400065, China
  • Received: 2023-08-01 Revised: 2023-09-14 Accepted: 2023-09-28 Online: 2023-10-17 Published: 2024-05-10
  • Contact: Hong YU
  • About author: SU Jia, born in 1997 in Guang'an, Sichuan, M. S. candidate, CCF member. His research interests include unbalanced and incomplete data processing.
    YU Hong (first contact), born in 1972 in Chongqing, Ph. D., professor, CCF member. Her research interests include rough sets, industrial big data, data mining, knowledge discovery, granular computing, three-way clustering, and machine learning.
  • Supported by:
    National Key R&D Program of China (2021YFF0704103); National Natural Science Foundation of China (62136002)

Abstract:

Various factors in real-world applications can cause data to go missing, which affects the analysis in subsequent tasks, so imputing the missing values in a dataset is particularly important. However, incorrect imputed values can bias the analysis more severely than leaving the data unimputed. To address this, a new missing value imputation algorithm, DDC-GAIN (Dual Discriminator based on Conditional Generative Adversarial Imputation Network), is proposed. It builds on the Conditional Generative Adversarial Imputation Network (C-GAIN) and uses dual discriminators: an auxiliary discriminator assists the primary discriminator in judging whether predicted values are real or generated. Specifically, the authenticity of a generated sample is judged from that sample's global information, which places more emphasis on the relationships between features when estimating predicted values. In comparative experiments on four datasets against five classical imputation algorithms, DDC-GAIN achieves the lowest Root Mean Square Error (RMSE) under the same conditions when the sample size is large; on the Default credit card dataset with a missing rate of 15%, the RMSE of DDC-GAIN is 28.99% lower than that of the second-best algorithm, C-GAIN. This indicates that using the auxiliary discriminator to help the primary discriminator learn the relationships between features is effective.

Key words: Conditional Generative Adversarial Imputation Network (C-GAIN), imputation of missing data, incompleteness, feature relationship, dual discriminator
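
The RMSE comparison reported in the abstract can be illustrated with a small sketch. The abstract does not spell out the evaluation protocol, so the following assumes the common convention of scoring only the entries that were masked out, and it uses synthetic data with a column-mean baseline in place of any of the compared algorithms. The names `masked_rmse` and `relative_reduction` are hypothetical helpers introduced here, not part of the authors' code.

```python
import numpy as np


def masked_rmse(x_true, x_imputed, observed_mask):
    """RMSE restricted to entries that were missing (observed_mask == 0)."""
    missing = observed_mask == 0
    if not missing.any():
        raise ValueError("no missing entries to evaluate")
    diff = x_true[missing] - x_imputed[missing]
    return float(np.sqrt(np.mean(diff ** 2)))


def relative_reduction(rmse_new, rmse_baseline):
    """Percentage by which rmse_new is lower than rmse_baseline."""
    return 100.0 * (rmse_baseline - rmse_new) / rmse_baseline


# Synthetic stand-in data (assumption): 1000 samples, 8 features in [0, 1).
rng = np.random.default_rng(0)
x_true = rng.random((1000, 8))

# Drop roughly 15% of the entries completely at random (1 = observed).
observed_mask = (rng.random(x_true.shape) >= 0.15).astype(int)

# Toy baseline, not one of the paper's algorithms:
# fill each missing entry with the observed mean of its column.
x_obs = np.where(observed_mask == 1, x_true, np.nan)
col_mean = np.nanmean(x_obs, axis=0)
x_mean_imputed = np.where(observed_mask == 1, x_true, col_mean)

rmse_mean = masked_rmse(x_true, x_mean_imputed, observed_mask)
print(f"column-mean baseline RMSE on missing entries: {rmse_mean:.4f}")
```

Under this convention, a statement such as "28.99% lower RMSE" corresponds to `relative_reduction(rmse_ddc_gain, rmse_c_gain)` evaluated on the same mask. The numbers printed by the toy example are unrelated to the results in the paper.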

CLC Number: