Missing value imputation algorithm using dual discriminator based on conditional generative adversarial imputation network

Journal of Computer Applications, 2024, Vol. 44, Issue (5): 1423-1427. DOI: 10.11772/j.issn.1001-9081.2023050697

Special topics: Artificial Intelligence; 2023 China Computer Federation Conference on Artificial Intelligence (CCFAI 2023)

Jia SU, Hong YU

Chongqing Key Laboratory of Computational Intelligence (Chongqing University of Posts and Telecommunications), Chongqing 400065, China

Received: 2023-08-01 Revised: 2023-09-14 Accepted: 2023-09-28 Online: 2023-10-17 Published: 2024-05-10
Contact: Hong YU
About the author: SU Jia, born in 1997 in Guang'an, Sichuan, M. S. candidate, CCF member. His research interests include unbalanced and incomplete data processing.
First contact: YU Hong, born in 1972 in Chongqing, Ph. D., professor, CCF member. Her research interests include rough sets, industrial big data, data mining, knowledge discovery, granular computing, three-way clustering, and machine learning.
Supported by:
National Key R&D Program of China (2021YFF0704103); National Natural Science Foundation of China (62136002)

Abstract

Various factors in real-world applications can cause data to go missing, which affects the analysis in subsequent tasks, so imputing the missing values in a dataset is particularly important. An incorrect imputed value, however, can bias the analysis more severely than leaving the value missing. To address this, a new missing value imputation algorithm named DDC-GAIN (Dual Discriminator based on Conditional Generative Adversarial Imputation Network) is proposed, built on the Conditional Generative Adversarial Imputation Network (C-GAIN) with a dual discriminator. In DDC-GAIN, an auxiliary discriminator assists the primary discriminator in judging whether predicted values are real or generated: the authenticity of a generated sample is judged from the global information of that sample, which places more emphasis on the relationships between features when estimating predicted values. In comparison experiments against five classical imputation algorithms on four datasets, DDC-GAIN achieves the lowest Root Mean Square Error (RMSE) under the same conditions when the sample size is large; on the Default credit card dataset at a missing rate of 15%, the RMSE of DDC-GAIN is 28.99% lower than that of the second-best algorithm, C-GAIN. This indicates that using the auxiliary discriminator to help the primary discriminator learn the relationships between features is effective.
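The difference in scope between the two discriminators can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: this page gives neither the network architecture nor the loss, so the single linear layers, the weight names `w` and `v`, and the GAIN-style hint are illustrative assumptions. The sketch only shows that the primary discriminator scores each entry, while the auxiliary discriminator scores each sample as a whole.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combine(x, mask, generated):
    """Keep observed entries (mask == 1); fill missing ones with generated values."""
    return mask * x + (1.0 - mask) * generated

def primary_discriminator(x_hat, hint, w):
    """Element-wise scope: one probability per entry that the entry was observed."""
    return sigmoid(np.concatenate([x_hat, hint], axis=1) @ w)

def auxiliary_discriminator(x_hat, v):
    """Sample-level scope: one probability per row that the whole row is real."""
    return sigmoid(x_hat @ v)

n, d = 8, 5                                         # toy sizes (hypothetical)
x = rng.random((n, d))                              # toy data scaled to [0, 1]
mask = (rng.random((n, d)) > 0.2).astype(float)     # 1 = observed, 0 = missing
generated = rng.random((n, d))                      # stand-in for generator output
x_hat = combine(x, mask, generated)
hint = mask * (rng.random((n, d)) < 0.9)            # reveals part of the mask (assumption)

w = rng.normal(size=(2 * d, d))                     # hypothetical primary weights
v = rng.normal(size=(d, 1))                         # hypothetical auxiliary weights

p_elem = primary_discriminator(x_hat, hint, w)      # shape (n, d)
p_row = auxiliary_discriminator(x_hat, v)           # shape (n, 1)
print(p_elem.shape, p_row.shape)
```

Because the auxiliary score depends on every feature of a row at once, a generator trained against it is pushed toward imputations that are consistent with the other features in the same sample, which is the intuition the abstract gives for the second discriminator.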

Key words: Conditional Generative Adversarial Imputation Network (C-GAIN), imputation of missing data, incompleteness, feature relationship, dual discriminator

CLC number: TP391

Jia SU, Hong YU. Missing value imputation algorithm using dual discriminator based on conditional generative adversarial imputation network[J]. Journal of Computer Applications, 2024, 44(5): 1423-1427.

References (30)

1	LUO Y. Evaluating the state of the art in missing data imputation for clinical data[J]. Briefings in Bioinformatics, 2022, 23(1): bbab489. DOI: 10.1093/bib/bbab489
2	JUNGER W L, DE LEON A P. Imputation of missing data in time series for air pollutants[J]. Atmospheric Environment, 2015, 102: 96-104. DOI: 10.1016/j.atmosenv.2014.11.049
3	WANG Z, WANG L, TAN Y, et al. Fault detection based on Bayesian network and missing data imputation for building energy systems[J]. Applied Thermal Engineering, 2021, 182: 116051. DOI: 10.1016/j.applthermaleng.2020.116051
4	LIU Y-Q, WANG C, ZHANG L. Decision tree based predictive models for breast cancer survivability on imbalanced data[C]// Proceedings of the 3rd International Conference on Bioinformatics and Biomedical Engineering. Piscataway: IEEE, 2009: 1-4. DOI: 10.1109/icbbe.2009.5162571
5	LIN W-C, TSAI C-F. Missing value imputation: a review and analysis of the literature (2006-2017)[J]. Artificial Intelligence Review, 2020, 53: 1487-1507. DOI: 10.1007/s10462-019-09709-4
6	ZHANG Z. Missing values in big data research: some basic skills[J]. Annals of Translational Medicine, 2015, 3(21): 323.
7	YOON J, JORDON J, VAN DER SCHAAR M. GAIN: Missing data imputation using generative adversarial nets[J]. Proceedings of Machine Learning Research, 2018, 80: 5689-5698. DOI: 10.48550/arXiv.1806.02920
8	YADAV M L, ROYCHOUDHURY B. Handling missing values: a study of popular imputation packages in R[J]. Knowledge-Based Systems, 2018, 160: 104-118. DOI: 10.1016/j.knosys.2018.06.012
9	VAN BUUREN S, GROOTHUIS-OUDSHOORN K G M. MICE: Multivariate imputation by chained equations in R[J]. Journal of Statistical Software, 2011, 45: 1-67. DOI: 10.18637/jss.v045.i03
10	KHAN S I, HOQUE A S M L. SICE: an improved missing data imputation technique[J]. Journal of Big Data, 2020, 7: No. 37. DOI: 10.1186/s40537-020-00313-w
11	STEKHOVEN D J, BÜHLMANN P. MissForest: non-parametric missing value imputation for mixed-type data[J]. Bioinformatics, 2012, 28(1): 112-118. DOI: 10.1093/bioinformatics/btr597
12	MAZUMDER R, HASTIE T, TIBSHIRANI R. Spectral regularization algorithms for learning large incomplete matrices[J]. The Journal of Machine Learning Research, 2010, 11: 2287-2322. DOI: 10.1002/rnc.1522
13	RAHMAN M G, ISLAM M Z. Missing value imputation using a fuzzy clustering-based EM approach[J]. Knowledge and Information Systems, 2016, 46(2): 389-422. DOI: 10.1007/s10115-015-0822-y
14	GONDARA L, WANG K. MIDA: multiple imputation using deep denoising autoencoders[C]// Proceedings of the 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Cham: Springer, 2018: 260-272. DOI: 10.1007/978-3-319-93040-4_21
15	AWAN S E, BENNAMOUN M, SOHEL F, et al. Imputation of missing data with class imbalance using conditional generative adversarial networks[J]. Neurocomputing, 2021, 453: 164-171. DOI: 10.1016/j.neucom.2021.04.010
16	McKNIGHT P E, McKNIGHT K M, SIDANI S, et al. Missing Data: A Gentle Introduction[M]. New York: Guilford Press, 2007: 17.
17	MORITZ S, SARDÁ A, BARTZ-BEIELSTEIN T, et al. Comparison of different methods for univariate time series imputation in R[EB/OL]. [2023-04-23].
18	CHEN J, SHAO J. Nearest neighbor imputation for survey data[J]. Journal of Official Statistics, 2000, 16(2): 113-131.
19	KIM H, GOLUB G H, PARK H. Missing value estimation for DNA microarray gene expression data: local least squares imputation[J]. Bioinformatics, 2005, 21(2): 187-198. DOI: 10.1093/bioinformatics/bth499
20	AWAN S E, BENNAMOUN M, SOHEL F, et al. A reinforcement learning-based approach for imputing missing data[J]. Neural Computing and Applications, 2022, 34: 9701-9716. DOI: 10.1007/s00521-022-06958-3
21	SOVILJ D, EIROLA E, MICHE Y, et al. Extreme learning machine for missing data using multiple imputations[J]. Neurocomputing, 2016, 174: 220-231. DOI: 10.1016/j.neucom.2015.03.108
22	GARDNER M W, DORLING S R. Artificial neural networks (the multilayer perceptron): a review of applications in the atmospheric sciences[J]. Atmospheric Environment, 1998, 32(14/15): 2627-2636. DOI: 10.1016/s1352-2310(97)00447-0
23	DOOVE L L, VAN BUUREN S, DUSSELDORP E. Recursive partitioning for missing data imputation in the presence of interaction effects[J]. Computational Statistics & Data Analysis, 2014, 72: 92-104. DOI: 10.1016/j.csda.2013.10.025
24	SHAH A D, BARTLETT J W, CARPENTER J, et al. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study[J]. American Journal of Epidemiology, 2014, 179(6): 764-774. DOI: 10.1093/aje/kwt312
25	TRAN L, LIU X, ZHOU J, et al. Missing modalities imputation via cascaded residual autoencoder[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4971-4980. DOI: 10.1109/cvpr.2017.528
26	ŚMIEJA M, STRUSKI Ł, TABOR J, et al. Processing of missing data by neural networks[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 2724-2734.
27	GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144. DOI: 10.1145/3422622
28	ZHOU X, LIU X, LAN G, et al. Federated conditional generative adversarial nets imputation method for air quality missing data[J]. Knowledge-Based Systems, 2021, 228: 107261. DOI: 10.1016/j.knosys.2021.107261
29	DUA D, GRAFF C. UCI machine learning repository[DB/OL]. [2023-05-29].
30	CAI J-F, CANDÈS E J, SHEN Z. A singular value thresholding algorithm for matrix completion[J]. SIAM Journal on Optimization, 2010, 20(4): 1956-1982. DOI: 10.1137/080738970

Dataset	Samples	Classes	Features
Breast cancer	569	2	30
Spambase	4,601	2	57
Default credit card	30,000	2	24
News popularity	39,644	2	61
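To report RMSE at a given missing rate on datasets like these, a benchmark typically removes known entries at random, imputes them, and scores only the removed entries. The sketch below follows that common protocol as an assumption; this page does not confirm the exact procedure used in the article. The data are synthetic (only the 569 x 30 shape is borrowed from the Breast cancer row), column-mean imputation is a placeholder baseline, and the function names are hypothetical.

```python
import numpy as np

def make_random_mask(shape, missing_rate, rng):
    """Return a 0/1 mask: 1 marks an observed entry, 0 a removed one."""
    return (rng.random(shape) >= missing_rate).astype(float)

def rmse_on_missing(x_true, x_imputed, mask):
    """RMSE computed only over the entries that were removed (mask == 0)."""
    missing = 1.0 - mask
    squared_error = ((x_true - x_imputed) * missing) ** 2
    return float(np.sqrt(squared_error.sum() / missing.sum()))

rng = np.random.default_rng(0)
x_true = rng.random((569, 30))          # synthetic stand-in, values in [0, 1]
mask = make_random_mask(x_true.shape, 0.15, rng)

# Placeholder baseline: fill each column with the mean of its observed entries.
col_means = (x_true * mask).sum(axis=0) / mask.sum(axis=0)
x_imputed = mask * x_true + (1.0 - mask) * col_means

score = rmse_on_missing(x_true, x_imputed, mask)
print(f"RMSE on removed entries: {score:.4f}")
```

Any imputer can be swapped in for the column-mean step; the mask and the scoring function stay the same, which keeps comparisons across methods on an equal footing.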

Missing rate/%	Dataset	DDC-GAIN	C-GAIN[16]	GAIN[7]	MICE[10]	MissForest[12]	Matrix[30]
5	Default credit card	0.1674±0.0039	0.2329±0.0039	0.2428±0.0093	0.2479±0.0079	0.2902±0.0010	0.2565±0.0089
5	Spambase	0.0535±0.0047	0.0611±0.0060	0.0723±0.0018	0.0747±0.0045	0.0771±0.0071	0.0943±0.0009
5	News popularity	0.1946±0.0022	0.1964±0.0033	0.2822±0.0024	0.2010±0.0025	0.2114±0.0014	0.4178±0.0015
5	Breast cancer	0.0660±0.0067	0.0643±0.0014	0.0972±0.0013	0.0854±0.0013	0.0658±0.0022	0.6881±0.0034
10	Default credit card	0.1706±0.0066	0.2009±0.0022	0.2109±0.0344	0.2491±0.0085	0.2439±0.0079	0.2559±0.0075
10	Spambase	0.0637±0.0078	0.0664±0.0017	0.0702±0.0031	0.0793±0.0040	0.0783±0.0029	0.0906±0.0011
10	News popularity	0.1917±0.0018	0.1937±0.0074	0.2680±0.0015	0.2124±0.0013	0.2442±0.0015	0.4175±0.0016
10	Breast cancer	0.0748±0.0081	0.0628±0.0024	0.0931±0.0010	0.0881±0.0054	0.0692±0.0017	0.6895±0.0038
15	Default credit card	0.1643±0.0090	0.2314±0.0035	0.2442±0.0089	0.2479±0.0074	0.2672±0.0025	0.2565±0.0059
15	Spambase	0.0577±0.0030	0.0607±0.0033	0.0739±0.0025	0.0784±0.0024	0.0777±0.0021	0.0902±0.0052
15	News popularity	0.1929±0.0017	0.1992±0.0069	0.2869±0.0036	0.2283±0.0015	0.2918±0.0015	0.4177±0.0015
15	Breast cancer	0.0732±0.0135	0.0673±0.0039	0.0986±0.0033	0.0877±0.0056	0.0689±0.0058	0.7042±0.0016
20	Default credit card	0.1610±0.0019	0.2213±0.0099	0.2426±0.0090	0.2480±0.0091	0.2646±0.0026	0.2537±0.0051
20	Spambase	0.0590±0.0031	0.0601±0.0013	0.0764±0.0034	0.0796±0.0032	0.0786±0.0059	0.0896±0.0019
20	News popularity	0.1925±0.0047	0.1931±0.0014	0.2686±0.0010	0.2424±0.0022	0.3907±0.0015	0.4176±0.0015
20	Breast cancer	0.0764±0.0087	0.0637±0.0092	0.1053±0.0046	0.0903±0.0064	0.0726±0.0038	0.6858±0.0012
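The headline figure in the abstract can be checked against the table above. The short sketch below takes the mean values from the rows at a 15% missing rate (assuming, as the abstract implies, that the cells report RMSE as mean ± standard deviation), computes the relative reduction of DDC-GAIN over C-GAIN on Default credit card, and finds the best method per dataset. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, proposed):
    """Fraction by which `proposed` lowers the error relative to `baseline`."""
    return (baseline - proposed) / baseline

# Mean values copied from the rows at a 15% missing rate.
methods = ["DDC-GAIN", "C-GAIN", "GAIN", "MICE", "MissForest", "Matrix"]
rmse_15 = {
    "Default credit card": [0.1643, 0.2314, 0.2442, 0.2479, 0.2672, 0.2565],
    "Spambase":            [0.0577, 0.0607, 0.0739, 0.0784, 0.0777, 0.0902],
    "News popularity":     [0.1929, 0.1992, 0.2869, 0.2283, 0.2918, 0.4177],
    "Breast cancer":       [0.0732, 0.0673, 0.0986, 0.0877, 0.0689, 0.7042],
}

credit = dict(zip(methods, rmse_15["Default credit card"]))
reduction = relative_reduction(credit["C-GAIN"], credit["DDC-GAIN"])
print(f"Reduction vs C-GAIN: {reduction:.2%}")

# Method with the lowest mean value on each dataset.
best = {
    name: methods[values.index(min(values))]
    for name, values in rmse_15.items()
}
print(best)
```

The computed reduction is about 29%, which matches the reported 28.99% up to rounding of the four-decimal table entries. The per-dataset minimum also shows why the abstract qualifies its claim with sample size: DDC-GAIN has the lowest mean on the three larger datasets, while C-GAIN is lower on Breast cancer, the smallest one.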
