Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (4): 1094-1101. DOI: 10.11772/j.issn.1001-9081.2022030383

• Data science and technology •

Generative adversarial network based data uncertainty quantification method

Hao WANG1, Zicheng WANG1, Chao ZHANG1, Yunsheng MA2

  1. School of Mathematical Sciences, Dalian University of Technology, Dalian Liaoning 116024, China
    2. Shandong Chambroad Holding Group Company Limited, Binzhou Shandong 256500, China
  • Received: 2022-03-30 Revised: 2022-06-22 Accepted: 2022-06-24 Online: 2023-01-11 Published: 2023-04-10
  • Contact: Yunsheng MA
  • About the authors: WANG Hao, born in 1996, Ph. D. candidate. His research interests include machine learning, reinforcement learning, and active learning.
    WANG Zicheng, born in 1996, M. S. candidate. His research interests include machine learning and uncertainty quantification.
    ZHANG Chao, born in 1981, Ph. D., professor. His research interests include machine learning.
  • Supported by:
    National Key R&D Program of China (2020YFB1711104)

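Chinese title: 基于生成对抗网络的数据不确定性量化方法 (Generative adversarial network based data uncertainty quantification method)

WANG Hao (王昊)1, WANG Zicheng (王子成)1, ZHANG Chao (张超)1, MA Yunsheng (马韵升)2

  1. School of Mathematical Sciences, Dalian University of Technology, Dalian Liaoning 116024
    2. Shandong Chambroad Holding Group Company Limited, Binzhou Shandong 256500
  • Corresponding author: MA Yunsheng (马韵升)
  • About the authors: WANG Hao (born 1996), male, from Fushun, Liaoning, Ph. D. candidate. His research interests include machine learning, reinforcement learning, and active learning.
    WANG Zicheng (born 1996), male, from Anshan, Liaoning, M. S. candidate. His research interests include machine learning and uncertainty quantification.
    ZHANG Chao (born 1981), male, from Wen'an, Hebei, Ph. D., professor. His research interests include machine learning.
  • Supported by:
    National Key R&D Program of China (2020YFB1711104)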

Abstract:

To solve the problem that directly processing high-dimensional, high-frequency, noisy real-world data leads to unreliable estimators, a data uncertainty quantification method based on Generative Adversarial Network (GAN) was proposed. Firstly, the original data distribution was reconstructed by the GAN, which constructed a mapping from the noise space to the space of the original data. Secondly, samples were drawn by the Markov Chain Monte Carlo (MCMC) method to obtain new samples based on the original data distribution. Thirdly, confidence intervals for the uncertainty of the samples were defined based on specified functions. Finally, the confidence intervals were used to estimate the uncertainty of the original data, and the data within the confidence intervals were selected as the data used by the estimator. Experimental results show that training the estimator with the data within the confidence intervals requires 50% fewer samples to reach its performance upper limit than training with the original data. In addition, compared with the original data, the data within the confidence intervals require 30% fewer samples on average to achieve the same test accuracy.
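
The sampling, interval and selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the GAN-reconstructed distribution is replaced by a stand-in one-dimensional standard normal log-density, the sampler is a plain random-walk Metropolis-Hastings chain, the "specified function" is taken to be the identity, and the interval is an empirical 95% central interval. The names `log_density`, `score`, `confidence_interval` and `select_within_interval`, as well as the toy data, are hypothetical.

```python
import math
import random


def log_density(x):
    """Stand-in (assumption) for the distribution reconstructed by the GAN:
    an unnormalised standard normal log-density."""
    return -0.5 * x * x


def metropolis_hastings(log_p, n_samples, step=2.0, burn_in=500, seed=0):
    """Random-walk Metropolis-Hastings sampler for a 1-D target."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for i in range(n_samples + burn_in):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_p(proposal) - log_p(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if i >= burn_in:
            samples.append(x)
    return samples


def score(x):
    """Hypothetical 'specified function' mapping a sample to a scalar."""
    return x


def confidence_interval(values, level=0.95):
    """Empirical central interval covering roughly `level` of the values."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lower, upper


def select_within_interval(data, lower, upper):
    """Keep only the data points whose score lies inside the interval."""
    return [x for x in data if lower <= score(x) <= upper]


# Draw new samples, define the interval, then filter some toy raw data.
mcmc_samples = metropolis_hastings(log_density, n_samples=5000)
lower, upper = confidence_interval([score(s) for s in mcmc_samples])
raw_data = [-4.0, -1.0, 0.0, 0.5, 1.2, 3.9]
kept = select_within_interval(raw_data, lower, upper)
```

In this toy setting the extreme points of `raw_data` fall outside the interval and are discarded, which mirrors the idea of passing only the data within the confidence intervals to the estimator.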

Key words: Generative Adversarial Network (GAN), uncertainty quantification, Markov Chain Monte Carlo (MCMC) method, confidence interval, uncertainty estimation

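Abstract (translated from Chinese):

To address the problem that directly processing high-dimensional, high-frequency, noisy real-world data makes estimators unreliable, a data uncertainty quantification method based on Generative Adversarial Network (GAN) is proposed. First, the original data distribution is reconstructed by the GAN, which builds a mapping from the noise space to the original data space. Second, samples are drawn with the Markov Chain Monte Carlo (MCMC) method, yielding new samples based on the original data distribution. Then, uncertainty confidence intervals for the samples are defined based on specified functions. Finally, the confidence intervals are used to estimate the uncertainty of the original data, and the data within the confidence intervals are selected as the data used by the estimator. Experimental results show that, compared with using the original data, training the estimator with the data within the confidence intervals reduces the number of samples needed to reach the performance upper limit by 50%; moreover, compared with the original training data, the data within the confidence intervals require 30% fewer samples on average to reach the same test accuracy.

Key words (translated from Chinese): generative adversarial network, uncertainty quantification, Markov chain Monte Carlo method, confidence interval, uncertainty estimation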
