Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (3): 783-790. DOI: 10.11772/j.issn.1001-9081.2021040759

Special Issue: Artificial Intelligence; 2021 CCF Conference on Artificial Intelligence (CCFAI 2021)

• 2021 CCF Conference on Artificial Intelligence (CCFAI 2021) •

Gene data generation method based on generative adversarial network

Yimin CAO, Lei CAI, Jingyang GAO

  1. College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
  • Received: 2021-05-12 Revised: 2021-06-03 Accepted: 2021-06-09 Online: 2021-11-09 Published: 2022-03-10
  • Contact: Jingyang GAO
  • About author:CAO Yimin, born in 1997, M. S. candidate. His research interests include bioinformatics, deep learning, data mining.
    CAI Lei, born in 1992, Ph. D. candidate. His research interests include bioinformatics, deep learning, data mining.
  • Supported by:
    Beijing Natural Science Foundation (5182018)


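Yimin CAO, Lei CAI, Jingyang GAO

  1. College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
  • Corresponding author: Jingyang GAO
  • About author: CAO Yimin, born in 1997, male, from Xinyang, Henan, M. S. candidate. His main research interests include bioinformatics, deep learning, data mining.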


In deep learning, as the depth of a Convolutional Neural Network (CNN) increases, more and more data is required for training. However, gene structure variation is a small-sample event in large-scale genetic data, so image data of variant genes is in severe shortage. This shortage seriously degrades CNN training and leads to low precision and a high false positive rate in gene structure variation detection. To increase the number of gene structure variation samples and improve the precision with which a CNN identifies gene structure variation, a gene image data augmentation method based on Generative Adversarial Network (GAN), namely GeneGAN, was proposed. Firstly, initial gene image data was generated by using the Reads stacking method and divided into two datasets: variant gene images and non-variant gene images. Secondly, GeneGAN was used to augment the variant image samples to balance the positive and negative datasets. Finally, a CNN was used to detect gene structure variation on the datasets before and after augmentation, with precision, recall and F1 score as the evaluation metrics. Experimental results show that, compared with the traditional augmentation method, the GAN-based augmentation method and the feature extraction method, GeneGAN improves the F1 score by 1.94 to 17.46 percentage points, verifying that GeneGAN can improve the precision with which a CNN identifies gene structure variation.
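
The balancing step and the evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `samples_needed_to_balance` and `precision_recall_f1`, the choice of label 1 for variant images, and the sample counts in the usage lines are all assumptions made for illustration. It only shows how many synthetic variant images would be needed to equalize the two classes, and how precision, recall and F1 score follow from predicted and true labels.

```python
from typing import Sequence, Tuple


def samples_needed_to_balance(n_variant: int, n_non_variant: int) -> int:
    """Synthetic variant images needed so both classes have equal size.

    Hypothetical helper: the abstract only states that the positive and
    negative datasets are balanced, not how the target count is chosen.
    """
    return max(0, n_non_variant - n_variant)


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Binary precision, recall and F1, with label 1 (variant) as positive."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against empty denominators by returning 0.0 for that metric.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts, not taken from the paper.
    print(samples_needed_to_balance(n_variant=120, n_non_variant=1000))

    # Made-up labels: 1 = variant image, 0 = non-variant image.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
    print(precision_recall_f1(y_true, y_pred))
```

Because F1 is the harmonic mean of precision and recall, it stays low when either one is low, which makes it a reasonable single summary when variant samples are rare.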

Key words: Generative Adversarial Network (GAN), residual learning, gene image, Convolutional Neural Network (CNN), data augmentation




CLC Number: