Journal of Computer Applications, 2022, Vol. 42, Issue (3): 783-790. DOI: 10.11772/j.issn.1001-9081.2021040759

• 2021 CCF Conference on Artificial Intelligence (CCFAI 2021)

Gene data generation method based on generative adversarial network

Yimin CAO, Lei CAI, Jingyang GAO

  1. College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
  • Received: 2021-05-12 Revised: 2021-06-03 Accepted: 2021-06-09 Online: 2021-11-09 Published: 2022-03-10
  • Contact: Jingyang GAO
  • About the authors: CAO Yimin, born in 1997 in Xinyang, Henan, M.S. candidate. His research interests include bioinformatics, deep learning, and data mining.
    CAI Lei, born in 1992 in Hohhot, Inner Mongolia, Ph.D. candidate. His research interests include bioinformatics, deep learning, and data mining.
  • Supported by:
    Beijing Natural Science Foundation (5182018)

Abstract:

In deep learning, the amount of data needed to train a Convolutional Neural Network (CNN) grows as the network gets deeper. Gene structure variation, however, is a small-sample event in large-scale genetic data, so image data of variant genes is severely scarce. This scarcity seriously degrades CNN training and leads to poor precision and a high false positive rate in gene structure variation detection. To increase the number of gene structure variation samples and improve the precision with which a CNN identifies gene structure variation, a gene image data augmentation method based on Generative Adversarial Network (GAN), namely GeneGAN, was proposed. Firstly, initial gene image data was generated with the Reads stacking method and divided into two datasets: variant gene images and non-variant gene images. Secondly, GeneGAN was used to augment the variant image samples in order to balance the positive and negative datasets. Finally, a CNN was used for detection on the datasets before and after balancing, and the precision, recall and F1 score were compared. Experimental results show that, compared with the traditional augmentation method, the GAN-based augmentation method and the feature extraction method, GeneGAN improves the F1 score of gene structure variation detection by 1.94 to 17.46 percentage points, indicating that generating gene data with GeneGAN can effectively improve the precision of CNN-based gene image classification.

Key words: Generative Adversarial Network (GAN), residual learning, gene image, Convolutional Neural Network (CNN), data augmentation
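The abstract compares precision, recall and F1 score before and after balancing, and reports the gain in percentage points. The following is a minimal sketch of how those quantities are conventionally computed from confusion counts for the positive (variant) class. It is not the authors' evaluation code: the names `Confusion`, `f1_gain_in_points` and `synthetic_samples_needed` are hypothetical, and every count in the usage example is made up for illustration. The balancing helper assumes "balanced" means equal numbers of variant and non-variant images, which the abstract does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for the positive (variant) class."""
    tp: int  # variant images classified as variant
    fp: int  # non-variant images classified as variant
    fn: int  # variant images classified as non-variant


def precision(c: Confusion) -> float:
    """Fraction of predicted variants that are real variants."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Confusion) -> float:
    """Fraction of real variants that were detected."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1_score(c: Confusion) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def f1_gain_in_points(before: Confusion, after: Confusion) -> float:
    """Absolute F1 difference in percentage points (not a relative percent change)."""
    return 100.0 * (f1_score(after) - f1_score(before))


def synthetic_samples_needed(n_variant: int, n_non_variant: int) -> int:
    """Generated variant images needed to match the non-variant count
    (assumes a 1:1 target ratio)."""
    return max(0, n_non_variant - n_variant)


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the paper.
    before = Confusion(tp=60, fp=20, fn=40)
    after = Confusion(tp=80, fp=20, fn=20)
    print(f"F1 before: {f1_score(before):.4f}")
    print(f"F1 after:  {f1_score(after):.4f}")
    print(f"Gain: {f1_gain_in_points(before, after):.2f} percentage points")
    print(f"Synthetic variants needed: {synthetic_samples_needed(200, 1000)}")
```

A gain expressed in percentage points is the plain difference between two F1 values on a 0 to 100 scale, so an increase from 66.67 to 80.00 is 13.33 points even though the relative improvement is about 20%.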

CLC number: