Gene data generation method based on generative adversarial network

doi:10.11772/j.issn.1001-9081.2021040759

Abstract

In deep learning, as the depth of a Convolutional Neural Network (CNN) increases, more and more data is required for training. However, gene structure variation is a small-sample event in large-scale genetic data, so image data of variant genes is severely scarce. This seriously degrades CNN training and leads to poor detection precision and a high false positive rate for gene structure variation. To increase the number of gene structure variation samples and improve the precision with which a CNN identifies gene structure variation, a gene image data augmentation method based on the Generative Adversarial Network (GAN), namely GeneGAN, was proposed. Firstly, initial gene image data was generated by using the Reads stacking method and divided into two datasets: variant gene images and non-variant gene images. Secondly, GeneGAN was used to augment the variant image samples to balance the positive and negative datasets. Finally, a CNN was used to classify the datasets before and after augmentation, with precision, recall and F1 score as the evaluation metrics. Experimental results show that, compared with the traditional augmentation method, GAN-based augmentation methods and feature extraction methods, GeneGAN improves the F1 score by 1.94 to 17.46 percentage points, verifying that the GeneGAN method can improve the precision of CNN-based identification of gene structure variation.

Key words: Generative Adversarial Network (GAN), residual learning, gene image, Convolutional Neural Network (CNN), data augmentation

CLC Number: TP391

Yimin CAO, Lei CAI, Jingyang GAO. Gene data generation method based on generative adversarial network[J]. Journal of Computer Applications, 2022, 42(3): 783-790.

Figures/Tables 15

Fig. 1 GeneGAN network structure

Fig. 2 Flowchart of gene data classification algorithm based on GeneGAN

Fig. 3 Network structure of Generator

Fig. 4 Network structure of Discriminator

Tab. 1 Significance of four pixel colors in gene image

Pixel color	Matching mode	Deleted or not

Fig. 5 Comparison of positive and negative samples in original genetic images

Tab. 2 Network structure parameters

Network	Learning rate	Optimizer	Batch size
GeneGAN	0.0001	Adam	64
CNN	1E-8	SGD	64

Tab. 3 Experimental results of raw data with different proportions of positive and negative samples

Positive-to-negative sample ratio	Precision/%	Recall/%	F1/%
1:100	46.70	61.28	53.01
1:50	47.31	65.73	55.02
1:25	49.17	69.13	57.46
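
The F1 column in these tables can be checked against the precision and recall columns, assuming the standard definition of F1 as the harmonic mean of the two. The sketch below is illustrative only: the function name `f1_score` and the `TAB3` list are not from the paper, and the 0.01 tolerance is an assumption that allows for rounding of the published values.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows of Tab. 3: (ratio, precision %, recall %, reported F1 %)
TAB3 = [
    ("1:100", 46.70, 61.28, 53.01),
    ("1:50", 47.31, 65.73, 55.02),
    ("1:25", 49.17, 69.13, 57.46),
]

for ratio, p, r, reported in TAB3:
    recomputed = f1_score(p, r)
    # Published values are rounded to two decimals, so compare with a tolerance.
    status = "consistent" if abs(recomputed - reported) < 0.01 else "check"
    print(f"{ratio}: recomputed F1 = {recomputed:.4f}, reported = {reported} ({status})")
```

Because the inputs are already rounded, the recomputed value can differ from the reported one in the last digit; a tolerance check is therefore more appropriate than exact equality.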

Fig. 6 Examples of augmented gene images generated by traditional methods

Tab. 4 Experimental results of traditionally augmented data with different proportions of positive and negative samples

Positive-to-negative sample ratio	Precision/%	Recall/%	F1/%
1:15	49.91	70.26	58.36
1:1	50.43	72.44	59.46

Tab. 5 Experimental results of data augmented by the original GAN with different proportions of positive and negative samples

Positive-to-negative sample ratio	Precision/%	Recall/%	F1/%
1:15	50.73	71.62	59.39
1:1	53.17	78.31	63.33
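
Moving from a 1:25 ratio to 1:15 or 1:1 means adding generated positive (variant) images while the negative set stays fixed. The following sketch shows the arithmetic only. The function name and the sample counts (400 variant and 10,000 non-variant images) are hypothetical and are not taken from the paper.

```python
import math


def synthetic_positives_needed(n_pos: int, n_neg: int, neg_per_pos: float) -> int:
    """Generated positive images needed so that positives : negatives
    reaches at least 1 : neg_per_pos, keeping the negative set unchanged."""
    if n_pos < 0 or n_neg < 0 or neg_per_pos <= 0:
        raise ValueError("counts must be non-negative and the ratio positive")
    target_pos = math.ceil(n_neg / neg_per_pos)
    return max(0, target_pos - n_pos)


# Hypothetical dataset with a 1:25 positive-to-negative ratio.
N_POS, N_NEG = 400, 10_000

for neg_per_pos in (25, 15, 1):
    extra = synthetic_positives_needed(N_POS, N_NEG, neg_per_pos)
    print(f"target 1:{neg_per_pos} -> generate {extra} positive images")
```

With these assumed counts, reaching full balance requires far more generated images than real ones, which is why the quality of the generator matters for the downstream classifier.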

Fig. 7 Images generated by four GAN methods

Fig. 8 Learning process of the convolutional neural network on multiple GAN-augmented datasets

Tab. 6 Experimental results of data augmented by four kinds of GAN with different proportions of positive and negative samples

Data	Positive-to-negative sample ratio	Precision/%	Recall/%	F1/%	Time/min
Original data	1:25	49.17	69.13	57.46	152.1
GAN-augmented data	1:15	50.73	71.62	59.39	154.4
DCGAN-augmented data	1:15	51.44	73.69	60.58	150.8
WGAN-GP-augmented data	1:15	51.06	72.14	59.79	151.7
GeneGAN-augmented data	1:15	51.84	75.81	61.57	152.4
GAN-augmented data	1:1	53.17	78.31	63.34	147.8
DCGAN-augmented data	1:1	53.91	79.82	64.35	142.1
WGAN-GP-augmented data	1:1	53.62	79.91	64.18	143.5
GeneGAN-augmented data	1:1	55.28	82.78	66.29	144.5

Tab. 7 Experimental results comparison of different feature extraction methods

Method	Precision/%	Recall/%	F1/%
SVIM	49.20	81.79	61.44
Sniffles	54.39	77.86	64.05
PBHoney	59.18	41.56	48.83
GeneGAN	55.28	82.78	66.29
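
The abstract reports an F1 improvement of 1.94 to 17.46 percentage points. The sketch below recomputes the gains from the F1 values listed in Tab. 4, Tab. 6 and Tab. 7. Which baselines the abstract's range refers to is an assumption here (all methods at the 1:1 ratio plus the three feature extraction tools), and the names `GENEGAN_F1`, `BASELINE_F1` and `f1_gains` are illustrative.

```python
# GeneGAN F1 (%) at the 1:1 ratio, as listed in Tab. 6 and Tab. 7.
GENEGAN_F1 = 66.29

# Baseline F1 values (%) copied from the tables; the grouping is an assumption.
BASELINE_F1 = {
    "Traditional augmentation (Tab. 4, 1:1)": 59.46,
    "GAN (Tab. 6, 1:1)": 63.34,
    "DCGAN (Tab. 6, 1:1)": 64.35,
    "WGAN-GP (Tab. 6, 1:1)": 64.18,
    "SVIM (Tab. 7)": 61.44,
    "Sniffles (Tab. 7)": 64.05,
    "PBHoney (Tab. 7)": 48.83,
}


def f1_gains(reference: float, baselines: dict[str, float]) -> dict[str, float]:
    """Gain of the reference F1 over each baseline, in percentage points."""
    return {name: round(reference - f1, 2) for name, f1 in baselines.items()}


gains = f1_gains(GENEGAN_F1, BASELINE_F1)
for name, gain in sorted(gains.items(), key=lambda item: item[1]):
    print(f"{name}: +{gain:.2f} percentage points")
print("range:", min(gains.values()), "to", max(gains.values()))
```

Under this assumption, the smallest gain is over DCGAN augmentation and the largest is over PBHoney, which matches the range stated in the abstract.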

