Genotype imputation algorithm fusing convolution and self-attention mechanism

doi:10.11772/j.issn.1001-9081.2022111756

Journal of Computer Applications, 2023, Vol. 43, Issue (11): 3534-3539. DOI: 10.11772/j.issn.1001-9081.2022111756

• Advanced computing •

Jionghuan CHEN^1,2, Shengli BAO^1,2, Xiaofei WANG^1,2, Ruofan LI^1,2

^1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610213, China
^2. University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2022-11-24 Revised: 2023-02-06 Accepted: 2023-02-09 Online: 2023-02-28 Published: 2023-11-10
Contact: Shengli BAO (baoshengli@casit.com.cn)
About author: CHEN Jionghuan, born in 1998 in Weifang, Shandong, M. S. candidate. His research interests include machine learning, big data systems, and large-scale data analysis.
BAO Shengli, born in 1973 in Huangshan, Anhui, Ph. D., research fellow. His research interests include software engineering and big data intelligence.
WANG Xiaofei, born in 1997 in Cili, Hunan, M. S. candidate. His research interests include machine learning and recommendation algorithms.
LI Ruofan, born in 1997 in Lanzhou, Gansu, M. S. candidate. His research interests include machine learning and time series forecasting.
Supported by:
"Western Young Scholars" Project of Chinese Academy of Sciences (RRJZ2021003)

Abstract:

Genotype imputation estimates the sample regions left uncovered in gene sequencing data, compensating for data that is missing because of technical limitations. However, existing deep learning-based imputation methods cannot effectively capture the linkage among loci across the complete sequence, which results in low overall imputation accuracy and high dispersion of imputation accuracy across a batch of sequences. To address these problems, FCSA (Fusing Convolution and Self-Attention), an imputation method that fuses convolution and the self-attention mechanism, is proposed; its network model is an encoder-decoder built from two kinds of fusion modules. The encoder fusion module uses a self-attention layer to obtain the correlation among loci across the complete sequence, fuses this correlation into the global loci, and then extracts local features through a convolutional layer. The decoder fusion module uses convolution to reconstruct local features from the encoded low-dimensional vector, then applies a self-attention layer to model the complete sequence and fuses the result. The model was trained on genetic data from multiple animal species and compared on the Dog, Pig and Chicken datasets. The results show that, compared with SCDA (Sparse Convolutional Denoising Autoencoders), AGIC (Autoencoder Genome Imputation and Compression) and U-net, FCSA achieves the highest average imputation accuracy at 10%, 20% and 30% missing rates, with relatively low dispersion of imputation accuracy across batch sequences. Ablation results also show that the design of the two fusion modules effectively improves genotype imputation accuracy.

Key words: genotype imputation, convolution, self-attention, fusion module, full sequence modeling
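
The encoder fusion module described in the abstract (self-attention over all loci, fusion into the input, then convolution for local features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the fusion operator, layer sizes, or activation, so the residual addition, the ReLU, the weight shapes, and the function names (`self_attention`, `conv1d_same`, `encoder_fusion`) are all assumptions made for illustration, and the reduction to a low-dimensional code is not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over all loci. x has shape (L, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (L, L) locus-to-locus weights
    return weights @ v, weights

def conv1d_same(x, kernel):
    """Zero-padded 1D convolution along the loci axis.
    x: (L, d_in); kernel: (k, d_in, d_out) with odd k; returns (L, d_out)."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.einsum("kd,kdo->o", xp[i:i + k], kernel)
                     for i in range(x.shape[0])])

def encoder_fusion(x, wq, wk, wv, kernel):
    attn_out, weights = self_attention(x, wq, wk, wv)
    fused = x + attn_out  # ASSUMPTION: fusion modeled as residual addition
    local = np.maximum(conv1d_same(fused, kernel), 0.0)  # ASSUMPTION: ReLU activation
    return local, weights

rng = np.random.default_rng(0)
L, d, d_out, k = 16, 8, 4, 3  # hypothetical sizes: 16 loci, 8 input channels
x = rng.normal(size=(L, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
kernel = rng.normal(size=(k, d, d_out)) * 0.1
features, weights = encoder_fusion(x, wq, wk, wv, kernel)
print(features.shape, weights.shape)
```

Each row of `weights` relates one locus to every other locus in the sequence, which is the full-sequence linkage a convolution with a small kernel cannot see on its own. Under the same assumptions, a decoder fusion module would apply the two pieces in the opposite order: convolution first, then self-attention and fusion.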

CLC Number: TP391.1

Cite this article: Jionghuan CHEN, Shengli BAO, Xiaofei WANG, Ruofan LI. Genotype imputation algorithm fusing convolution and self-attention mechanism[J]. Journal of Computer Applications, 2023, 43(11): 3534-3539.

Species dataset	10% missing rate		20% missing rate		30% missing rate
	Acc/%	Std	Acc/%	Std	Acc/%	Std
Dog	93.69	0.1086	93.52	0.1097	93.41	0.1105
Pig	85.26	0.0950	85.09	0.0926	84.95	0.0922
Chicken	84.62	0.0799	84.35	0.0767	84.06	0.0758
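
The Acc and Std columns can be read as the mean and the spread of per-sequence imputation accuracy over a batch. The page does not give the exact definitions, so the sketch below is an assumption: it hides a fixed fraction of loci at random, scores only the hidden loci, and reports the population standard deviation of per-sequence accuracy on a 0 to 1 scale. The genotype coding (0/1/2), the `-1` missing marker, and the `majority_impute` placeholder are hypothetical and stand in for a real model such as FCSA.

```python
import random
import statistics

def mask_genotypes(seq, missing_rate, rng, missing=-1):
    """Hide a fixed fraction of loci; return the masked copy and the hidden indices."""
    n_hide = round(len(seq) * missing_rate)
    hidden = sorted(rng.sample(range(len(seq)), n_hide))
    masked = list(seq)
    for i in hidden:
        masked[i] = missing
    return masked, hidden

def sequence_accuracy(truth, imputed, hidden):
    """Fraction of hidden loci whose imputed genotype equals the true one."""
    return sum(truth[i] == imputed[i] for i in hidden) / len(hidden)

def majority_impute(masked, missing=-1):
    """Placeholder imputer (NOT FCSA): fill gaps with the most common observed genotype."""
    observed = [g for g in masked if g != missing]
    fill = statistics.mode(observed)
    return [fill if g == missing else g for g in masked]

rng = random.Random(0)
# Synthetic batch: 8 sequences of 200 loci, genotypes coded 0/1/2 (hypothetical data).
batch = [[rng.choice([0, 0, 0, 1, 2]) for _ in range(200)] for _ in range(8)]

accs = []
for seq in batch:
    masked, hidden = mask_genotypes(seq, 0.2, rng)
    accs.append(sequence_accuracy(seq, majority_impute(masked), hidden))

print(f"Acc = {100 * statistics.mean(accs):.2f}%, Std = {statistics.pstdev(accs):.4f}")
```

A low Std under this reading means the model imputes most sequences in the batch about equally well, rather than doing very well on some and poorly on others.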

Model	Dog dataset						Chicken dataset
	10% missing rate		20% missing rate		30% missing rate		10% missing rate		20% missing rate		30% missing rate
	Acc/%	Std	Acc/%	Std	Acc/%	Std	Acc/%	Std	Acc/%	Std	Acc/%	Std
SCDA	93.37	0.1081	93.34	0.1076	93.29	0.1077	83.37	0.0846	82.64	0.0830	81.89	0.0843
AGIC	93.42	0.1366	93.38	0.1350	93.28	0.1367	83.67	0.1224	83.18	0.1163	82.51	0.1171
U-net	92.20	0.1454	92.20	0.1446	92.19	0.1441	76.81	0.1137	76.81	0.1101	76.79	0.1091
FCSA	93.69	0.1086	93.52	0.1097	93.41	0.1105	84.62	0.0799	84.35	0.0767	84.06	0.0758
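
The comparison above can be checked mechanically. The short script below copies the table values and, for each dataset and missing rate, reports which model has the highest Acc, which has the lowest Std, and FCSA's margin over the best baseline. The data layout and the `summarize` helper are illustrative choices, not part of the paper.

```python
# Acc/% and Std copied from the comparison table; list positions are 10%, 20%, 30% missing rate.
results = {
    "Dog": {
        "SCDA": ([93.37, 93.34, 93.29], [0.1081, 0.1076, 0.1077]),
        "AGIC": ([93.42, 93.38, 93.28], [0.1366, 0.1350, 0.1367]),
        "U-net": ([92.20, 92.20, 92.19], [0.1454, 0.1446, 0.1441]),
        "FCSA": ([93.69, 93.52, 93.41], [0.1086, 0.1097, 0.1105]),
    },
    "Chicken": {
        "SCDA": ([83.37, 82.64, 81.89], [0.0846, 0.0830, 0.0843]),
        "AGIC": ([83.67, 83.18, 82.51], [0.1224, 0.1163, 0.1171]),
        "U-net": ([76.81, 76.81, 76.79], [0.1137, 0.1101, 0.1091]),
        "FCSA": ([84.62, 84.35, 84.06], [0.0799, 0.0767, 0.0758]),
    },
}
rates = ["10%", "20%", "30%"]

def summarize(results):
    """Return (dataset, rate, best-Acc model, lowest-Std model, FCSA margin in points)."""
    rows = []
    for dataset, models in results.items():
        for j, rate in enumerate(rates):
            best_acc = max(models, key=lambda m: models[m][0][j])
            low_std = min(models, key=lambda m: models[m][1][j])
            best_base = max(acc[j] for m, (acc, _) in models.items() if m != "FCSA")
            margin = round(models["FCSA"][0][j] - best_base, 2)
            rows.append((dataset, rate, best_acc, low_std, margin))
    return rows

summary = summarize(results)
for row in summary:
    print(row)
```

FCSA has the highest Acc in all six columns, with margins of 0.12 to 0.27 points on Dog and 0.95 to 1.55 points on Chicken. Its Std is the lowest on Chicken, while on Dog SCDA's Std is marginally lower, which matches the abstract's wording of relatively low, rather than lowest, dispersion.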

Model	Acc/%	Model	Acc/%
FCSA-a	82.14	FCSA-c	77.72
FCSA-b	80.44	FCSA	84.06
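
As a quick reading aid for the ablation table, the snippet below computes how many accuracy points each variant loses relative to the full FCSA model. This page does not define the variants FCSA-a, FCSA-b and FCSA-c, so the code uses only the numbers in the table and makes no claim about which component each variant removes.

```python
# Acc/% values copied from the ablation table.
ablation = {"FCSA-a": 82.14, "FCSA-b": 80.44, "FCSA-c": 77.72, "FCSA": 84.06}

# Accuracy lost by each variant relative to the full model, in percentage points.
drops = {
    name: round(ablation["FCSA"] - acc, 2)
    for name, acc in ablation.items()
    if name != "FCSA"
}

for name, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{name}: -{drop:.2f} points vs. FCSA")
```

The full-model value of 84.06 coincides with FCSA's accuracy on the Chicken dataset at a 30% missing rate in the comparison table.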
