Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (11): 3534-3539. DOI: 10.11772/j.issn.1001-9081.2022111756

• Advanced computing •

Genotype imputation algorithm fusing convolution and self-attention mechanism

Jionghuan CHEN1,2, Shengli BAO1,2, Xiaofei WANG1,2, Ruofan LI1,2

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610213, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2022-11-24 Revised: 2023-02-06 Accepted: 2023-02-09 Online: 2023-02-28 Published: 2023-11-10
  • Contact: Shengli BAO (corresponding author), baoshengli@casit.com.cn
  • About author: CHEN Jionghuan, born in 1998 in Weifang, Shandong, M. S. candidate. His research interests include machine learning, big data systems, and large-scale data analysis.
    BAO Shengli, born in 1973 in Huangshan, Anhui, Ph. D., research fellow. His research interests include software engineering and big data intelligence.
    WANG Xiaofei, born in 1997 in Cili, Hunan, M. S. candidate. His research interests include machine learning and recommendation algorithms.
    LI Ruofan, born in 1997 in Lanzhou, Gansu, M. S. candidate. His research interests include machine learning and time series forecasting.
  • Supported by:
    “Western Young Scholars” Project of Chinese Academy of Sciences (RRJZ2021003)

Abstract:

Genotype imputation estimates the sample regions that gene sequencing data do not cover, and so compensates for data missing because of technical limitations. However, existing deep learning-based imputation methods cannot effectively capture the linkage among loci across the complete sequence, which results in low overall imputation accuracy and high dispersion of batch sequence imputation accuracy. To address these problems, FCSA (Fusing Convolution and Self-Attention), an imputation method that fuses convolution with the self-attention mechanism, was proposed. Its network model consists of an encoder and a decoder, each built from its own fusion module. In the encoder fusion module, a self-attention layer was used to obtain the correlation among the loci of the complete sequence; this correlation was fused into the global loci, and a convolutional layer then extracted local features. In the decoder fusion module, convolution was used to reconstruct local features from the encoded low-dimensional vector, and a self-attention layer was then applied to model the complete sequence and fuse the result. Genetic data from multiple animal species were used for model training, and comparison and validation were carried out on the Dog, Pig and Chicken datasets. The results show that, compared with SCDA (Sparse Convolutional Denoising Autoencoders), AGIC (Autoencoder Genome Imputation and Compression) and U-net, FCSA achieves the highest average imputation accuracy at missing rates of 10%, 20% and 30%, and the dispersion of its batch sequence imputation accuracy is smaller. Ablation results also show that the design of the two fusion modules effectively improves the accuracy of genotype imputation.

Key words: genotype imputation, convolution, self-attention, fusion module, full sequence modeling
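
The ordering of operations in the two fusion modules can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the one-hot input encoding, the identity query/key/value projections, the residual addition used as the "fusion" step, the fixed smoothing kernel, and all function names are assumptions made for illustration, and the down-sampling to a low-dimensional vector is omitted.

```python
import math


def one_hot(genotypes, n_classes=3):
    """Encode genotypes 0..n_classes-1 as one-hot rows; None (missing) becomes all zeros."""
    rows = []
    for g in genotypes:
        row = [0.0] * n_classes
        if g is not None:
            row[g] = 1.0
        rows.append(row)
    return rows


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention over all loci.

    Assumption: Q, K and V are the input itself (no learned projections).
    """
    d = len(x[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = [softmax(r) for r in scores]
    return [[sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
            for w in weights]


def conv1d_same(x, kernel):
    """Per-channel 1D convolution along the locus axis, zero-padded to keep the length."""
    pad = len(kernel) // 2
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        row = [0.0] * d
        for t, w in enumerate(kernel):
            j = i + t - pad
            if 0 <= j < n:
                for c in range(d):
                    row[c] += w * x[j][c]
        out.append(row)
    return out


def fuse(a, b):
    """Assumed fusion step: element-wise (residual) addition."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def encoder_fusion_block(x, kernel):
    """Self-attention over the full sequence, fuse, then convolution for local features."""
    return conv1d_same(fuse(x, self_attention(x)), kernel)


def decoder_fusion_block(z, kernel):
    """Convolution for local reconstruction, then self-attention over the full sequence, fuse."""
    local = conv1d_same(z, kernel)
    return fuse(local, self_attention(local))


# Toy sequence of six loci with one missing genotype (None).
sequence = [0, 1, None, 2, 1, 0]
kernel = [0.25, 0.5, 0.25]  # hypothetical fixed kernel; a real model would learn it
encoded = encoder_fusion_block(one_hot(sequence), kernel)
decoded = decoder_fusion_block(encoded, kernel)
```

In this sketch the encoder lets every locus attend to every other locus before the convolution looks at a local window, while the decoder reverses that order, mirroring the description in the abstract.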

CLC Number: