Journal of Computer Applications, 2021, Vol. 41, Issue 9: 2773-2779. DOI: 10.11772/j.issn.1001-9081.2020111875

Topic: Frontier and comprehensive applications


De novo peptide sequencing by tandem mass spectrometry based on graph convolutional neural network

MOU Changning (牟长宁), WANG Haipeng (王海鹏), ZHOU Piyu (周丕宇), HOU Xinhang (侯鑫行)

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255000, China
  • Received: 2020-12-02 Revised: 2021-01-08 Online: 2021-09-10 Published: 2021-05-12
  • Corresponding author: WANG Haipeng
  • About the authors: MOU Changning (born 1990), male, a native of Zibo, Shandong, is a master's student and CCF student member; his research interests include deep learning and bioinformatics. WANG Haipeng (born 1980), male, a native of Zibo, Shandong, is an associate professor with a Ph.D.; his research interests include machine learning and bioinformatics. ZHOU Piyu (born 1995), male, a native of Zibo, Shandong, is a master's student; his research interests include deep learning and bioinformatics. HOU Xinhang (born 1995), male, a native of Jining, Shandong, is a master's student; his research interests include deep learning and bioinformatics.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (31500669), the Shandong Provincial Natural Science Foundation (ZR2014FQ024), and the Support Program for Outstanding Youth Innovation Teams in Colleges and Universities of Shandong Province (2019KJN048).

Abstract: In proteomics, de novo sequencing is one of the important methods for peptide sequencing by tandem mass spectrometry. Because it does not depend on any protein database, it plays a key role in determining protein sequences of unknown species, sequencing monoclonal antibodies, and other fields. However, owing to the complexity of the task, de novo sequencing is much less accurate than database search methods, which limits its wide application. To address this low accuracy, denovo-GCN, a de novo sequencing method based on a Graph Convolutional neural Network (GCN), is proposed. In this method, the relationships between peaks in a mass spectrum are represented as a graph, and peak features are extracted for each corresponding peptide cleavage site. The GCN then predicts the amino acid type at the current cleavage site, and the complete peptide sequence is assembled step by step. Three important parameters affecting the model were determined experimentally: the number of GCN layers, the combination of ion types, and the number of spectral peaks used for sequencing. Datasets from multiple species were used for experimental comparison. Experimental results show that the peptide-level recall of denovo-GCN is 4.0 to 21.1 percentage points higher than that of the graph theory-based de novo sequencing methods Novor and pNovo, and 2.1 to 10.7 percentage points higher than that of DeepNovo, which is based on a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network.
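The abstract does not spell out how the peak graph is built or which features are used, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes, as in classical spectrum graphs, that two peaks are linked when their m/z gap matches an amino acid residue mass, and it applies one standard graph convolution layer, ReLU(D^-1/2 (A + I) D^-1/2 H W). The function names, the tolerance, the four-residue mass table, the toy peaks and the weights are all hypothetical and are not taken from denovo-GCN.

```python
import math

# Monoisotopic residue masses (Da) for four amino acids only; a real
# model would cover all residues. Illustrative subset, not the paper's setup.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}


def build_peak_graph(mz, tol=0.02):
    """Assumed rule: link peaks i and j if their m/z gap matches a residue mass."""
    n = len(mz)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            gap = abs(mz[j] - mz[i])
            if any(abs(gap - m) <= tol for m in RESIDUE_MASS.values()):
                adj[i][j] = adj[j][i] = 1.0
    return adj


def normalize(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops added."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_norm, feats, weights):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(a_norm, feats), weights)
    return [[max(0.0, v) for v in row] for row in out]


# Toy spectrum: peaks 0-1 differ by G, peaks 1-2 by A, peak 3 is isolated.
mz = [100.0, 157.02146, 228.05857, 400.0]
intensity = [[0.2], [1.0], [0.6], [0.1]]   # one made-up feature per peak
weights = [[0.5, -0.5]]                    # made-up 1 x 2 weight matrix

adj = build_peak_graph(mz)
a_norm = normalize(adj)
hidden = gcn_layer(a_norm, intensity, weights)
```

In a full model, several such layers would presumably be stacked and followed by a classifier over amino acid types for the current cleavage site; that readout is omitted here because the abstract gives no detail about it.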

Key words: Graph Convolutional neural Network (GCN), de novo sequencing, proteomics, tandem mass spectrometry
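The comparison in the abstract is stated in percentage points of peptide-level recall. As a small illustration of that arithmetic, the sketch below assumes the common definition in which a peptide counts as recalled only when the entire predicted sequence matches the reference. The function names and the toy sequences are hypothetical and do not come from the paper's datasets.

```python
def peptide_recall(predicted, reference):
    """Fraction of reference peptides whose predicted sequence matches exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must be aligned one-to-one")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def percentage_point_gap(recall_a, recall_b):
    """Difference between two recall values, expressed in percentage points."""
    return (recall_a - recall_b) * 100.0


# Made-up toy sequences, not data from the paper.
reference = ["PEPTIDE", "GASV", "AAGG", "VSAG"]
method_a = ["PEPTIDE", "GASV", "AAGG", "VSGA"]   # 3 of 4 fully correct
method_b = ["PEPTIDE", "GAVS", "AAGG", "VSGA"]   # 2 of 4 fully correct

recall_a = peptide_recall(method_a, reference)
recall_b = peptide_recall(method_b, reference)
gap = percentage_point_gap(recall_a, recall_b)
```

A gap in percentage points is an absolute difference between two recall values, not a relative improvement, which is how the ranges quoted in the abstract should be read.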
