Journal of Computer Applications


Transformer and gated recurrent unit-based de novo sequencing algorithm for phosphopeptides

  

  • Received: 2025-01-15  Revised: 2025-03-25  Online: 2025-04-27  Published: 2025-04-27
  • Supported by:
    National Key R&D Program of China


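Yao Lijin (姚理进)1, Zhang Di (张迪)1, Zhou Piyu (周丕宇)2, Qu Zhijian (曲志坚)1, Wang Haipeng (王海鹏)3

  1. Shandong University of Technology
  2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences
  3. School of Computer Science and Technology, Shandong University of Technology
  • Corresponding author: Yao Lijin (姚理进)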

Abstract: Peptide sequencing based on tandem mass spectrometry, commonly known as peptide identification, is a cornerstone technology in proteomics research. However, current de novo sequencing algorithms struggle to identify phosphopeptides accurately, despite their biological importance. The main difficulties are the complex fragmentation patterns induced by phosphorylation, the frequent occurrence of neutral loss peaks, and the generally low abundance of phosphopeptides in conventional mass spectrometry data. To address these issues, a de novo sequencing algorithm, TGNovo, based on Transformer and Gated Recurrent Unit (GRU), was proposed. TGNovo introduces a spectrum graph that explicitly models the mass differences between peaks and guides the Transformer encoder in capturing spectral features. The decoder associates these features with amino acid sequences, while the GRU models the relationships among peaks and between peaks and amino acids; together, these components reconstruct the peptide. Compared with the fully Transformer-based de novo sequencing algorithm Casanovo, TGNovo makes fuller use of prior spectral information through the spectrum graph and GRU module, which strengthens the model's ability to capture spectral structure. In cross-species evaluations on phosphopeptides, TGNovo outperforms Casanovo by an average of 16.5 percentage points in peptide-level recall and 37.1 percentage points in amino acid-level recall. In addition, experiments on an immunopeptide dataset show that the high-confidence antigenic peptides identified by TGNovo cover 86% of the database search results.
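The spectrum graph idea mentioned in the abstract can be illustrated with a minimal sketch. This is not TGNovo's implementation, which the abstract does not detail; it only shows the general notion of linking two peaks when their mass difference matches an amino acid residue mass. The function name `build_spectrum_graph`, the `tol` tolerance, the reduced residue table, and the "p" prefix for phosphorylated residues are all assumptions made for illustration, and peaks are assumed to be singly charged fragment ions of one ion series.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (a small illustrative subset, not the full alphabet).
PHOSPHO = 79.96633  # mass added by phosphorylation (HPO3)
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "T": 101.04768,
    "Y": 163.06333,
}
# Phosphorylated variants; the "p" prefix is a hypothetical notation for this sketch.
for aa in "STY":
    RESIDUE_MASS["p" + aa] = RESIDUE_MASS[aa] + PHOSPHO


def build_spectrum_graph(peaks_mz, tol=0.02):
    """Return edges (i, j, residue) between peaks whose m/z difference
    matches a residue mass within `tol` Da.

    Indices i < j refer to positions in the ascending-sorted peak list.
    Assumes singly charged peaks, so m/z differences equal mass differences.
    """
    mz = sorted(peaks_mz)
    edges = []
    for i, j in combinations(range(len(mz)), 2):
        diff = mz[j] - mz[i]
        for residue, mass in RESIDUE_MASS.items():
            if abs(diff - mass) <= tol:
                edges.append((i, j, residue))
    return edges


# Theoretical b-ions (b1, b2, b3) of a peptide starting G, A, pS, given unsorted.
peaks = [296.06421, 58.02874, 129.06585]
print(build_spectrum_graph(peaks))  # [(0, 1, 'A'), (1, 2, 'pS')]
```

In this toy example the edge labelled `pS` arises because the last two peaks differ by the serine residue mass plus the phosphate group. A real spectrum would additionally contain noise peaks, multiple ion series, and neutral loss peaks, which is where a learned model becomes necessary.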

Key words: de novo peptide sequencing, Transformer, Gated Recurrent Unit (GRU), spectrum graph, phosphopeptide
