Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (1): 297-304. DOI: 10.11772/j.issn.1001-9081.2025010060

• Frontier and comprehensive applications

Transformer and gated recurrent unit-based de novo sequencing algorithm for phosphopeptides

Lijin YAO1, Di ZHANG1, Piyu ZHOU2, Zhijian QU1, Haipeng WANG1

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255049, China
  2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2025-01-15 Revised: 2025-03-25 Accepted: 2025-03-26 Online: 2026-01-10 Published: 2026-01-10
  • Contact: Haipeng WANG
  • About the authors: YAO Lijin, born in 1999, M.S. candidate. His research interests include deep learning and bioinformatics.
    ZHANG Di, born in 1997, M.S. candidate. His research interests include deep learning and bioinformatics.
    ZHOU Piyu, born in 1995, Ph.D. candidate. His research interests include machine learning and bioinformatics.
    QU Zhijian, born in 1980, Ph.D., associate professor. His research interests include optimization algorithms and machine learning.
  • Supported by:
    National Key Research and Development Program of China (2022YFA1304603)

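De novo sequencing algorithm for phosphopeptides based on Transformer and gated recurrent unit

Lijin YAO (姚理进)1, Di ZHANG (张迪)1, Piyu ZHOU (周丕宇)2, Zhijian QU (曲志坚)1, Haipeng WANG (王海鹏)1

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255049, China
  2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  • Corresponding author: Haipeng WANG (王海鹏)
  • Author profiles: YAO Lijin (born 1999), male, from Liaocheng, Shandong; M.S. candidate and CCF student member. Research interests: deep learning and bioinformatics.
    ZHANG Di (born 1997), male, from Linyi, Shandong; M.S. candidate. Research interests: deep learning and bioinformatics.
    ZHOU Piyu (born 1995), male, from Zibo, Shandong; Ph.D. candidate. Research interests: machine learning and bioinformatics.
    QU Zhijian (born 1980), male, from Qingdao, Shandong; associate professor, Ph.D. Research interests: optimization algorithms and machine learning.
  • Funding:
    National Key Research and Development Program of China (2022YFA1304603)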

Abstract:

Sequencing the peptides produced by proteolytic digestion of proteins using tandem mass spectrometry (referred to as peptide identification) is a foundational technology in proteomics research. Current de novo peptide sequencing algorithms have limited accuracy when identifying phosphopeptides, which are of significant biological importance. The primary reasons are that phosphorylation induces more complex fragmentation patterns and frequently produces neutral loss peaks, and that phosphopeptide spectra are of low abundance in conventional mass spectrometric data. To address these issues, a Transformer and Gated Recurrent Unit (GRU)-based de novo sequencing algorithm for phosphopeptides, named TGNovo, was proposed. TGNovo introduces a spectrum graph that explicitly models the mass differences between peaks and guides the Transformer encoder in capturing spectral features. The Transformer module models the association between the spectrum and the amino acid sequence, while the GRU module models the dependencies among spectral peaks and among amino acids; the two modules work in concert to reconstruct the peptide. Compared with Casanovo, a de novo sequencing algorithm based entirely on the Transformer, TGNovo makes full use of prior spectral information through the spectrum graph and the GRU module, which strengthens the model's ability to model spectra. In cross-species evaluations on phosphopeptides, TGNovo outperforms Casanovo with average improvements of 16.5 percentage points in peptide-level recall and 37.1 percentage points in amino acid-level recall. Additionally, experimental results on an immunopeptide dataset show that the high-confidence antigenic peptides identified by TGNovo cover 86% of the database search results.

Key words: de novo sequencing, Transformer, Gated Recurrent Unit (GRU), spectrum graph, phosphopeptide
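
The abstract describes the spectrum graph only as a structure that explicitly models mass differences between peaks. A minimal Python sketch of that general idea follows; it is not TGNovo's actual construction, which the abstract does not specify. The function name `build_spectrum_graph`, the tolerance `tol`, the reduced residue table, and the assumption of singly charged fragments are all illustrative choices. The residue masses are standard monoisotopic values, and phosphorylation is represented as an added HPO3 group (79.96633 Da) on serine and threonine. Neutral loss peaks are not modeled.

```python
from itertools import combinations

# Standard monoisotopic residue masses in Da (a small subset, for illustration).
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
    "T": 101.04768,
    "L": 113.08406,
}

# Phosphorylation adds HPO3 to serine or threonine.
PHOSPHO = 79.96633
RESIDUE_MASS["S(ph)"] = RESIDUE_MASS["S"] + PHOSPHO
RESIDUE_MASS["T(ph)"] = RESIDUE_MASS["T"] + PHOSPHO


def build_spectrum_graph(mz_values, tol=0.02):
    """Connect peak pairs whose m/z gap matches a residue mass.

    Hypothetical helper: assumes singly charged fragment ions, so an
    m/z difference equals a mass difference. Returns the sorted peak
    list and edges of the form (low_index, high_index, residue).
    """
    peaks = sorted(mz_values)
    edges = []
    for i, j in combinations(range(len(peaks)), 2):
        diff = peaks[j] - peaks[i]
        for residue, mass in RESIDUE_MASS.items():
            if abs(diff - mass) <= tol:
                edges.append((i, j, residue))
    return peaks, edges


if __name__ == "__main__":
    # Three synthetic peaks spaced by alanine and then phosphoserine.
    peaks, edges = build_spectrum_graph([100.0, 171.03711, 338.03547])
    for i, j, residue in edges:
        print(f"{peaks[i]:.5f} -> {peaks[j]:.5f}: {residue}")
```

In a sketch like this, the edges make candidate amino acid steps between peaks explicit, which is the kind of prior information the abstract says is used to guide the encoder.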

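Abstract (translated from the Chinese):

Sequencing the peptides produced by enzymatic digestion of proteins with tandem mass spectrometry (known as peptide identification) is a supporting technology of proteomics research. Existing de novo peptide sequencing algorithms have limited accuracy when identifying phosphorylated peptides, which are of great biological significance. The main reasons are that phosphorylation makes fragmentation patterns more complex and readily produces neutral loss peaks, and that phosphopeptide spectra are of low abundance in conventional mass spectrometry data. Therefore, TGNovo, a de novo sequencing algorithm based on the Transformer and the Gated Recurrent Unit (GRU), is proposed. TGNovo introduces a spectrum peak connection graph that explicitly models the mass-difference relationships between spectral peaks and guides the Transformer encoder in capturing spectrum features. The Transformer module models the association between the spectrum and the amino acid sequence, while the GRU module models the dependencies among spectral peaks and among amino acids; the two work together to reconstruct the peptide. Compared with Casanovo, a de novo sequencing algorithm based entirely on the Transformer, TGNovo makes full use of prior spectrum information through the spectrum peak connection graph and the GRU module, strengthening the model's ability to model spectra. In cross-species phosphopeptide evaluation, TGNovo improves recall over Casanovo by an average of 16.5 percentage points at the peptide level and 37.1 percentage points at the amino acid level. In addition, experiments on an immunopeptide dataset show that the high-confidence antigenic peptides identified by TGNovo cover 86% of the database search results.

Key words (translated from the Chinese): de novo sequencing, Transformer, gated recurrent unit, spectrum peak connection graph, phosphopeptide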

CLC Number: