Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3958-3964.DOI: 10.11772/j.issn.1001-9081.2023121846

• Frontier and comprehensive applications

Theoretical tandem mass spectrometry prediction method for peptide sequences based on Transformer and gated recurrent unit

Changjiu HE1,2, Jinghan YANG2, Piyu ZHOU2, Xinye BIAN1, Mingming LYU1, Di DONG1, Yan FU2, Haipeng WANG1

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo Shandong 255049, China
  2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  • Received:2024-01-05 Revised:2024-03-25 Accepted:2024-04-02 Online:2024-04-15 Published:2024-12-10
  • Contact: Haipeng WANG
  • About author:HE Changjiu, born in 1997, M. S. candidate. His research interests include deep learning, bioinformatics.
    YANG Jinghan, born in 1995, Ph. D. candidate. Her research interests include deep learning, bioinformatics.
    ZHOU Piyu, born in 1995, M. S. His research interests include machine learning, bioinformatics.
    BIAN Xinye, born in 1998, M. S. candidate. Her research interests include deep learning, bioinformatics.
    LYU Mingming, born in 1997, M. S. candidate. His research interests include deep learning, bioinformatics.
    DONG Di, born in 2000, M. S. candidate. His research interests include deep learning, bioinformatics.
    FU Yan, born in 1977, Ph. D., research fellow. His research interests include bioinformatics, biostatistics.
  • Supported by:
    National Key Research and Development Program of China (2022YFA1304603); Support Program for Outstanding Youth Innovation Teams in Colleges and Universities of Shandong Province (2019KJN048)

Theoretical tandem mass spectrum prediction method for peptide sequences based on Transformer and gated recurrent unit

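HE Changjiu (何长久)1,2, YANG Jinghan (杨婧涵)2, ZHOU Piyu (周丕宇)2, BIAN Xinye (边昕烨)1, LYU Mingming (吕明明)1, DONG Di (董迪)1, FU Yan (付岩)2, WANG Haipeng (王海鹏)1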

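  1. School of Computer Science and Technology, Shandong University of Technology (山东理工大学), Zibo, Shandong 255049
  2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences (中国科学院), Beijing 100190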
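  • Corresponding author: WANG Haipeng (王海鹏)
  • About the authors: HE Changjiu (何长久), born in 1997, male, native of Zibo, Shandong, M. S. candidate. Main research interests: deep learning, bioinformatics.
    YANG Jinghan (杨婧涵), born in 1995, female, native of Leshan, Sichuan, Ph. D. candidate. Main research interests: deep learning, bioinformatics.
    ZHOU Piyu (周丕宇), born in 1995, male, native of Zibo, Shandong, M. S. Main research interests: machine learning, bioinformatics.
    BIAN Xinye (边昕烨), born in 1998, female, native of Zibo, Shandong, M. S. candidate. Main research interests: deep learning, bioinformatics.
    LYU Mingming (吕明明), born in 1997, male, native of Heze, Shandong, M. S. candidate. Main research interests: deep learning, bioinformatics.
    DONG Di (董迪), born in 2000, male, native of Xianyang, Shaanxi, M. S. candidate. Main research interests: deep learning, bioinformatics.
    FU Yan (付岩), born in 1977, male, native of Fushun, Liaoning, research fellow, Ph. D. Main research interests: bioinformatics, biostatistics.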
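  • Funding:
    National Key Research and Development Program of China (2022YFA1304603); Support Program for Outstanding Youth Innovation Teams in Colleges and Universities of Shandong Province (2019KJN048)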

Abstract:

To address two limitations of existing prediction methods, namely that they predict only b and y backbone fragment ions and that a single model struggles to capture the complex relationships within peptide sequences, a theoretical tandem mass spectrum prediction method for peptide sequences based on Transformer and Gated Recurrent Unit (GRU), named DeepCollider, was proposed. Firstly, a deep learning architecture combining Transformer and GRU was used, exploiting the self-attention mechanism and long-distance dependencies to strengthen the modeling of the relationship between peptide sequences and fragment ion intensities. Secondly, unlike existing methods, which encode a peptide sequence to predict all of its b and y backbone ions, fragmentation flags were used to mark fragmentation sites within the peptide sequence, so that each specific fragmentation site could be encoded and its corresponding fragment ions predicted. Finally, Pearson Correlation Coefficient (PCC) and Mean Absolute Error (MAE) were employed as evaluation metrics to measure the similarity between predicted and experimental spectra. Experimental results demonstrate that, compared with existing methods limited to predicting b and y backbone fragment ions, such as pDeep and Prosit, DeepCollider performs better on both metrics, with PCC increased by 0.15 and MAE decreased by 0.005. DeepCollider thus not only predicts b, y, and a backbone ions together with their corresponding water-loss and ammonia-loss neutral loss ions, but also further improves the peak coverage and similarity of theoretical spectrum prediction.
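
The two evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes that a predicted and an experimental spectrum have already been reduced to equal-length lists of fragment ion intensities in matching order; the function names and the zero-variance convention are illustrative choices, not taken from DeepCollider.

```python
import math
from typing import Sequence


def pearson_cc(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Pearson correlation coefficient between two aligned intensity vectors."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two vectors of equal length with at least 2 peaks")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        # Assumption: a constant vector is scored as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_p * var_o)


def mean_absolute_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean absolute difference between two aligned intensity vectors."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("need two non-empty vectors of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical normalized intensities for three fragment ions.
predicted = [0.0, 0.5, 1.0]
observed = [0.0, 0.7, 0.9]
print(pearson_cc(predicted, observed), mean_absolute_error(predicted, observed))
```

A higher PCC and a lower MAE both indicate that the predicted spectrum is closer to the experimental one, which is the direction of improvement reported above.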

Key words: theoretical mass spectrometry prediction, peptide sequence, fragment ion intensity, proteomics, deep learning

Abstract:

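To address the problems that existing theoretical tandem mass spectrum prediction is limited to b and y backbone fragment ions and that a single model struggles to capture the complex relationships within peptide sequences, a theoretical tandem mass spectrum prediction method for peptide sequences based on Transformer and Gated Recurrent Unit (GRU), named DeepCollider, is proposed. First, a deep learning architecture combining Transformer and GRU is used, drawing on the self-attention mechanism and long-distance dependencies to strengthen the modeling of the relationship between peptide sequences and fragment ion intensities. Second, unlike existing methods, which encode a peptide sequence to predict all of its b and y backbone ions, fragmentation flags are used to mark the fragmentation sites of the peptide sequence, so that a specific fragmentation site can be encoded and its corresponding fragment ions predicted. Finally, to compute the similarity between predicted and experimental spectra, Pearson Correlation Coefficient (PCC) and Mean Absolute Error (MAE) are used as evaluation metrics. Experimental results show that, compared with existing methods limited to predicting b and y backbone fragment ions (such as pDeep and Prosit), DeepCollider has advantages in both PCC and MAE, with PCC increased by 0.15 and MAE decreased by 0.005. DeepCollider can therefore not only predict b, y, and a backbone ions and their corresponding water-loss and ammonia-loss neutral loss ions, but also further improve the peak coverage and similarity of theoretical spectrum prediction.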
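
The idea of marking a fragmentation site with a flag can be sketched as follows. This is only one plausible reading of the abstract, not DeepCollider's actual encoding: the 20-letter amino acid alphabet, the integer token scheme, the 0/1 flag layout, and both function names are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: the 20 standard amino acids, indexed from 1 (0 is left for padding).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_with_flag(peptide: str, site: int) -> Tuple[List[int], List[int]]:
    """Encode a peptide together with a flag vector for one fragmentation site.

    `site` is the number of residues on the N-terminal side of the cleaved
    bond, so valid values are 1 .. len(peptide) - 1. Residues on the
    N-terminal side get flag 1, residues on the C-terminal side get flag 0
    (a hypothetical layout chosen for this sketch).
    """
    if not 1 <= site < len(peptide):
        raise ValueError("site must lie between 1 and len(peptide) - 1")
    tokens = [AA_INDEX[aa] for aa in peptide]
    flags = [1 if i < site else 0 for i in range(len(peptide))]
    return tokens, flags


def encode_all_sites(peptide: str) -> List[Tuple[List[int], List[int]]]:
    """Produce one (tokens, flags) pair per backbone fragmentation site."""
    return [encode_with_flag(peptide, s) for s in range(1, len(peptide))]


# One model input per site: a 7-residue peptide has 6 backbone sites.
for tokens, flags in encode_all_sites("PEPTIDE"):
    print(tokens, flags)
```

Under this layout a model receives one input per fragmentation site and can emit the intensities of the ion types associated with that site, in contrast to predicting all b and y ions of the peptide from a single encoding.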

Key words: theoretical mass spectrum prediction, peptide sequence, fragment ion intensity, proteomics, deep learning
