Theoretical tandem mass spectrometry prediction method for peptide sequences based on Transformer and gated recurrent unit

doi:10.11772/j.issn.1001-9081.2023121846

Journal of Computer Applications, 2024, Vol. 44, Issue (12): 3958-3964. DOI: 10.11772/j.issn.1001-9081.2023121846

• Frontier and Comprehensive Applications •

Changjiu HE¹,², Jinghan YANG², Piyu ZHOU², Xinye BIAN¹, Mingming LYU¹, Di DONG¹, Yan FU², Haipeng WANG¹

1. School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255049, China
2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

Received: 2024-01-05 Revised: 2024-03-25 Accepted: 2024-04-02 Online: 2024-04-15 Published: 2024-12-10
Contact: Haipeng WANG
About the authors: HE Changjiu, born in 1997 in Zibo, Shandong, M.S. candidate. His research interests include deep learning and bioinformatics.
YANG Jinghan, born in 1995 in Leshan, Sichuan, Ph.D. candidate. Her research interests include deep learning and bioinformatics.
ZHOU Piyu, born in 1995 in Zibo, Shandong, M.S. His research interests include machine learning and bioinformatics.
BIAN Xinye, born in 1998 in Zibo, Shandong, M.S. candidate. Her research interests include deep learning and bioinformatics.
LYU Mingming, born in 1997 in Heze, Shandong, M.S. candidate. His research interests include deep learning and bioinformatics.
DONG Di, born in 2000 in Xianyang, Shaanxi, M.S. candidate. His research interests include deep learning and bioinformatics.
FU Yan, born in 1977 in Fushun, Liaoning, Ph.D., research fellow. His research interests include bioinformatics and biostatistics.
Supported by:
National Key Research and Development Program of China (2022YFA1304603); Support Program for Outstanding Youth Innovation Teams in Colleges and Universities of Shandong Province (2019KJN048)

Abstract

Existing theoretical tandem mass spectrum prediction methods are limited to b and y backbone fragment ions, and a single model has difficulty capturing the complex relationships within peptide sequences. To address these problems, a theoretical tandem mass spectrometry prediction method for peptide sequences based on Transformer and Gated Recurrent Unit (GRU), named DeepCollider, was proposed. Firstly, a deep learning architecture combining Transformer and GRU was used, drawing on the self-attention mechanism and long-range dependencies, to strengthen the modeling of the relationship between peptide sequences and fragment ion intensities. Secondly, unlike existing methods that encode a peptide sequence to predict all b and y backbone ions, fragmentation flags were used to mark the fragmentation sites within a peptide sequence, so that a specific fragmentation site could be encoded and the corresponding fragment ions predicted. Finally, Pearson Correlation Coefficient (PCC) and Mean Absolute Error (MAE) were employed as evaluation metrics to measure the similarity between predicted and experimental spectra. Experimental results show that, compared with existing methods limited to b and y backbone fragment ions, such as pDeep and Prosit, DeepCollider has advantages in both metrics: the PCC value increases by 0.15 and the MAE value decreases by 0.005. DeepCollider thus not only predicts b, y and a backbone ions together with their corresponding water-loss and ammonia-loss neutral loss ions, but also further improves the peak coverage and similarity of theoretical spectrum prediction.

Key words: theoretical mass spectrometry prediction, peptide sequence, fragment ion intensity, proteomics, deep learning
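
The abstract says that fragmentation flags mark the fragmentation sites of a peptide so that each site can be encoded on its own. The sketch below only illustrates that idea. The abstract does not give the actual encoding used by DeepCollider, so the function name `flag_fragmentation_sites`, the 0/1 flag convention (1 for residues on the N-terminal side of the site) and the example peptide are all assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (amino acid, flag); hypothetical token layout


def flag_fragmentation_sites(peptide: str) -> List[Tuple[int, List[Token]]]:
    """Return one flagged copy of the peptide per backbone fragmentation site.

    Assumed convention (not taken from the paper): residues on the
    N-terminal side of the site carry flag 1, the remaining residues flag 0.
    """
    encoded = []
    # A peptide of length n has n - 1 backbone fragmentation sites.
    for site in range(1, len(peptide)):
        tokens = [(aa, 1 if i < site else 0) for i, aa in enumerate(peptide)]
        encoded.append((site, tokens))
    return encoded


# Example peptide chosen arbitrarily for illustration.
sites = flag_fragmentation_sites("PEPTK")
for site, tokens in sites:
    print(site, tokens)
```

With an encoding of this kind, a model would receive one input per site and predict only the fragment ions produced at that site, instead of emitting all backbone ions of the peptide from a single encoding.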

CLC Number: TP391.9

Changjiu HE, Jinghan YANG, Piyu ZHOU, Xinye BIAN, Mingming LYU, Di DONG, Yan FU, Haipeng WANG. Theoretical tandem mass spectrometry prediction method for peptide sequences based on Transformer and gated recurrent unit[J]. Journal of Computer Applications, 2024, 44(12): 3958-3964.

Figures/Tables: 11

References (37)

1	SUN R X, FU Y, LI D Q, et al. Computational proteomics based on mass spectrometry[J]. Science in China Series E: Information Sciences, 2006, 36(2): 222-234. (in Chinese)
2	OLSEN J V, MACEK B, LANGE O, et al. Higher-energy C-trap dissociation for peptide modification analysis[J]. Nature Methods, 2007, 4(9): 709-712.
3	CHI H, LIU C, YANG H, et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine[J]. Nature Biotechnology, 2018, 36(11): 1059-1061.
4	CHI H, HE K, YANG B, et al. pFind-Alioth: a novel unrestricted database search algorithm to improve the interpretation of high-resolution MS/MS data[J]. Journal of Proteomics, 2015, 125: 89-97.
5	WILHELM M, ZOLG D P, GRABER M, et al. Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics[J]. Nature Communications, 2021, 12: No.3346.
6	TIWARY S, LEVY R, GUTENBRUNNER P, et al. High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis[J]. Nature Methods, 2019, 16(6): 519-525.
7	VERBRUGGEN S, GESSULAT S, GABRIELS R, et al. Spectral prediction features as a solution for the search space size problem in proteogenomics[J]. Molecular and Cellular Proteomics, 2021, 20: No.100076.
8	ZHANG Z. Prediction of low-energy collision-induced dissociation spectra of peptides[J]. Analytical Chemistry, 2004, 76(14): 3908-3922.
9	ZHANG Z. Prediction of low-energy collision-induced dissociation spectra of peptides with three or more charges[J]. Analytical Chemistry, 2005, 77(19): 6364-6373.
10	SUN S W, YANG F Q, YANG Q, et al. MS-Simulator: predicting y-ion intensities for peptides with two charges based on the intensity ratio of neighboring ions[J]. Journal of Proteome Research, 2012, 11(9): 4509-4516.
11	WANG Y, YANG F, WU P, et al. OpenMS-Simulator: an open-source software for theoretical tandem mass spectrum prediction[J]. BMC Bioinformatics, 2015, 16: No.110.
12	ARNOLD R, JAYASANKAR N, AGGARWAL D, et al. A machine learning approach to predicting peptide fragmentation spectra[C]// Proceedings of the 2006 Pacific Symposium on Biocomputing. Singapore: World Scientific Publishing Co Pte Ltd, 2006: 219-230.
13	LI S, ARNOLD R J, TANG H, et al. On the accuracy and limits of peptide fragmentation spectrum prediction[J]. Analytical Chemistry, 2011, 83(3): 790-796.
14	DEGROEVE S, MADDELEIN D, MARTENS L. MS²PIP prediction server: compute and visualize MS² peak intensity predictions for CID and HCD fragmentation[J]. Nucleic Acids Research, 2015, 43(W1): W326-W330.
15	DEGROEVE S, MARTENS L. MS²PIP: a tool for MS/MS peak intensity prediction[J]. Bioinformatics, 2013, 29(24): 3199-3203.
16	DONG N P, LIANG Y Z, XU Q S, et al. Prediction of peptide fragment ion mass spectra by data mining techniques[J]. Analytical Chemistry, 2014, 86(15): 7446-7454.
17	YANG Y, LIN L, QIAO L. Deep learning approaches for data-independent acquisition proteomics[J]. Expert Review of Proteomics, 2021, 18(12): 1031-1043.
18	WEN B, ZENG W F, LIAO Y, et al. Deep learning in proteomics[J]. Proteomics, 2020, 20(21/22): No.1900335.
19	MEYER J G. Deep learning neural network tools for proteomics[J]. Cell Reports Methods, 2021, 1(2): No.100003.
20	ZHOU X X, ZENG W F, CHI H, et al. pDeep: predicting MS/MS spectra of peptides with deep learning[J]. Analytical Chemistry, 2017, 89(23): 12690-12697.
21	ZENG W F, ZHOU X X, ZHOU W J, et al. MS/MS spectrum prediction for modified peptides using pDeep2 trained by transfer learning[J]. Analytical Chemistry, 2019, 91(15): 9724-9731.
22	TARN C, ZENG W F. pDeep3: towards more accurate spectrum prediction with fast few-shot learning[J]. Analytical Chemistry, 2021, 93(14): 5815-5822.
23	ZENG W F, ZHOU X X, WILLEMS S, et al. AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics[J]. Nature Communications, 2022, 13: No.7238.
24	EKVALL M, TRUONG P, GABRIEL W, et al. Prosit Transformer: a transformer for prediction of MS2 spectrum intensities[J]. Journal of Proteome Research, 2022, 21(5): 1359-1364.
25	GESSULAT S, SCHMIDT T, ZOLG D P, et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning[J]. Nature Methods, 2019, 16(6): 509-518.
26	VASWANI A, SHAZEER N M, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
27	CHUNG J, GULCEHRE C, CHO K, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL]. [2023-11-11].
28	ZOLG D P, WILHELM M, SCHNATBAUM K, et al. Building ProteomeTools based on a complete synthetic human proteome[J]. Nature Methods, 2017, 14(3): 259-262.
29	CHICK J M, KOLIPPAKKAM D, NUSINOW D P, et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides[J]. Nature Biotechnology, 2015, 33(7): 743-749.
30	KULAK N A, PICHLER G, PARON I, et al. Minimal, encapsulated proteomic-sample processing applied to copy-number estimation in eukaryotic cells[J]. Nature Methods, 2014, 11(3): 319-324.
31	SHARMA K, SCHMITT S, BERGNER C G, et al. Cell type- and brain region-resolved mouse brain proteome[J]. Nature Neuroscience, 2015, 18(12): 1819-1831.
32	NARAYAN V, LY T, POURKARIMI E, et al. Deep proteome analysis identifies age-related processes in C. elegans[J]. Cell Systems, 2016, 3(2): 144-159.
33	YUAN Z F, LIU C, WANG H P, et al. pParse: a method for accurate determination of monoisotopic peaks in high-resolution mass spectra[J]. Proteomics, 2012, 12(2): 226-235.
34	TYANOVA S, TEMU T, CARLSON A, et al. Visualization of LC-MS/MS proteomics data in MaxQuant[J]. Proteomics, 2015, 15(8): 1453-1456.
35	LIU K, LI S, WANG L, et al. Full-spectrum prediction of peptides tandem mass spectra using deep neural network[J]. Analytical Chemistry, 2020, 92(6): 4275-4283.
36	LAPIN J, YAN X, DONG Q. UniSpec: deep learning for predicting the full range of peptide fragment ion series to enhance the proteomics data analysis workflow[J]. Analytical Chemistry, 2024, 96(7): 2783-2790.
37	COX J. Prediction of peptide mass spectral libraries with machine learning[J]. Nature Biotechnology, 2023, 41(1): 33-43.

Dataset ID	Species	Laboratory	Energy values used	Number of spectra
PXD004732 [28]	Synthetic	Kuster	20, 23, 25, 28, 30, 35	831 328
PXD001468 [29]	Human	Gygi	25	35 404
PXD000269 [30]	Yeast	Mann	25	66 008
PXD001250 [31]	Mouse	Mann	25, 27	102 719
PXD004584 [32]	Nematode	Kenyon	25	50 911

Ion type	PCC>0.70	PCC>0.75	PCC>0.80	PCC>0.85	PCC>0.90
18 ion types	99.49	98.99	98.15	96.22	92.15
b series	96.46	94.87	93.13	90.22	84.42
y series	99.40	99.15	98.60	97.64	95.16
a series	88.49	86.11	83.30	79.24	72.73

Ion type	Peptide sequence length
	≤10	11~15	16~20	21~25	≥26
18 ion types	0.990	0.982	0.968	0.951	0.931
b series	0.992	0.982	0.961	0.942	0.912
y series	0.993	0.988	0.979	0.972	0.953
a series	0.996	0.978	0.929	0.875	0.815

Model	PCC>0.70	PCC>0.75	PCC>0.80	PCC>0.85	PCC>0.90
pDeep default model	93.84	92.87	91.47	89.59	86.20
pDeep_re model	96.68	96.31	95.26	93.58	90.03
DeepCollider model	99.08	98.71	98.00	96.66	93.19

Metric	Method	PXD001468	PXD000269	PXD001250	PXD004584
Mean PCC	pDeep	0.668	0.812	0.781	0.818
	Prosit	0.662	0.812	0.775	0.813
	DeepCollider	0.847	0.918	0.883	0.890
Median PCC	pDeep	0.615	0.770	0.738	0.752
	Prosit	0.612	0.770	0.732	0.747
	DeepCollider	0.774	0.888	0.857	0.838
Mean MAE	pDeep	0.022	0.020	0.019	0.016
	Prosit	0.022	0.020	0.020	0.017
	DeepCollider	0.017	0.015	0.014	0.013
Median MAE	pDeep	0.023	0.020	0.021	0.018
	Prosit	0.023	0.020	0.022	0.019
	DeepCollider	0.019	0.015	0.016	0.015
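
PCC and MAE values of the kind summarized in the table above can be computed per spectrum and then aggregated. The following is a minimal sketch using the standard definitions of the two metrics. It assumes that predicted and experimental peaks have already been matched into equal-length intensity vectors and normalized to a common scale; the function names and the toy intensities are illustrative and do not come from the paper.

```python
from math import sqrt
from statistics import mean, median


def pearson_cc(predicted, experimental):
    """Pearson correlation coefficient between two intensity vectors."""
    mp, me = mean(predicted), mean(experimental)
    cov = sum((p - mp) * (e - me) for p, e in zip(predicted, experimental))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_e = sum((e - me) ** 2 for e in experimental)
    denom = sqrt(var_p * var_e)
    # Correlation is undefined for a constant vector; 0.0 is returned here
    # as a convention of this sketch.
    return cov / denom if denom > 0 else 0.0


def mean_abs_error(predicted, experimental):
    """Mean absolute error between two intensity vectors."""
    return mean(abs(p - e) for p, e in zip(predicted, experimental))


def summarize(spectrum_pairs):
    """Mean and median of per-spectrum PCC and MAE over a set of spectra."""
    pccs = [pearson_cc(p, e) for p, e in spectrum_pairs]
    maes = [mean_abs_error(p, e) for p, e in spectrum_pairs]
    return {
        "pcc_mean": mean(pccs), "pcc_median": median(pccs),
        "mae_mean": mean(maes), "mae_median": median(maes),
    }


# Toy (predicted, experimental) intensity pairs, made up for illustration.
pairs = [
    ([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]),
    ([1.0, 0.2, 0.0], [0.9, 0.3, 0.1]),
]
summary = summarize(pairs)
print(summary)
```

A higher PCC and a lower MAE both indicate that the predicted spectrum is closer to the experimental one, which is the direction of the DeepCollider rows in the table.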

Method	PXD001468		PXD000269		PXD001250		PXD004584
	PCC>0.70	PCC>0.90	PCC>0.70	PCC>0.90	PCC>0.70	PCC>0.90	PCC>0.70	PCC>0.90
pDeep	44.73	11.28	72.35	23.79	65.04	19.44	67.71	29.11
Prosit	44.89	10.25	73.57	21.58	66.94	16.04	67.47	26.53
DeepCollider	70.88	36.67	95.45	59.22	91.81	42.83	84.18	47.13
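
The threshold columns in the table above read most naturally as the percentage of spectra whose PCC exceeds each cutoff, although the page does not state the unit explicitly. Under that assumption, such a column could be computed as in the sketch below. The function name `percent_above` and the sample PCC values are hypothetical.

```python
def percent_above(pcc_values, thresholds=(0.70, 0.75, 0.80, 0.85, 0.90)):
    """Percentage of spectra whose PCC is strictly above each threshold."""
    n = len(pcc_values)
    if n == 0:
        raise ValueError("need at least one PCC value")
    return {t: 100.0 * sum(v > t for v in pcc_values) / n for t in thresholds}


# Made-up per-spectrum PCC values for illustration.
coverage = percent_above([0.95, 0.88, 0.72, 0.60])
for threshold, percent in coverage.items():
    print(f"PCC>{threshold:.2f}: {percent:.2f}")
```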
