Peptide spectrum match scoring algorithm based on multi-head attention mechanism and residual neural network

Journal of Computer Applications, 2020, Vol. 40, Issue 6: 1830-1836. DOI: 10.11772/j.issn.1001-9081.2019101880

• Virtual reality and multimedia computing •

MIN Xin, WANG Haipeng, MOU Changning

School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255000, China

Received: 2019-11-04 Revised: 2019-12-17 Online: 2020-06-18 Published: 2020-06-10
Corresponding author: WANG Haipeng (born 1980)
About the authors: MIN Xin (born 1995), male, from Chengdu, Sichuan, M.S. candidate; his research interests include deep learning and bioinformatics. WANG Haipeng (born 1980), male, from Zibo, Shandong, Ph.D., associate professor; his research interests include machine learning and bioinformatics. MOU Changning (born 1990), male, from Zibo, Shandong, M.S. candidate; his research interests include deep learning and bioinformatics.
Supported by:
National Natural Science Foundation of China (31500669); Shandong Provincial Natural Science Foundation (ZR2014FQ024); Support Program for Outstanding Youth Innovation Teams in Higher Education of Shandong Province (2019KJN048).

Abstract

Peptide spectrum match scoring algorithms play a key role in peptide sequence identification, but traditional scoring algorithms cannot make full and effective use of peptide fragmentation patterns. To address this problem, a multi-classification probability sum scoring algorithm that incorporates a representation of peptide sequence information, called deepScore-α, is proposed. The algorithm does not need a second scoring pass that takes global information into account, and it is not constrained by any particular method for computing the similarity between theoretical and experimental mass spectra. deepScore-α uses a one-dimensional residual network to extract the low-level information of the sequence, and then uses a multi-head attention mechanism to integrate the influence that the different peptide bond sites of the sequence have on the cleavage of the current peptide bond site, which produces the final probability matrix of the fragment ion relative intensity distribution. The final peptide spectrum match score is calculated by combining this matrix with the actual relative intensities of the fragment ions of the peptide sequence. The algorithm was compared with two commonly used open-source identification tools, Comet and MSGF+. On a human proteome dataset at a False Discovery Rate (FDR) of 0.01, the number of peptide sequences retained by deepScore-α increased by about 14%, and the Top1 hit ratio (the proportion of spectra whose highest-scoring candidate is the correct peptide sequence) increased by up to about 5 percentage points. The model trained on the human proteome dataset was then tested for generalization on the ProteomeTools2 dataset: at an FDR of 0.01, the number of peptide sequences retained by deepScore-α increased by about 7%, the Top1 hit ratio increased by about 5 percentage points, and the number of Top1 identifications coming from the Decoy library decreased by about 60%. The experimental results show that deepScore-α retains more peptide sequences at low FDR values, improves the Top1 hit ratio, and generalizes well.

Key words: scoring algorithm, peptide sequence identification, attention mechanism, residual network, multi-classification probability sum
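
The abstract names two computations without giving formulas: a "multi-classification probability sum" score and a count of peptide sequences retained at a given FDR. The following is a minimal sketch of both, not the authors' implementation. It assumes that relative intensities are discretized into equal-width classes, that the score is the sum of the predicted probabilities of the observed classes, and that FDR is estimated with the common target-decoy ratio (decoys divided by targets). The function names `intensity_class`, `probability_sum_score`, and `targets_at_fdr` are hypothetical.

```python
from typing import Sequence, Tuple


def intensity_class(rel_intensity: float, n_classes: int) -> int:
    """Map a relative intensity in [0, 1] to one of n_classes equal-width bins.

    Assumption: equal-width binning; the paper may discretize differently.
    """
    if not 0.0 <= rel_intensity <= 1.0:
        raise ValueError("relative intensity must lie in [0, 1]")
    # An intensity of exactly 1.0 is clamped into the last bin.
    return min(int(rel_intensity * n_classes), n_classes - 1)


def probability_sum_score(probs: Sequence[Sequence[float]],
                          observed_classes: Sequence[int]) -> float:
    """Sum the predicted probability of the observed class over all ion slots.

    probs[i][k] is the predicted probability that fragment-ion slot i falls
    into intensity class k; observed_classes[i] is the class actually seen.
    """
    if len(probs) != len(observed_classes):
        raise ValueError("one observed class is needed per fragment-ion slot")
    return sum(row[cls] for row, cls in zip(probs, observed_classes))


def targets_at_fdr(psms: Sequence[Tuple[float, bool]],
                   max_fdr: float = 0.01) -> int:
    """Count target matches retained at the loosest cutoff with FDR <= max_fdr.

    psms holds (score, is_decoy) pairs. FDR at a cutoff is estimated as
    decoys / targets among matches scoring at or above that cutoff.
    Tied scores are not treated specially in this sketch.
    """
    best = targets = decoys = 0
    for _score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best
```

In this sketch, each candidate peptide for a spectrum would be scored with `probability_sum_score`, the highest-scoring candidate would be kept as the Top1 match, and `targets_at_fdr` would then be applied to the Top1 matches of all spectra.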

CLC number: TP391

Cite this article:

MIN Xin, WANG Haipeng, MOU Changning. Peptide spectrum match scoring algorithm based on multi-head attention mechanism and residual neural network[J]. Journal of Computer Applications, 2020, 40(6): 1830-1836.

