Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (6): 1830-1836. DOI: 10.11772/j.issn.1001-9081.2019101880

• Virtual reality and multimedia computing •

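Peptide spectrum match scoring algorithm based on multi-head attention mechanism and residual neural network

MIN Xin, WANG Haipeng, MOU Changning

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255000
  • Received: 2019-11-04 Revised: 2019-12-17 Online: 2020-06-10 Published: 2020-06-18
  • Corresponding author: WANG Haipeng (born 1980)
  • About the authors: MIN Xin (born 1995), male, from Chengdu, Sichuan, M. S. candidate; main research interests: deep learning and bioinformatics. WANG Haipeng (born 1980), male, from Zibo, Shandong, associate professor, Ph. D.; main research interests: machine learning and bioinformatics. MOU Changning (born 1990), male, from Zibo, Shandong, M. S. candidate; main research interests: deep learning and bioinformatics.
  • Funding:
    National Natural Science Foundation of China (31500669); Shandong Provincial Natural Science Foundation (ZR2014FQ024); Support Program for Outstanding Youth Innovation Teams in Higher Education of Shandong Province (2019KJN048).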

Peptide spectrum match scoring algorithm based on multi-head attention mechanism and residual neural network

MIN Xin, WANG Haipeng, MOU Changning   

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255000, China
  • Received: 2019-11-04 Revised: 2019-12-17 Online: 2020-06-10 Published: 2020-06-18
  • Contact: WANG Haipeng, born in 1980, Ph. D., associate professor. His research interests include machine learning and bioinformatics.
  • About the authors: MIN Xin, born in 1995, M. S. candidate. His research interests include deep learning and bioinformatics. WANG Haipeng, born in 1980, Ph. D., associate professor. His research interests include machine learning and bioinformatics. MOU Changning, born in 1990, M. S. candidate. His research interests include deep learning and bioinformatics.
  • Supported by:
    National Natural Science Foundation of China (31500669), the Shandong Provincial Natural Science Foundation (ZR2014FQ024), the Support Program for Outstanding Youth Innovation Teams in Higher Education of Shandong Province (2019KJN048).

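Abstract: Peptide-spectrum match scoring algorithms play a key role in peptide sequence identification, but traditional scoring algorithms cannot make full and effective use of peptide fragmentation patterns when scoring. To address this problem, deepScore-α was proposed: a multi-classification probability sum scoring algorithm that incorporates a representation of peptide sequence information. The algorithm does not need to take global information into account for a second round of scoring, and it is not limited by the method used to compute similarity between theoretical and experimental mass spectra. deepScore-α uses a one-dimensional residual network to extract low-level information from the sequence, then uses a multi-head attention mechanism to fuse the influence that the different peptide bond sites in the sequence have on cleavage at the current peptide bond site, producing the final probability matrix of fragment ion relative intensity distributions. This matrix is combined with the actual relative intensities of the peptide's fragment ions to compute the final peptide-spectrum match score. The algorithm was compared with the widely used open-source identification tools Comet and MSGF+. On the human proteome dataset at a false discovery rate (FDR) of 0.01, the number of peptide sequences retained by deepScore-α increased by about 14%, and the Top1 hit ratio (the proportion of spectra for which the correct peptide sequence receives the highest score) increased by up to about 5 percentage points. The model trained on the human proteome dataset was then tested for generalization on the ProteomeTools2 dataset. The results show that at an FDR of 0.01 deepScore-α retained about 7% more peptide sequences, the Top1 hit ratio increased by about 5 percentage points, and the number of Top1 identifications coming from the Decoy library decreased by about 60%. The experimental results show that deepScore-α retains more peptide sequences at low FDR values, improves the Top1 hit ratio, and generalizes well.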
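
The "multi-classification probability sum" idea in the abstract can be illustrated with a small sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: relative intensities are discretised into equal-width classes, the network output is replaced by a hand-written probability matrix, and the name `probability_sum_score` is hypothetical. The score is taken to be the sum, over fragment ions, of the probability the model assigns to the intensity class actually observed in the spectrum.

```python
import numpy as np

def probability_sum_score(prob_matrix, observed_intensity):
    """Sum the predicted probability of each ion's observed intensity class.

    prob_matrix: shape (n_ions, n_classes); each row is a predicted
        distribution over relative-intensity classes (assumed layout).
    observed_intensity: shape (n_ions,); relative intensities in [0, 1].
    """
    prob = np.asarray(prob_matrix, dtype=float)
    obs = np.clip(np.asarray(observed_intensity, dtype=float), 0.0, 1.0)
    n_ions, n_classes = prob.shape
    # Assumed discretisation: equal-width bins over [0, 1];
    # an intensity of exactly 1.0 falls into the last class.
    classes = np.minimum((obs * n_classes).astype(int), n_classes - 1)
    # Pick, for every ion, the probability of its observed class and sum.
    return float(prob[np.arange(n_ions), classes].sum())

# Toy example: two fragment ions, three intensity classes.
predicted = [[0.7, 0.2, 0.1],
             [0.1, 0.1, 0.8]]
print(probability_sum_score(predicted, [0.0, 1.0]))  # prediction agrees with spectrum
print(probability_sum_score(predicted, [0.9, 0.1]))  # prediction disagrees
```

Under this reading, a candidate peptide whose predicted intensity distributions agree with the experimental spectrum accumulates a higher score than one whose predictions disagree, without an explicit spectrum-to-spectrum similarity measure.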

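Key words: scoring algorithm, peptide sequence identification, attention mechanism, residual network, multi-classification probability sum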

Abstract: Peptide spectrum match scoring algorithms play a key role in peptide sequence identification, but traditional scoring algorithms cannot make full, effective use of peptide fragmentation patterns. To solve this problem, a multi-classification probability sum scoring algorithm combined with a representation of peptide sequence information, called deepScore-α, was proposed. The algorithm does not need a second scoring pass that considers global information, and it is not limited by the method used to calculate the similarity between theoretical and experimental mass spectra. In the algorithm, a one-dimensional residual network was used to extract the underlying information of the sequence, and then the effects of the different peptide bond sites on cleavage at the current peptide bond site were integrated through a multi-head attention mechanism to generate the final probability matrix of fragment ion relative intensity distributions. The final peptide spectrum match score was then calculated by combining this matrix with the actual relative intensities of the peptide's fragment ions. The algorithm was compared with Comet and MSGF+, two common open-source identification tools. The results show that at a False Discovery Rate (FDR) of 0.01 on the human proteome dataset, the number of peptide sequences retained by deepScore-α increased by about 14%, and the Top1 hit ratio (the proportion of spectra for which the correct peptide sequence receives the highest score) increased by up to about 5 percentage points. A generalization test on the ProteomeTools2 dataset, using the model trained on the human proteome dataset, shows that at an FDR of 0.01 the number of peptide sequences retained by deepScore-α increased by about 7%, the Top1 hit ratio increased by about 5 percentage points, and the number of Top1 identifications from the Decoy library decreased by about 60%. The experimental results show that the algorithm retains more peptide sequences at low FDR values, improves the Top1 hit ratio, and has good generalization performance.
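
The comparisons above are reported at an FDR of 0.01 and refer to a Decoy library, but the abstract does not say how FDR is estimated. The sketch below assumes the common target-decoy estimate (decoy matches divided by target matches above a score cutoff) and counts how many target matches can be retained at a given FDR. The function name `count_targets_at_fdr` and the toy data are hypothetical, and tied scores are not treated specially.

```python
import numpy as np

def count_targets_at_fdr(scores, is_decoy, fdr_threshold=0.01):
    """Count target matches retained at the loosest cutoff meeting the FDR.

    scores: match scores, higher is better.
    is_decoy: True where the match comes from the decoy library.
    FDR at a cutoff is estimated as (#decoys) / (#targets) above it,
    which is an assumed estimator, not one stated in the abstract.
    """
    scores = np.asarray(scores, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # best score first
    decoy_sorted = is_decoy[order]
    n_decoy = np.cumsum(decoy_sorted)           # decoys at or above each rank
    n_target = np.cumsum(~decoy_sorted)         # targets at or above each rank
    fdr = n_decoy / np.maximum(n_target, 1)     # avoid division by zero
    passing = np.nonzero(fdr <= fdr_threshold)[0]
    if passing.size == 0:
        return 0
    return int(n_target[passing[-1]])

# Toy example: five matches, the third-best one is a decoy.
scores = [10.0, 9.0, 8.0, 7.0, 6.0]
decoy = [False, False, True, False, False]
print(count_targets_at_fdr(scores, decoy, fdr_threshold=0.01))
print(count_targets_at_fdr(scores, decoy, fdr_threshold=0.30))
```

A scoring function that ranks decoy matches lower pushes the first decoy further down the list, so more target matches survive the same FDR cutoff; this is the sense in which a scoring algorithm "retains more peptide sequences" at a fixed FDR.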

Key words: scoring algorithm, peptide sequence identification, attention mechanism, residual network, multi-classification probability sum

CLC number: