Journal of Computer Applications, 2025, Vol. 45, Issue (2): 546-555. DOI: 10.11772/j.issn.1001-9081.2024020177

• Advanced computing •

DeepsORF: coding sORFs prediction method based on graph coding with improved flow attention

Dongmei XIE¹, Xinye BIAN¹, Lianfei YU¹, Wenbo LIU¹, Ziling WANG¹, Zhijian QU¹, Jiafeng YU²

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo Shandong 255049, China
  2. Institute of Biophysics, Dezhou University (Shandong Key Laboratory of Biophysics), Dezhou Shandong 253023, China
  • Received: 2024-02-26 Revised: 2024-04-14 Accepted: 2024-04-16 Online: 2024-06-04 Published: 2025-02-10
  • Contact: Zhijian QU
  • About author: XIE Dongmei, born in 1998, a native of Zibo, Shandong, M. S. candidate, CCF member. Her research interests include deep learning and bioinformatics.
    BIAN Xinye, born in 1998, a native of Zibo, Shandong, M. S. candidate. Her research interests include deep learning and proteomics.
    YU Lianfei, born in 1998, a native of Zhoukou, Henan, M. S. candidate. His research interests include deep learning and big data analysis.
    LIU Wenbo, born in 1998, a native of Heze, Shandong, M. S. candidate, CCF member. His research interests include deep learning and big data analysis.
    WANG Ziling, born in 1996, a native of Binzhou, Shandong, M. S. candidate. Her research interests include deep learning and bioinformatics.
    YU Jiafeng, born in 1979, a native of Zibo, Shandong, Ph. D., professor. His research interests include biological sequence analysis and bioinformatics.
  • Supported by:
    Youth Innovation Team Development Program of Shandong Province Higher Education Institutions (2019KJN048)

Abstract:

Small Open Reading Frames (sORFs) play a critical role in various biological processes, and accurately distinguishing coding sORFs from non-coding sORFs is an important and challenging task in genomics. Most existing algorithms for predicting coding sORFs rely heavily on handcrafted features derived from prior biological knowledge and therefore lack generality, and the variable lengths of raw sORF sequences prevent them from being fed directly into a prediction model. To address these problems, DeepsORF, an end-to-end deep learning framework based on the sORF-Graph graph encoding method, was proposed for predicting coding sORFs. Firstly, all sORF sequences were encoded into corresponding graphs through sORF-Graph, with the sequence information encoded as graph element features, so that the input sequences were standardized. Then, a flow attention mechanism based on convolution and residual connections was introduced to capture interactions between distant bases within sORFs, thereby representing sORF features more effectively and improving the model's prediction accuracy. Experimental results demonstrate that the DeepsORF framework improves performance on all six independent test sets. Compared with the csORF-finder method, DeepsORF achieves increases of 9.97, 19.49, and 13.07 percentage points in accuracy, Matthews Correlation Coefficient (MCC), and precision, respectively, on the D. melanogaster nonCDS-sORFs test set, which validates the effectiveness and good generalization ability of the DeepsORF model in identifying coding and non-coding sORFs.

Key words: small Open Reading Frames (sORFs), coding sORFs, end-to-end, graph encoding, flow attention
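The abstract reports gains in accuracy, MCC, and precision as percentage-point differences. As a reading aid, the sketch below shows how these three metrics are commonly computed from a binary confusion matrix, using the standard textbook definitions. The function names and the example counts are hypothetical and are not taken from the paper; the authors' evaluation code may differ.

```python
import math


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision and MCC from binary confusion-matrix counts.

    Standard definitions only; not the authors' evaluation script.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    accuracy = (tp + tn) / total
    # Precision is undefined when nothing is predicted positive; report 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "mcc": mcc}


def percentage_point_gain(new: float, old: float) -> float:
    """Difference between two rates (each in [0, 1]) in percentage points."""
    return (new - old) * 100.0


if __name__ == "__main__":
    # Hypothetical counts for two classifiers on the same test set.
    baseline = confusion_metrics(tp=35, tn=40, fp=10, fn=15)
    improved = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
    for name in ("accuracy", "mcc", "precision"):
        gain = percentage_point_gain(improved[name], baseline[name])
        print(f"{name}: {gain:+.2f} percentage points")
```

A percentage-point gain is an absolute difference between two rates, not a relative improvement. MCC ranges over [-1, 1], so its gain is expressed here on the same scale after multiplying by 100.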

CLC Number: