Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (12): 3614-3619. DOI: 10.11772/j.issn.1001-9081.2021061082

• The 18th China Conference on Machine Learning (CCML 2021) •

Prediction model of lncRNA-encoded short peptides based on representation learning and deep forest

Tengqi JI, Jun MENG, Siyuan ZHAO, Hehuan HU

  1. School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Received: 2021-05-12 Revised: 2021-06-24 Accepted: 2021-07-21 Online: 2021-12-28 Published: 2021-12-10
  • Contact: Jun MENG
  • About author: JI Tengqi, born in 1996, from Yantai, Shandong, M. S. candidate. His research interests include bioinformatics and machine learning.
    ZHAO Siyuan, born in 1992, from Panjin, Liaoning, Ph. D. candidate. His research interests include machine learning and data mining.
    HU Hehuan, born in 1997, from Shenyang, Liaoning, M. S. candidate. His research interests include bioinformatics and deep learning.
  • Supported by:
    the National Natural Science Foundation of China (61872055)

Abstract:

Small Open Reading Frames (sORFs) in long non-coding RNA (lncRNA) can encode short peptides of no more than 100 amino acids. To address two problems in short peptide prediction, namely that the features of sORFs in lncRNA are not distinctive and that high-confidence data are still scarce, a Deep Forest (DF) model based on representation learning was proposed. Firstly, conventional lncRNA feature extraction methods were used to encode the sORFs. Secondly, an AutoEncoder (AE) was used for representation learning to obtain an efficient representation of the input data. Finally, a DF model was trained to predict the short peptides encoded by lncRNA. Experimental results show that the model achieves an accuracy of 92.08% on the Arabidopsis thaliana dataset, higher than those of traditional machine learning models, deep learning models and combined models, and that it has good stability. In addition, when tested on the Glycine max and Zea mays datasets, the model reaches accuracies of 78.16% and 74.92% respectively, verifying its good generalization ability.

Key words: long non-coding RNA (lncRNA), small Open Reading Frames (sORFs), short peptide, representation learning, Deep Forest (DF), prediction
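
The first stage described in the abstract, encoding sORFs as numeric features, might look roughly like the sketch below. The abstract does not say which features the authors used, so the k-mer frequency encoding here is an assumption chosen only because it is a common sequence descriptor. The function names `find_sorfs` and `kmer_frequencies` are hypothetical, and only the forward strand is scanned. The AE and DF stages are not shown.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_PEPTIDE_LEN = 100  # amino acids, following the abstract's sORF definition


def find_sorfs(seq, max_aa=MAX_PEPTIDE_LEN):
    """Return forward-strand ORFs (ATG ... stop) encoding at most max_aa residues."""
    seq = seq.upper().replace("U", "T")
    sorfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3  # residues including Met, stop excluded
                if n_aa <= max_aa:
                    sorfs.append(seq[start:i + 3])
                start = None
    return sorfs


def kmer_frequencies(seq, k=3):
    """Assumed feature encoding: normalized counts of overlapping k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    for sorf in find_sorfs("CCATGAAATTTTAGGG"):
        features = kmer_frequencies(sorf, k=3)
        print(sorf, len(features))
```

Each sORF becomes a fixed-length vector (64 values for k = 3), which is the kind of input a downstream autoencoder and classifier would expect.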
