CCML2021+85: Predictive model of lncRNA-encoded short peptides based on representation learning and Deep Forest

JI Tengqi, MENG Jun, ZHAO Siyuan, HU Hehuan

  1. Dalian University of Technology
  • Received: 2021-06-24 Revised: 2021-06-13 Online: 2021-06-13
  • Corresponding author: MENG Jun

Abstract: Small open reading frames (sORFs) in long non-coding RNA (lncRNA) can encode short peptides of no more than 100 amino acids. To address two problems in short peptide prediction, namely that the features of sORFs in lncRNA are indistinct and that high-confidence data remain scarce, a deep forest model based on representation learning is proposed. First, conventional lncRNA feature extraction methods are used to encode the sORFs. Second, an autoencoder performs representation learning to obtain a more effective feature representation. Finally, a deep forest model is trained to predict lncRNA-encoded short peptides. Experimental results show that the method achieves an accuracy of 92.08% on the Arabidopsis thaliana dataset, higher than traditional machine learning models, deep learning models, and combined models, with good stability. In addition, when the model is tested on the Glycine max (soybean) and Zea mays (maize) datasets, the accuracy reaches 78.16% and 74.92%, respectively, which verifies the model's good generalization ability.

Keywords: long non-coding RNA (lncRNA), small open reading frame (sORF), short peptide, representation learning, deep forest, prediction
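
The first step of the pipeline, encoding sORFs with conventional sequence features, can be illustrated with a minimal sketch. The abstract does not say which features the authors use, so the k-mer frequency encoding below is an assumption chosen only because it is a common sequence descriptor. The function names `find_sorfs` and `kmer_frequencies`, the restriction to the three forward reading frames, and the use of ATG as the only start codon are likewise hypothetical simplifications. The 100-residue limit comes from the abstract's definition of a short peptide. The autoencoder and deep forest stages are not shown.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_PEPTIDE_LEN = 100  # amino acids; upper bound for a short peptide per the abstract


def find_sorfs(seq, max_peptide_len=MAX_PEPTIDE_LEN):
    """Return candidate sORFs (start codon through stop codon) in the 3 forward frames.

    Assumption: only ATG is treated as a start codon and the reverse strand
    is ignored; a real pipeline may differ.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    sorfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_residues = (i - start) // 3  # codons before the stop, ATG included
                if n_residues <= max_peptide_len:
                    sorfs.append(seq[start:i + 3])
                start = None  # look for the next ORF in this frame
    return sorfs


def kmer_frequencies(seq, k=3):
    """Encode a sequence as a fixed-length vector of overlapping k-mer frequencies.

    Assumption: k-mer frequency stands in for the unspecified
    "conventional lncRNA features" mentioned in the abstract.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # ignore k-mers with ambiguous bases
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    transcript = "CCATGGCTTGAAATGAAACCCGGGTAGCC"  # toy sequence, not real data
    for sorf in find_sorfs(transcript):
        features = kmer_frequencies(sorf, k=3)
        print(sorf, len(features))
```

Each sORF becomes a 64-dimensional vector when `k=3`. In the workflow the abstract describes, vectors of this kind would be passed to an autoencoder, and the learned representation would then be used to train the deep forest classifier.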
