Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (11): 3393-3399. DOI: 10.11772/j.issn.1001-9081.2020040510

• Frontier & interdisciplinary applications •

Prediction of protein subcellular localization based on deep learning

WANG Yihao, DING Hongwei, LI Bo, BAO Liyong, ZHANG Yingjie   

  1. School of Information Science and Engineering, Yunnan University, Kunming Yunnan 650500, China
  • Received: 2020-04-22 Revised: 2020-06-15 Online: 2020-11-10 Published: 2020-07-09
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61461053, 61461054).


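  • Corresponding author: DING Hongwei (born 1964), male, from Kunming, Yunnan; professor, Ph.D., CCF member. Research interests: random multiple access, polling systems, wireless sensor networks, machine learning.
  • About the authors: WANG Yihao (born 1995), male, from Dongying, Shandong; M.S. candidate, CCF member. Research interests: machine learning, biological information processing. LI Bo (born 1976), male, from Kunming, Yunnan; professor, Ph.D., CCF member. Research interests: edge computing, mobile cloud computing, Internet of Things. BAO Liyong (born 1975), male, from Chuxiong, Yunnan; associate professor, Ph.D., CCF member. Research interests: MAC-layer multiple access in communication networks, collision resolution, chaotic spread-spectrum communication. ZHANG Yingjie (born 1993), female, from Kunming, Yunnan; M.S. candidate, CCF member. Research interests: biological information processing, machine learning.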

Abstract: To address the problem that traditional machine learning algorithms still require manually designed feature representations, a protein subcellular localization algorithm based on a Stacked Denoising AutoEncoder (SDAE) deep network was proposed. Firstly, the improved Pseudo-Amino Acid Composition (PseAAC), Pseudo Position Specific Scoring Matrix (PsePSSM) and Conjoint Triad (CT) methods were each used to extract features from the protein sequence, and the three resulting feature vectors were fused to obtain a new feature representation model of the protein sequence. Secondly, the fused feature vector was fed into the SDAE deep network to automatically learn a more effective feature representation. Thirdly, a Softmax regression classifier was adopted to predict the subcellular location, and leave-one-out cross validation was performed on the Viral proteins and Plant proteins datasets. Finally, the results of the proposed algorithm were compared with those of existing algorithms such as mGOASVM (multi-label protein subcellular localization based on Gene Ontology and Support Vector Machine) and HybridGO-Loc (mining Hybrid features on Gene Ontology for predicting subcellular Localization of multi-location proteins). Experimental results show that the proposed algorithm achieves 98.24% accuracy on the Viral proteins dataset, which is 9.35 percentage points higher than that of mGOASVM, and 97.63% accuracy on the Plant proteins dataset, which is 10.21 and 4.07 percentage points higher than those of mGOASVM and HybridGO-Loc respectively. These results indicate that the proposed algorithm can effectively improve the accuracy of protein subcellular localization prediction.

Key words: deep learning, feature fusion, protein localization, Stacked Denoising AutoEncoder (SDAE), leave-one-out cross validation
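
The front end of the pipeline described in the abstract (sequence encoding, feature fusion, leave-one-out evaluation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The seven-class amino acid grouping is the one commonly used for Conjoint Triad encoding and is assumed here, as are the frequency normalization, fusion by plain concatenation, and the placeholder PseAAC and PsePSSM dimensions (20 and 40). The function names `conjoint_triad`, `fuse` and `leave_one_out` are hypothetical, and the SDAE and Softmax stages are omitted.

```python
from itertools import product

# Seven amino-acid classes commonly used for Conjoint Triad encoding.
# Assumption: the paper may use a different grouping or normalization.
CT_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CT_CLASSES) for aa in group}
TRIADS = list(product(range(7), repeat=3))  # 7^3 = 343 class triads
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def conjoint_triad(sequence):
    """Return a 343-dimensional vector of class-triad frequencies.

    Residues outside the 20 standard amino acids are dropped before
    triads are counted (a simplification made for this sketch).
    """
    classes = [AA_TO_CLASS[aa] for aa in sequence.upper() if aa in AA_TO_CLASS]
    counts = [0.0] * len(TRIADS)
    for i in range(len(classes) - 2):
        counts[TRIAD_INDEX[tuple(classes[i:i + 3])]] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def fuse(*feature_vectors):
    """Fuse feature vectors by concatenation (an assumed fusion rule)."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def leave_one_out(n_samples):
    """Yield (train_indices, test_index), holding out one sample per round."""
    for test in range(n_samples):
        yield [i for i in range(n_samples) if i != test], test


if __name__ == "__main__":
    ct = conjoint_triad("MKVLAAGIVRKDEC")  # made-up example sequence
    # Placeholder vectors standing in for real PseAAC and PsePSSM features;
    # their lengths are illustrative, not taken from the paper.
    pse_aac = [0.0] * 20
    pse_pssm = [0.0] * 40
    fused = fuse(pse_aac, pse_pssm, ct)
    print(len(ct), len(fused))
    for train_idx, test_idx in leave_one_out(3):
        print(train_idx, test_idx)
```

In a full pipeline, each fused vector would be passed to the SDAE for representation learning, and every leave-one-out round would train on the `train_idx` samples and predict the single held-out protein.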

