Protein function prediction method based on PPI network and machine learning
TANG Jiaqi1, WU Jingli1,2,3
1. School of Computer Science and Information Engineering, Guangxi Normal University, Guilin Guangxi 541001, China; 2. Guangxi Key Laboratory of Multi-Source Information Mining and Safety, Guangxi Normal University, Guilin Guangxi 541001, China; 3. Guangxi Regional Multi-Source Information Integration and Intelligent Processing Cooperation Innovation Center, Guilin Guangxi 541001, China
Abstract: Existing protein function prediction methods based on Protein-Protein Interaction (PPI) networks have low precision and are susceptible to data noise. To address this problem, a new machine learning method for protein function prediction named HPMM (HC, PCA and MLP based Method) was proposed, which combines Hierarchical Clustering (HC), Principal Component Analysis (PCA) and Multi-Layer Perceptron (MLP). HPMM considers proteins from both macro and micro perspectives: it incorporates information on protein families, domains and important sites into the vertex attributes of the PPI network to alleviate the effect of network data noise. Firstly, the features of functional modules and the principal components of the vertex attributes were extracted by using HC and PCA. Secondly, a mapping between multiple features and multiple functions, used to predict protein functions, was constructed by training the MLP model. Three Homo sapiens PPI networks, annotated with Molecular Function (MF), Biological Process (BP) and Cellular Component (CC) terms respectively, were adopted in the experiments. HPMM was compared with the Cosine Iterative Algorithm (CIA) and the Diffusing GO Terms in the Directed PPI Network (GoDIN) algorithm. The experimental results indicate that HPMM obtains higher precision and F-measure than CIA and GoDIN, which are purely PPI network based methods.
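The abstract reports precision and F-measure but does not define them at this point. As a rough illustration only, the sketch below assumes the per-protein, set-based definitions often used in function prediction evaluations: precision and recall are averaged over proteins, and the F-measure is their harmonic mean. The function name `precision_recall_f`, the protein IDs and the GO term assignments are hypothetical, and the exact formulas used for HPMM may differ.

```python
from typing import Dict, Set, Tuple


def precision_recall_f(
    predicted: Dict[str, Set[str]],
    actual: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Average per-protein precision and recall, then combine into F-measure.

    Assumption: proteins with no predicted terms are skipped when averaging
    precision; every protein with known annotations counts toward recall.
    """
    precisions = []
    recalls = []
    for protein, true_terms in actual.items():
        pred_terms = predicted.get(protein, set())
        hits = len(pred_terms & true_terms)  # correctly predicted terms
        if pred_terms:
            precisions.append(hits / len(pred_terms))
        if true_terms:
            recalls.append(hits / len(true_terms))
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: protein IDs and term assignments are illustrative only.
actual = {
    "P1": {"GO:0003677", "GO:0005515"},
    "P2": {"GO:0005515"},
}
predicted = {
    "P1": {"GO:0003677", "GO:0016301"},  # one correct term, one wrong term
    "P2": {"GO:0005515"},                # exact match
}

p, r, f = precision_recall_f(predicted, actual)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

On this toy data, P1 contributes a precision and recall of 0.5 each and P2 contributes 1.0 each, so all three averaged scores come out at 0.75.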
TANG Jiaqi, WU Jingli. Protein function prediction method based on PPI network and machine learning[J]. Journal of Computer Applications, 2018, 38(3): 722-727.