Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (3): 722-727. DOI: 10.11772/j.issn.1001-9081.2017082042


Protein function prediction method based on PPI network and machine learning

TANG Jiaqi1, WU Jingli1,2,3   

  1. School of Computer Science and Information Engineering, Guangxi Normal University, Guilin Guangxi 541001, China;
  2. Guangxi Key Laboratory of Multi-Source Information Mining and Safety, Guangxi Normal University, Guilin Guangxi 541001, China;
  3. Guangxi Regional Multi-Source Information Integration and Intelligent Processing Cooperation Innovation Center, Guilin Guangxi 541001, China
  • Received:2017-08-22 Revised:2017-11-06 Online:2018-03-10 Published:2018-03-07
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61363035, 61762015), the Natural Science Foundation of Guangxi (2015GXNSFAA139288), the "Bagui Scholars" Project, the Systematic Research Foundation of Guangxi Key Laboratory of Multi-source Information Mining and Safety (14-A-03-02, 15-A-03-02), and the Guangxi Graduate Education Innovation Program (XYCSZ2017067).

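  • Corresponding author: WU Jingli
  • About the authors: TANG Jiaqi (born 1992), male, from Hangzhou, Zhejiang, is a master's degree candidate whose research interests include bioinformatics and machine learning. WU Jingli (born 1978), female, from Bobai, Guangxi, is a professor with a Ph.D. and a CCF member whose research interests include bioinformatics and the design and analysis of algorithms.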

Abstract: Existing protein function prediction methods based on Protein-Protein Interaction (PPI) networks suffer from low precision and are susceptible to data noise. To address this problem, a machine learning method for protein function prediction named HPMM (HC, PCA and MLP based Method) was proposed, which combines Hierarchical Clustering (HC), Principal Component Analysis (PCA) and Multi-Layer Perceptron (MLP). HPMM considers protein information at both the macro and micro levels: it integrates protein family, domain and important site information into the PPI network as vertex attributes to alleviate the effect of noise in the network data. Firstly, function module features and attribute principal component features were extracted by using HC and PCA. Secondly, an MLP model was trained to construct a mapping between multiple features and multiple functions, which was then used to predict protein functions. Three Homo sapiens PPI networks, annotated by Molecular Function (MF), Biological Process (BP) and Cellular Component (CC) respectively, were adopted in the experiments, and HPMM was compared with the Cosine Iterative Algorithm (CIA) and the Diffusing GO Terms in the Directed PPI Network (GoDIN) algorithm. The experimental results indicate that HPMM achieves higher precision and F-measure than CIA and GoDIN, which are purely PPI network based methods.

Key words: function prediction, machine learning, Protein-Protein Interaction (PPI), Hierarchical Clustering (HC), Principal Component Analysis (PCA), Multi-Layer Perceptron (MLP)
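
The three stages named in the abstract (HC for function module features, PCA for attribute principal components, MLP for the multi-feature to multi-function mapping) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are random toy matrices, and the linkage method, the number of modules, the number of principal components, the hidden-layer size and all variable names are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_proteins, n_attrs, n_funcs = 60, 20, 3
n_modules, n_components = 6, 5  # assumed values, not taken from the paper

# Toy symmetric PPI adjacency matrix (random stand-in for a real network).
upper = np.triu(rng.random((n_proteins, n_proteins)) < 0.1, k=1)
adjacency = (upper | upper.T).astype(float)

# Toy binary vertex attributes (family / domain / important-site flags).
attributes = (rng.random((n_proteins, n_attrs)) < 0.3).astype(float)

# Toy multi-label annotations: one column per function term.
labels = (rng.random((n_proteins, n_funcs)) < 0.4).astype(int)

# Stage 1 (HC): cluster proteins by their interaction profiles and
# one-hot encode the resulting module membership.
tree = linkage(adjacency, method="average")  # assumed linkage choice
module_ids = fcluster(tree, t=n_modules, criterion="maxclust")  # ids start at 1
module_features = np.eye(n_modules)[module_ids - 1]

# Stage 2 (PCA): compress the vertex attributes into principal components.
attr_components = PCA(n_components=n_components).fit_transform(attributes)

# Stage 3 (MLP): learn a mapping from the combined features to functions.
features = np.hstack([module_features, attr_components])
train, test = slice(0, 45), slice(45, n_proteins)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(features[train], labels[train])
predicted = model.predict(features[test])  # binary indicator matrix
```

Because the labels form a binary indicator matrix, `MLPClassifier` treats the task as multi-label classification, which matches the "multi-feature to multi-function" framing; with random toy data the predictions carry no biological meaning.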

