Journal of Computer Applications, 2020, Vol. 40, Issue (5): 1510-1514. DOI: 10.11772/j.issn.1001-9081.2019111992

• Frontier & interdisciplinary applications •

Protein complex identification algorithm based on XGBoost and topological structural information

XU Zhoubo¹, YANG Jian¹, LIU Huadong¹˒², HUANG Wenwen¹

  1. Guangxi Key Laboratory of Trusted Software (Guilin University of Electronic Technology), Guilin, Guangxi 541004, China
  2. School of Mechanical and Electrical Engineering, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  • Received: 2019-11-25 Revised: 2020-01-19 Online: 2020-05-10 Published: 2020-05-15
  • Contact: LIU Huadong, born in 1978, M. S., lecturer. His research interests include representation of graph data.
  • About author:
    XU Zhoubo, born in 1976, Ph. D., professor. Her research interests include symbolic calculation, intelligent planning and constraint solving.
    YANG Jian, born in 1994, M. S. candidate. His research interests include graph data mining.
    LIU Huadong, born in 1978, M. S., lecturer. His research interests include representation of graph data.
    HUANG Wenwen, born in 1994, M. S. candidate. His research interests include pattern mining.
  • Supported by:

    This work is partially supported by the National Natural Science Foundation of China (61762027) and the Natural Science Foundation of Guangxi (2017GXNSFAA198172).

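  • Author biographies: XU Zhoubo (born 1976), female, from Fenghua, Zhejiang, professor, Ph. D., CCF senior member; research interests: symbolic computation, intelligent planning, constraint solving. YANG Jian (born 1994), male, from Suqian, Jiangsu, M. S. candidate; research interests: graph data mining. LIU Huadong (born 1978), male, from Ruijin, Jiangxi, lecturer, M. S.; research interests: graph data representation. HUANG Wenwen (born 1994), male, from Wuhu, Anhui, M. S. candidate; research interests: pattern mining.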

Abstract:

The large amount of uncertainty in Protein-Protein Interaction (PPI) networks and the incompleteness of known protein complex data limit the accuracy of methods that either search using topological structural information alone or apply supervised learning to the known complexes alone. To address this problem, a search method named XGBoost model for Predicting protein complex (XGBP) was proposed. Firstly, features were extracted from the topological structural information of complexes. Then, an XGBoost model was trained on the extracted features. Finally, topological structural information was combined with the supervised learning method to construct a mapping between features and protein complexes, so as to improve the accuracy of protein complex prediction. The algorithm was compared with eight popular unsupervised algorithms: Markov CLustering (MCL), Clustering based on Maximal Clique (CMC), Core-Attachment based method (COACH), Fast Hierarchical clustering algorithm for functional modules discovery in Protein Interaction (HC-PIN), Cluster with Overlapping Neighborhood Expansion (ClusterONE), Molecular COmplex DEtection (MCODE), Detecting Complex based on Uncertain graph model (DCU) and Weighted COACH (WCOACH); and with three supervised methods: Bayesian Network (BN), Support Vector Machine (SVM) and Regression Model (RM). The results show that the proposed algorithm performs well in terms of precision, sensitivity and F-measure.
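
As a rough illustration of the feature extraction step, the sketch below computes a few simple topological descriptors for a candidate subgraph of a PPI network. The abstract does not list the features used by XGBP, so the choice of descriptors (size, edge count, density, mean and maximum degree), the function name `topological_features` and the toy network `toy_ppi` are all assumptions made for this example. Feature dictionaries of this kind could then be turned into vectors and passed to a gradient-boosted classifier such as `xgboost.XGBClassifier`.

```python
def topological_features(adj, nodes):
    """Simple topological descriptors of the subgraph induced by `nodes`.

    adj: dict mapping each protein to the set of its interaction partners
         (assumed undirected, symmetric, and without self-loops).
    The feature set is illustrative only, not the one used in the paper.
    """
    nodes = set(nodes)
    n = len(nodes)
    # Degree of each protein, counting only edges inside the subgraph.
    degrees = [len(adj.get(v, set()) & nodes) for v in nodes]
    edges = sum(degrees) // 2          # each edge is counted twice
    possible = n * (n - 1) // 2        # edges in a complete graph on n nodes
    return {
        "size": n,
        "edges": edges,
        "density": edges / possible if possible else 0.0,
        "mean_degree": sum(degrees) / n if n else 0.0,
        "max_degree": max(degrees, default=0),
    }


# Hypothetical toy network: a triangle A-B-C with D attached to C.
toy_ppi = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

if __name__ == "__main__":
    print(topological_features(toy_ppi, {"A", "B", "C"}))
    print(topological_features(toy_ppi, {"A", "B", "C", "D"}))
```

Dense subgraphs such as the triangle score a density of 1, while adding the loosely attached protein D lowers it, which is the kind of signal a supervised model can learn to associate with known complexes.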

Key words: protein complex, XGBoost model, Protein-Protein Interaction (PPI) network, graph data mining, machine learning
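
The evaluation measures named in the abstract can be sketched as follows. The abstract does not say how a predicted complex is matched to a reference complex, so this example assumes a squared-overlap score with a threshold of 0.2, and it treats sensitivity as recall over the reference complexes. The function names and the threshold are assumptions for illustration; only the F-measure formula (harmonic mean of the two rates) is standard.

```python
def overlap_score(a, b):
    """Squared-overlap score |A ∩ B|^2 / (|A| * |B|) between two protein sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def evaluate(predicted, reference, threshold=0.2):
    """Return (precision, sensitivity, f_measure) for predicted complexes.

    A predicted complex counts as correct if it matches any reference
    complex with overlap_score >= threshold (an assumed matching rule);
    a reference complex counts as recovered under the same rule.
    """
    hit_pred = sum(
        any(overlap_score(p, r) >= threshold for r in reference) for p in predicted
    )
    hit_ref = sum(
        any(overlap_score(r, p) >= threshold for p in predicted) for r in reference
    )
    precision = hit_pred / len(predicted) if predicted else 0.0
    sensitivity = hit_ref / len(reference) if reference else 0.0
    total = precision + sensitivity
    f_measure = 2 * precision * sensitivity / total if total else 0.0
    return precision, sensitivity, f_measure


if __name__ == "__main__":
    predicted = [{"A", "B", "C"}, {"X", "Y"}]
    reference = [{"A", "B", "C", "D"}, {"E", "F"}]
    print(evaluate(predicted, reference))
```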

