Journal of Computer Applications, 2015, Vol. 35, Issue (6): 1637-1642. DOI: 10.11772/j.issn.1001-9081.2015.06.1637


Protein function prediction based on doubly indexed matrix

MENG Jun, ZHANG Xin   

  1. School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Received: 2015-01-13 Revised: 2015-04-03 Published: 2015-06-12

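  • Corresponding author: MENG Jun (born 1964), female, from Dalian, Liaoning; associate professor, Ph.D., CCF member; research interests: machine learning and data mining; mengjun@dlut.edu.cn
  • About the author: ZHANG Xin (born 1990), male, from Anqing, Anhui; master's student; research interests: machine learning and data mining.
  • Supported by:

    National Natural Science Foundation of China (61472061).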

Abstract:

A single data source cannot predict protein function effectively, and the information in protein interaction networks is incomplete. To address these problems, a Multi-Source Integration and Random Walk with Doubly Indexed Matrix (MSI-RWDIM) algorithm was proposed. The algorithm used protein sequence, gene expression and protein-protein interaction data to predict protein function. A weighted interaction network was constructed from each data source according to its characteristics. The weighted networks were then fused, and the fused network was combined with a function correlation network to construct a doubly indexed matrix. A random walk was then used to calculate annotation scores and predict protein function. Five-fold cross-validation experiments on a yeast dataset show that MSI-RWDIM achieves higher prediction accuracy, lower coverage and a lower loss rate of function labels. The results show that the overall performance of MSI-RWDIM is better than that of the commonly used k-nearest neighbor, transductive multi-label ensemble classifier and fast simultaneous weighting methods.

Key words: multi-source data integration, random walk, doubly indexed matrix, function correlation network, protein function prediction
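
The pipeline outlined in the abstract (fuse per-source weighted networks, combine the result with a function correlation network, then run a random walk to score annotations) can be sketched roughly as follows. This is a minimal illustration and not the authors' MSI-RWDIM: the convex-combination fusion, the restart probability, the update rule over a protein-by-function score matrix, and all function names are assumptions, since the abstract does not specify them.

```python
import numpy as np


def fuse_networks(networks, weights):
    """Fuse weighted adjacency matrices by a normalized weighted sum.

    Assumption: fusion is a plain convex combination; the abstract
    does not state how the paper combines its networks.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.zeros_like(np.asarray(networks[0], dtype=float))
    for w, adj in zip(weights, networks):
        fused += w * np.asarray(adj, dtype=float)
    return fused


def row_normalize(adj):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    adj = np.asarray(adj, dtype=float)
    row_sums = adj.sum(axis=1, keepdims=True)
    return adj / np.where(row_sums == 0.0, 1.0, row_sums)


def random_walk_scores(w_protein, w_function, labels,
                       restart=0.5, max_iter=200, tol=1e-10):
    """Random walk with restart over (protein, function) pairs.

    The score matrix is indexed by protein (rows) and function
    (columns). Each step averages a move along the protein network
    with a move along the function correlation network, then restarts
    at the known annotations with probability `restart`.
    Assumption: this update rule is illustrative only.
    """
    p = row_normalize(w_protein)    # protein-to-protein transitions
    f = row_normalize(w_function)   # function-to-function transitions
    y = np.asarray(labels, dtype=float)
    s = y.copy()
    for _ in range(max_iter):
        walk = 0.5 * (p @ s + s @ f.T)
        s_next = (1.0 - restart) * walk + restart * y
        if np.max(np.abs(s_next - s)) < tol:
            return s_next
        s = s_next
    return s
```

For an unannotated protein (an all-zero row in `labels`), the functions with the highest scores in its row would be taken as the predicted annotations; how many to keep is a thresholding choice the abstract does not describe.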
