Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (8): 2150-2156. DOI: 10.11772/j.issn.1001-9081.2016.08.2150

• The 6th China Conference on Data Mining (CCDM 2016) •

Protein subcellular multi-localization prediction based on three-layer ensemble multi-label learning

QIAO Shanping1,2,3, YAN Baoqiang4

  1. School of Management Science and Engineering, Shandong Normal University, Jinan Shandong 250014, China;
  2. School of Information Science and Engineering, University of Jinan, Jinan Shandong 250022, China;
  3. Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan Shandong 250022, China;
  4. School of Mathematical Sciences, Shandong Normal University, Jinan Shandong 250014, China
  • Received: 2016-03-15 Revised: 2016-03-29 Online: 2016-08-10 Published: 2016-08-10
  • Corresponding author: YAN Baoqiang
  • About the authors: QIAO Shanping (1971-), male, born in Liangshan, Shandong, associate professor, M.S. His research interests include machine learning, intelligent computing, and bioinformatics. YAN Baoqiang (1966-), male, born in Heze, Shandong, professor, Ph.D. His research interests include differential equations and dynamical systems.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61302128), the Natural Science Foundation of Shandong Province (ZR2013FL002), and the Science and Technology Foundation of University of Jinan (XKY1402).

Abstract: Multi-label learning and ensemble learning are not yet maturely applied to protein subcellular multi-localization prediction. To address this, a prediction method based on ensemble multi-label learning was studied. Firstly, from the perspective of combining multi-label learning with ensemble learning, a three-layer ensemble multi-label learning framework was proposed. The framework categorizes learning algorithms and classifiers hierarchically according to its three layers, and integrates binary classification learning, multi-class classification learning, multi-label learning and ensemble learning into a general-purpose three-layer ensemble multi-label learning model. Secondly, the system model was designed using object-oriented technology and Unified Modeling Language (UML), which gives the system good extensibility, so that its functions can be enhanced and its performance improved through extension. Finally, the model was extended in Java to implement a learning system, which was applied successfully to protein subcellular multi-localization prediction. Tests on a Gram-positive bacteria dataset verified the operability of the system's functions and showed good prediction performance. The proposed system can serve as an effective tool for predicting multiple subcellular localizations of proteins.

Key words: protein subcellular multi-localization prediction, multi-label learning, ensemble learning, object-oriented technology, Java
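
The layered idea summarized in the abstract can be illustrated with a minimal Java sketch. It is not the authors' implementation: the abstract does not specify what each layer contains, so the sketch assumes that layer 1 holds binary classifiers, layer 2 builds a multi-label classifier from them (here by binary relevance, one binary classifier per label), and layer 3 combines several multi-label classifiers by per-label majority vote. All class names, the threshold-based base learner, and the threshold values are hypothetical, and training is omitted entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Layer 1 (assumed): decides whether a single label applies. */
interface BinaryClassifier {
    boolean predict(double[] features);
}

/** Layers 2 and 3 (assumed): return one yes/no decision per label. */
interface MultiLabelClassifier {
    boolean[] predict(double[] features);
}

/** Hypothetical layer-1 learner: fires when one feature exceeds a fixed threshold. */
final class ThresholdClassifier implements BinaryClassifier {
    private final int featureIndex;
    private final double threshold;

    ThresholdClassifier(int featureIndex, double threshold) {
        this.featureIndex = featureIndex;
        this.threshold = threshold;
    }

    @Override
    public boolean predict(double[] features) {
        return features[featureIndex] > threshold;
    }
}

/** Layer 2: binary relevance, i.e. one independent binary classifier per label. */
final class BinaryRelevance implements MultiLabelClassifier {
    private final List<BinaryClassifier> perLabel;

    BinaryRelevance(List<BinaryClassifier> perLabel) {
        this.perLabel = perLabel;
    }

    @Override
    public boolean[] predict(double[] features) {
        boolean[] out = new boolean[perLabel.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = perLabel.get(i).predict(features);
        }
        return out;
    }
}

/** Layer 3: per-label strict majority vote over multi-label members. */
final class MajorityVoteEnsemble implements MultiLabelClassifier {
    private final List<MultiLabelClassifier> members;
    private final int numLabels;

    MajorityVoteEnsemble(List<MultiLabelClassifier> members, int numLabels) {
        this.members = members;
        this.numLabels = numLabels;
    }

    @Override
    public boolean[] predict(double[] features) {
        int[] votes = new int[numLabels];
        for (MultiLabelClassifier member : members) {
            boolean[] p = member.predict(features);
            for (int i = 0; i < numLabels; i++) {
                if (p[i]) votes[i]++;
            }
        }
        boolean[] out = new boolean[numLabels];
        for (int i = 0; i < numLabels; i++) {
            out[i] = 2 * votes[i] > members.size();
        }
        return out;
    }
}

final class ThreeLayerSketch {
    /** Builds three binary-relevance members over two labels (hypothetical thresholds). */
    static MultiLabelClassifier demoEnsemble() {
        List<MultiLabelClassifier> members = new ArrayList<>();
        for (double t : new double[] {0.3, 0.5, 0.7}) {
            List<BinaryClassifier> perLabel = Arrays.<BinaryClassifier>asList(
                    new ThresholdClassifier(0, t), new ThresholdClassifier(1, t));
            members.add(new BinaryRelevance(perLabel));
        }
        return new MajorityVoteEnsemble(members, 2);
    }

    public static void main(String[] args) {
        boolean[] labels = demoEnsemble().predict(new double[] {0.6, 0.2});
        System.out.println(Arrays.toString(labels));
    }
}
```

Because each layer depends only on the interface of the layer below, a different base learner or voting rule could be swapped in without touching the other layers, which is one way to read the extensibility goal stated in the abstract. In a protein localization setting, each label index would stand for one subcellular location, so a prediction with several true entries corresponds to a multi-location protein.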

CLC Number: