Journal of Computer Applications, 2015, Vol. 35, Issue (2): 374-377. DOI: 10.11772/j.issn.1001-9081.2015.02.0374

• Advanced computing •

microRNA identification method based on feature clustering and random subspace

RUI Zhiliang1, ZHU Yuquan1, GENG Xia1, CHEN Geng2

  1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang Jiangsu 212013, China;
  2. School of Technology, Nanjing Audit University, Nanjing Jiangsu 210029, China
  • Received: 2014-09-11 Revised: 2014-11-06 Online: 2015-02-10 Published: 2015-02-12
  • Corresponding author: RUI Zhiliang
  • About the authors: RUI Zhiliang (born 1990), male, from Nanjing, Jiangsu, master's degree candidate; research interests: data mining, bioinformatics. ZHU Yuquan (born 1966), male, from Changzhou, Jiangsu, professor, Ph.D.; research interests: pattern recognition, data mining, cloud computing. GENG Xia (born 1978), female, from Fenyang, Shanxi, lecturer, Ph.D. candidate; research interests: data mining, bioinformatics. CHEN Geng (born 1965), male, from Wuxi, Jiangsu, professor, Ph.D.; research interests: data mining.
  • Supported by:

    National Natural Science Foundation of China (71271117); Technology Innovation Fund for Science and Technology Enterprises of Jiangsu Province (BC2012201); Six Talent Peaks Project of Jiangsu Province (2013-WLW-005).

Abstract:

Current microRNA identification methods emphasize new features while overlooking features with weak classification ability and redundant features, which leads to poor or imbalanced sensitivity and specificity. To address this problem, an ensemble algorithm based on feature clustering and the random subspace method, named CLUSTER-RS, was proposed. After eliminating some features with weak classification ability using the information gain ratio, the algorithm used information entropy to measure the relevance between features and grouped the features into clusters. It then randomly selected the same number of features from each cluster to compose a feature set, which was used to train a base classifier; the base classifiers were finally combined into an ensemble for microRNA identification. After the algorithm was optimized by tuning parameters and selecting base classifiers, CLUSTER-RS was compared experimentally with five classic microRNA identification methods (Triplet-SVM, miPred, MiPred, microPred and HuntMi) on the latest microRNA dataset. CLUSTER-RS was inferior only to microPred in sensitivity, achieved the best specificity of the six methods, and also had an advantage in the overall indicators of accuracy and Matthews correlation coefficient. The results show that CLUSTER-RS achieves good identification performance and a better balance between sensitivity and specificity than the compared methods.

Key words: microRNA identification, classification ability, feature clustering, random subspace, relevance
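The sampling step and the evaluation indicators named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature clusters are taken as given (the information gain ratio filter and the entropy-based clustering are omitted), plurality voting is assumed as the combination rule because the abstract does not state one, and the names `sample_subspace`, `majority_vote` and `evaluate` are hypothetical.

```python
import math
import random
from collections import Counter


def sample_subspace(clusters, k, rng):
    """Draw k distinct features at random from every feature cluster.

    clusters: list of lists of feature indices (assumed to be precomputed).
    Returns one feature subset, as would be used to train one base classifier.
    """
    subspace = []
    for cluster in clusters:
        subspace.extend(rng.sample(cluster, k))
    return subspace


def majority_vote(votes):
    """Combine the labels predicted by the base classifiers for one sample.

    Plurality voting is an assumption; the abstract does not give the rule.
    """
    return Counter(votes).most_common(1)[0][0]


def evaluate(y_true, y_pred):
    """Sensitivity, specificity, accuracy and Matthews correlation
    coefficient for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


if __name__ == "__main__":
    rng = random.Random(0)
    # Three hypothetical feature clusters, identified by feature index.
    clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
    # One feature subset per base classifier, two features from each cluster.
    subspaces = [sample_subspace(clusters, 2, rng) for _ in range(5)]
    print(subspaces)
    print(evaluate([1, 1, 0, 0], [1, 0, 0, 0]))
```

Drawing the same number of features from each cluster means that every base classifier sees every group of related features while few of its features are redundant with one another, which is the motivation the abstract gives for clustering before sampling.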

CLC number: