microRNA identification method based on feature clustering and random subspace

doi:10.11772/j.issn.1001-9081.2015.02.0374

Journal of Computer Applications, 2015, Vol. 35, Issue (2): 374-377.

RUI Zhiliang¹, ZHU Yuquan¹, GENG Xia¹, CHEN Geng²

1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China;
2. School of Technology, Nanjing Audit University, Nanjing, Jiangsu 210029, China

Received: 2014-09-11 Revised: 2014-11-06 Online: 2015-02-12 Published: 2015-02-10
Corresponding author: RUI Zhiliang
About the authors: RUI Zhiliang (born 1990), male, from Nanjing, Jiangsu, master's student; research interests: data mining, bioinformatics. ZHU Yuquan (born 1966), male, from Changzhou, Jiangsu, professor, Ph.D.; research interests: pattern recognition, data mining, cloud computing. GENG Xia (born 1978), female, from Fenyang, Shanxi, lecturer, Ph.D. candidate; research interests: data mining, bioinformatics. CHEN Geng (born 1965), male, from Wuxi, Jiangsu, professor, Ph.D.; research interests: data mining.
Supported by:
National Natural Science Foundation of China (71271117); Jiangsu Province Technology Innovation Fund for Science and Technology Enterprises (BC2012201); Jiangsu Province "Six Talent Peaks" Project (2013-WLW-005).

Abstract

Current microRNA identification methods tend to emphasize new features while overlooking features with weak classification ability and redundant features, which leads to poor or imbalanced sensitivity and specificity. To address this problem, an ensemble algorithm based on feature clustering and the random subspace method, named CLUSTER-RS, was proposed. After eliminating some features with weak classification ability using the information gain ratio, the algorithm used information entropy to measure the relevance between features and grouped the features into clusters. It then randomly selected the same number of features from each cluster to compose a feature set, which was used to train a base classifier; the base classifiers were finally combined into an ensemble for microRNA identification. After the algorithm was optimized by tuning parameters and selecting base classifiers, CLUSTER-RS was compared experimentally with five classic microRNA identification methods (Triplet-SVM, miPred, MiPred, microPred and HuntMi) on the latest microRNA dataset. CLUSTER-RS was inferior only to microPred in sensitivity, achieved the best specificity among the six methods, and also had an advantage in the overall measures of accuracy and Matthews correlation coefficient. The experimental results show that CLUSTER-RS achieves good identification performance and strikes a better balance between sensitivity and specificity than the compared methods.
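The feature-selection steps summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: features are assumed to be already discretized, symmetric uncertainty is used as one possible entropy-based relevance measure, the greedy threshold clustering is a stand-in for whatever clustering the paper uses, and all function names and thresholds (`gr_threshold`, `su_threshold`, `per_cluster`) are hypothetical.

```python
import numpy as np

def entropy(x):
    """Shannon entropy (in bits) of a discrete 1-D array."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete arrays."""
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    joint = float(-(p * np.log2(p)).sum())
    return entropy(x) + entropy(y) - joint

def gain_ratio(x, y):
    """Information gain of feature x about labels y, divided by H(x)."""
    hx = entropy(x)
    return mutual_info(x, y) / hx if hx > 0 else 0.0

def symmetric_uncertainty(x, y):
    """Assumed relevance measure between two features, in [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 2.0 * mutual_info(x, y) / denom if denom > 0 else 0.0

def cluster_features(X, keep, su_threshold):
    """Greedy clustering (an assumption): a feature joins the first cluster
    whose first member it is strongly related to, else starts a new cluster."""
    clusters = []
    for j in keep:
        for c in clusters:
            if symmetric_uncertainty(X[:, j], X[:, c[0]]) >= su_threshold:
                c.append(j)
                break
        else:
            clusters.append([j])
    return clusters

def sample_subspace(clusters, per_cluster, rng):
    """Draw the same number of features at random from every cluster."""
    chosen = []
    for c in clusters:
        k = min(per_cluster, len(c))
        chosen.extend(rng.choice(c, size=k, replace=False).tolist())
    return sorted(chosen)

def cluster_rs_subspaces(X, y, gr_threshold, su_threshold,
                         per_cluster, n_models, seed=0):
    """Filter weak features, cluster the rest, and draw one subspace per model."""
    rng = np.random.default_rng(seed)
    keep = [j for j in range(X.shape[1])
            if gain_ratio(X[:, j], y) >= gr_threshold]
    clusters = cluster_features(X, keep, su_threshold)
    subspaces = [sample_subspace(clusters, per_cluster, rng)
                 for _ in range(n_models)]
    return clusters, subspaces
```

Each returned subspace would then be used to train one base classifier on the corresponding columns of `X`, with the base classifiers' predictions combined (for example by majority vote; the abstract does not state the combination rule).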

Key words: microRNA identification, classification ability, feature clustering, random subspace, relevance
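The four performance measures named in the abstract (sensitivity, specificity, accuracy and Matthews correlation coefficient) have standard definitions in terms of confusion-matrix counts. The sketch below computes them; the function name `binary_metrics` and the counts in the usage lines are illustrative assumptions and are not results from the paper.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: real pre-miRNAs predicted as real / as pseudo.
    tn/fp: pseudo hairpins predicted as pseudo / as real.
    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"SE": sensitivity, "SP": specificity, "ACC": accuracy, "MCC": mcc}

# Illustrative counts only (not taken from the paper).
example = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(example)
```

Because the Matthews correlation coefficient uses all four counts, it penalizes a model that gains sensitivity at the expense of specificity (or the reverse), which is why it is a natural overall measure when the goal is a balance between the two.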

CLC number: TP301.6

RUI Zhiliang, ZHU Yuquan, GENG Xia, CHEN Geng. microRNA identification method based on feature clustering and random subspace[J]. Journal of Computer Applications, 2015, 35(2): 374-377.

