microRNA identification method based on feature clustering and random subspace
RUI Zhiliang1, ZHU Yuquan1, GENG Xia1, CHEN Geng2
1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang Jiangsu 212013, China;
2. School of Technology, Nanjing Audit University, Nanjing Jiangsu 210029, China
Current microRNA identification methods often achieve unsatisfactory or imbalanced sensitivity and specificity, because they emphasize introducing new features while ignoring the weak classification ability and redundancy of features. To address this problem, an ensemble algorithm based on feature clustering and the random subspace method, named CLUSTER-RS, was proposed. After eliminating features with weak classification ability using information ratio, the algorithm used information entropy to measure the relevance between features and grouped the features into clusters. It then randomly selected the same number of features from each cluster to compose a feature set, which was used to train a base classifier; the base classifiers together constituted the final identification model. After the algorithm was optimized by tuning its parameters and selecting base classifiers, CLUSTER-RS was compared experimentally with five classic microRNA identification methods (Triplet-SVM, miPred, MiPred, microPred and HuntMi) on the latest microRNA dataset. CLUSTER-RS was second only to microPred in sensitivity, performed best in specificity, and also had an advantage in accuracy and Matthews correlation coefficient. The experimental results show that CLUSTER-RS achieves good performance and balances sensitivity and specificity better than the competing methods.
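The sampling step and the evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the feature clusters have already been computed and are given as lists of feature indices. The function names (`sample_subspaces`, `majority_vote`, `evaluate`), the cap at the cluster size, and the use of majority voting to combine base classifiers are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def sample_subspaces(clusters, per_cluster, n_classifiers, seed=0):
    """Draw one feature subset per base classifier (hypothetical helper).

    clusters: list of lists of feature indices, one list per feature cluster.
    per_cluster: number of features drawn at random from every cluster.
    """
    rng = random.Random(seed)
    subspaces = []
    for _ in range(n_classifiers):
        subset = []
        for cluster in clusters:
            # Same number of features from each cluster.
            # Assumption: cap at the cluster size for small clusters.
            k = min(per_cluster, len(cluster))
            subset.extend(rng.sample(cluster, k))
        subspaces.append(sorted(subset))
    return subspaces


def majority_vote(predictions):
    """Combine the 0/1 votes of the base classifiers for one sample.

    Assumption: simple majority voting; ties are resolved as negative.
    """
    return 1 if 2 * sum(predictions) > len(predictions) else 0


def evaluate(y_true, y_pred):
    """Return sensitivity, specificity, accuracy and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


if __name__ == "__main__":
    # Toy clusters of feature indices, purely illustrative.
    toy_clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
    for subset in sample_subspaces(toy_clusters, per_cluster=2, n_classifiers=3):
        print(subset)
    print(evaluate([1, 1, 0, 0], [1, 0, 0, 0]))
```

Each subset returned by `sample_subspaces` would be used to train one base classifier on the corresponding feature columns. Because every cluster contributes to every subset, each base classifier sees a representative from every group of mutually relevant features, which is the stated motivation for sampling per cluster rather than from the full feature set.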