Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (2): 421-426.DOI: 10.11772/j.issn.1001-9081.2017061609


Mutual information maximum value filter criteria combined with particle swarm optimization algorithm in feature gene selection for tumor classification

YU Dekuang, YANG Yi   

  1. School of Biomedical Engineering, Southern Medical University, Guangzhou Guangdong 510515, China
  • Received: 2017-06-29 Revised: 2017-08-23 Online: 2018-02-10 Published: 2018-02-10
  • Supported by:
    This work is partially supported by the Science and Technology Planning Project of Guangdong Province (2014A020212545, 2013B051000054) and the Key Platforms and Research Projects of Universities in Guangdong Province (2016GXJK021).


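  • Corresponding author: YANG Yi (杨谊)
  • About the authors: YU Dekuang (喻德旷, born 1972), male, from Nanchang, Jiangxi; associate professor, Ph.D.; main research interests: medical signal simulation and medical information processing. YANG Yi (杨谊, born 1973), female, from Heyuan, Guangdong; associate professor, Ph.D.; main research interest: information system design.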

Abstract: Gene data are characterized by small sample sizes, high dimensionality and high redundancy, which easily lead to the "curse of dimensionality" and over-fitting in feature gene selection. To overcome these obstacles, a feature gene selection algorithm named Mutual Information Maximum Value Filter Criteria-Inertia Weight Particle Swarm Optimization (MIMVFC-IWPSO) was proposed. Firstly, the interaction between genes was calculated from newly defined gene-category and gene-gene feature entropies, and a Feature Gene Candidate Subset (FGCS) was obtained by the Mutual Information Maximum Value Filter Criteria (MIMVFC), which narrowed the scope of the classification operations and increased the probability that feature genes were covered. Secondly, the Particle Swarm Optimization (PSO) algorithm was reconstructed into Inertia Weight Particle Swarm Optimization (IWPSO) by introducing a self-adjusting inertia weight, which gave the algorithm strong global optimization ability in the early stage of iteration and strong local search ability in the later stage. Lastly, the Core Feature Gene Subset (CFGS) was extracted from the FGCS by IWPSO and used to classify samples into tumor and normal classes. Experiments were carried out on three public tumor gene expression datasets. Compared with four popular filter methods, MIMVFC achieved a higher correct classification rate than the methods based on Signal-to-Noise Ratio (SNR), t-statistic and Information Gain (IG), and performed nearly the same as the Chi-Square method, while the proposed method still had the subsequent optimization step to improve its results further. For the same FGCS, compared with BPSO-CGA (Binary Particle Swarm Optimization and Combat Genetic Algorithm), an algorithm with good performance, IWPSO obtained a smaller CFGS and a higher accuracy at the cost of slightly increased time consumption; compared with classic PSO, IWPSO obtained a smaller CFGS with less time consumption and a higher accuracy. The simulation results show that MIMVFC-IWPSO has good overall classification performance in terms of both accuracy and efficiency, is feasible and effective for feature gene selection across multiple types of tumors, and can be used to help guide the design and validation of molecular biology experiments.
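
The filter stage can be illustrated with a small sketch. The abstract does not give the definitions of the gene-category and gene-gene feature entropies or the exact MIMVFC rule, so the code below substitutes a generic greedy "relevance minus redundancy" criterion based on Shannon mutual information. The function names, the toy genes `g1` to `g4` and the scoring rule are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two equal-length discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def select_candidates(genes, labels, k):
    """Greedily pick k genes (hypothetical criterion, not MIMVFC itself).

    Each step adds the gene with the largest relevance I(gene; label)
    minus its mean redundancy I(gene; already chosen gene).
    """
    relevance = {g: mutual_information(v, labels) for g, v in genes.items()}
    chosen, remaining = [], set(genes)

    def score(g):
        if not chosen:
            return relevance[g]
        redundancy = sum(mutual_information(genes[g], genes[c]) for c in chosen)
        return relevance[g] - redundancy / len(chosen)

    while remaining and len(chosen) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy, already-discretized expression levels for 8 samples (assumed data).
labels = [0, 0, 0, 0, 1, 1, 1, 1]          # 0 = normal, 1 = tumor
genes = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],        # informative
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],        # exact copy of g1 (redundant)
    "g3": [1, 1, 0, 0, 1, 1, 1, 1],        # informative, differs from g1
    "g4": [0, 1, 0, 1, 0, 1, 0, 1],        # independent of the labels
}
candidates = select_candidates(genes, labels, k=2)
print(candidates)
```

In this toy case a ranking by relevance alone would keep both `g1` and its copy `g2`; penalizing gene-gene redundancy selects `g3` as the second candidate instead, which is the kind of effect a gene-gene term in a filter criterion is meant to achieve.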

Key words: tumor feature gene selection, Mutual Information Maximum Value Filter Criteria (MIMVFC), Particle Swarm Optimization (PSO) algorithm, Feature Gene Candidate Subset (FGCS), Core Feature Gene Subset (CFGS)

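Abstract (Chinese version): The small sample size, high dimensionality and high redundancy of gene data often lead to the "curse of dimensionality" and over-fitting in feature gene selection. To address this problem, a feature gene extraction algorithm, the Mutual Information Maximum Value Filter Criteria-Inertia Weight Particle Swarm Optimization (MIMVFC-IWPSO) algorithm, is proposed. First, following the idea of filter methods, mutual information measures are computed and the Feature Gene Candidate Subset (FGCS) is obtained according to the Mutual Information Maximum Value Filter Criteria (MIMVFC), which narrows the scope of the classification operations and raises the probability that feature genes are covered. Next, the Particle Swarm Optimization (PSO) algorithm is improved by introducing an inertia weight, yielding the self-adjusting variable Inertia Weight Particle Swarm Optimization (IWPSO) algorithm, which has fast global optimization ability in the early iterations and strong local search ability in the later stage. Finally, IWPSO is used to extract the Core Feature Gene Subset (CFGS) from the FGCS, and samples are classified as tumor or normal tissue based on the CFGS. Experiments on three public tumor gene expression profile datasets show that the correct classification rate of MIMVFC is better than that of the Signal-to-Noise Ratio (SNR), t-test and Information Gain (IG) methods and close to that of the Chi-Square method, while MIMVFC can further improve its results through IWPSO. For the same FGCS, compared with Binary Particle Swarm Optimization and Combat Genetic Algorithm (BPSO-CGA), which currently performs well, IWPSO takes somewhat more computation time but obtains a smaller CFGS and a higher accuracy; compared with classic PSO, it obtains a smaller CFGS with less computation time and a higher accuracy. The experimental results show that MIMVFC-IWPSO has good overall classification performance, effectively improves accuracy and efficiency, can be used for feature gene selection for multiple types of tumors, and can help guide the design and validation of molecular biology experiments.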
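
The role of the inertia weight can be sketched on a toy problem. The abstract does not state the self-adjusting rule used in IWPSO, so the code below swaps in the widely used linearly decreasing schedule (a large weight early for global exploration, a small weight late for local refinement) and applies it to a continuous test function rather than to gene subsets. The names `inertia_weight`, `pso_minimize` and `sphere`, and all parameter values, are assumptions chosen for illustration.

```python
import random


def inertia_weight(t, n_iter, w_max=0.9, w_min=0.4):
    """Linearly decreasing inertia weight: w_max at t=0, w_min at the last step."""
    if n_iter <= 1:
        return w_min
    return w_max - (w_max - w_min) * t / (n_iter - 1)


def pso_minimize(f, dim, n_particles=30, n_iter=200, bound=5.0,
                 c1=1.5, c2=1.5, seed=0):
    """Minimal PSO with a time-varying inertia weight (illustrative only)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]  # best value found so far, per iteration

    for t in range(n_iter):
        w = inertia_weight(t, n_iter)
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp velocity and position to the search box.
                vel[i][d] = max(-bound, min(bound, v))
                pos[i][d] = max(-bound, min(bound, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_pos, best_val, history = pso_minimize(sphere, dim=2)
print(best_val)
```

For feature gene selection, each particle would instead encode a subset of the FGCS and the objective would combine classification accuracy with subset size; that encoding is not described in the abstract and is left out of this sketch.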

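Key words (Chinese version): tumor feature gene selection, Mutual Information Maximum Value Filter Criteria, Particle Swarm Optimization algorithm, Feature Gene Candidate Subset, Core Feature Gene Subset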
