Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (08): 2194-2197.

• Database technology •

Robust feature selection method in high-dimensional data mining

LI Zhean1, CHEN Jianping1, ZHANG Yajuan1, ZHAO Weihua2

  1. College of Computer Science and Technology, Nantong University, Nantong Jiangsu 226019, China
  2. College of Science, Nantong University, Nantong Jiangsu 226019, China
  • Received: 2013-03-11 Revised: 2013-05-06 Online: 2013-09-11 Published: 2013-08-01
  • Contact: LI Zhean
  • About the authors: LI Zhean (李泽安, born 1977), female, from Nantong, Jiangsu; lecturer, M.S., CCF member (E200027409M); research interest: data mining.
    CHEN Jianping (陈建平, born 1960), male, from Nantong, Jiangsu; professor; research interest: data analysis.
    ZHANG Yajuan (章雅娟, born 1977), female, from Baiyin, Gansu; lecturer, M.S.; research interest: data mining.
    ZHAO Weihua (赵为华, born 1978), male, from Haimen, Jiangsu; lecturer, Ph.D.; research interest: statistics.
  • Supported by:

    Natural Science Foundation of Xinglin College, Nantong University; Natural Science Foundation of Nantong University

Abstract: High-dimensional data often contain more variables than observations and are frequently heterogeneous. To address these characteristics, a robust and effective feature selection method is proposed that combines modal regression with variable selection as a dimension-reduction technique. An estimation algorithm based on Local Quadratic Approximation (LQA) and the Expectation-Maximization (EM) algorithm is given, together with a method for selecting the optimal tuning parameter. Simulation results show that the proposed method is better overall than regularized estimation methods based on least squares and median regression. Compared with these existing methods, it has higher prediction ability and stronger robustness, especially when the error distribution is non-normal.

Key words: high-dimensional data, feature selection, modal regression, adaptive Least Absolute Shrinkage and Selection Operator (LASSO), Expectation-Maximization (EM) algorithm
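
The abstract names the ingredients of the estimator (modal regression, an adaptive LASSO penalty, LQA, and EM) but does not give the estimating equations. The following is only a minimal sketch of how these pieces are commonly combined, not the authors' algorithm. It assumes a Gaussian-kernel modal regression objective, an EM-type reweighting step, and a damped local quadratic approximation of the adaptive LASSO penalty. The function name `modal_adaptive_lasso`, the parameters `lam` (tuning parameter) and `h` (kernel bandwidth), the ridge-stabilised starting value, and the simulated data are all hypothetical choices made for illustration. Tuning-parameter selection is not shown.

```python
import numpy as np


def modal_adaptive_lasso(X, y, lam=0.05, h=1.0, n_iter=500,
                         tol=1e-8, eps=1e-6, zero_tol=1e-4):
    """Sketch: penalised modal linear regression via EM-type updates and LQA.

    Assumed objective (not taken from the paper): maximise the Gaussian-kernel
    modal criterion sum_i phi_h(y_i - x_i' beta) with an adaptive LASSO
    penalty lam * sum_j w_j * |beta_j|, where constants are absorbed into lam.
    """
    n, p = X.shape
    # Ridge-stabilised least-squares start, so it is also defined when p > n.
    beta = np.linalg.solve(X.T @ X + 1e-3 * np.eye(p), X.T @ y)
    # Adaptive LASSO weights from the initial estimate.
    w = 1.0 / (np.abs(beta) + eps)

    for _ in range(n_iter):
        r = y - X @ beta
        # E-step: normalised Gaussian-kernel weights of the current residuals.
        log_k = -0.5 * (r / h) ** 2
        pi = np.exp(log_k - log_k.max())
        pi /= pi.sum()
        # M-step: weighted least squares, with |beta_j| replaced by its local
        # quadratic approximation beta_j**2 / (2 * |beta_j_old|) (damped by eps).
        D = np.diag(w / (np.abs(beta) + eps))
        XtW = X.T * pi  # scales column i of X.T by pi[i]
        beta_new = np.linalg.solve(XtW @ X + lam * D, XtW @ y)
        step = np.max(np.abs(beta_new - beta))
        beta = beta_new
        if step < tol:
            break

    # Coefficients driven close to zero by the penalty are set exactly to zero.
    return np.where(np.abs(beta) < zero_tol, 0.0, beta)


# Illustrative simulated data with heavy-tailed (Student t, 3 df) errors.
rng = np.random.default_rng(0)
n, p = 400, 8
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_t(3, size=n)

beta_hat = modal_adaptive_lasso(X, y)
print(np.round(beta_hat, 3))
```

In this sketch the kernel weights downweight observations with large residuals, which is where the robustness to non-normal errors comes from, while the quadratic approximation turns each penalised update into a ridge-type linear solve.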

CLC Number: