Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (3): 871-875. DOI: 10.11772/j.issn.1001-9081.2017.03.871

• Data Science and Technology •


Robust feature selection and classification algorithm based on partial least squares regression

SHANG Zhigang, DONG Yonghui, LI Mengmeng, LI Zhihui   

  1. College of Electrical Engineering, Zhengzhou University, Zhengzhou Henan 450001, China
  • Received: 2016-08-05  Revised: 2016-10-18  Online: 2017-03-10  Published: 2017-03-22
  • Corresponding author: LI Zhihui
  • About the authors: SHANG Zhigang (1975-), male, born in Lanzhou, Gansu, associate professor, Ph. D., research interests: data mining and signal processing; DONG Yonghui (1993-), female, born in Suzhou, Anhui, M. S. candidate, research interests: signal processing and pattern recognition; LI Mengmeng (1990-), male, born in Shangqiu, Henan, M. S. candidate, research interests: image processing and feature selection; LI Zhihui (1978-), female, born in Puyang, Henan, lecturer, Ph. D., research interests: signal processing and pattern recognition.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (U1304602, 61473266, 61305080) and the Key Scientific Research Project of Higher Education Institutions of Henan Province (15A120016).


Abstract: A Robust Feature Selection and Classification algorithm based on Partial Least Squares Regression (RFSC-PLSR) was proposed to solve the problems of redundancy and multicollinearity among features in feature selection. Firstly, a sample class consistency coefficient based on neighborhood estimation was defined. Then, conservative samples, whose local class distribution structure remains stable under different k-Nearest Neighbors (kNN) operations, were selected, and a partial least squares regression model was built on them to perform robust feature selection. Finally, from a global structure perspective, a partial least squares classification model was built using the class consistency coefficients and the selected feature subset of all samples. Five data sets of different dimensions were selected from the UCI database for numerical experiments. The experimental results show that, compared with four typical classifiers, namely Support Vector Machine (SVM), Naive Bayes (NB), Back-Propagation Neural Network (BPNN) and Logistic Regression (LR), RFSC-PLSR is strongly competitive in classification accuracy, robustness and computational efficiency in low-, medium- and high-dimensional cases.
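The following is a minimal, illustrative sketch of the pipeline described in the abstract, built on scikit-learn's NearestNeighbors and PLSRegression. The neighborhood sizes in K_LIST, the consistency threshold, ranking features by the magnitude of the PLSR coefficients, and weighting the one-hot class targets by the consistency coefficient are assumptions made for illustration only; the helper names (class_consistency, rfsc_plsr) are hypothetical and the paper's actual definitions and parameter settings may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cross_decomposition import PLSRegression

K_LIST = (3, 5, 7)           # assumed set of kNN neighborhood sizes
CONSISTENCY_THRESHOLD = 0.8  # assumed stability threshold for "conservative" samples
N_COMPONENTS = 2             # assumed number of PLS components
N_FEATURES_KEPT = 10         # assumed size of the selected feature subset


def class_consistency(X, y, k):
    """Fraction of each sample's k nearest neighbors that share its class label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop the query sample itself
    return (y[idx] == y[:, None]).mean(axis=1)


def rfsc_plsr(X, y):
    # Steps 1-2: compute consistency under several kNN operations and keep the
    # "conservative" samples whose local class distribution stays stable.
    coeffs = np.column_stack([class_consistency(X, y, k) for k in K_LIST])
    conservative = (coeffs >= CONSISTENCY_THRESHOLD).all(axis=1)

    # One-hot encode labels so PLS regression can act as a classifier (PLS-DA style).
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)

    # Step 3: robust feature selection -- fit PLSR on the conservative samples only
    # and rank features by the magnitude of their regression coefficients (assumed).
    pls_fs = PLSRegression(n_components=N_COMPONENTS).fit(X[conservative], Y[conservative])
    coef = pls_fs.coef_
    if coef.shape[0] != X.shape[1]:  # coef_ orientation differs across sklearn versions
        coef = coef.T
    selected = np.argsort(np.abs(coef).sum(axis=1))[::-1][:N_FEATURES_KEPT]

    # Step 4: global PLS classification model on all samples, restricted to the
    # selected features; the consistency coefficient is used here to soft-weight
    # the one-hot targets (an assumption -- the abstract does not say how it enters).
    Y_weighted = Y * coeffs.mean(axis=1, keepdims=True)
    pls_clf = PLSRegression(n_components=N_COMPONENTS).fit(X[:, selected], Y_weighted)
    return selected, pls_clf, classes


def predict(pls_clf, classes, selected, X_new):
    """Assign each new sample to the class with the largest predicted PLS score."""
    scores = pls_clf.predict(X_new[:, selected])
    return classes[np.argmax(scores, axis=1)]
```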

Key words: Partial Least Squares Regression (PLSR), k-Nearest Neighbors (kNN), noise sample, feature selection, robustness

CLC number: