Incremental feature selection algorithm for data stream classification

Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (9): 2321-2323.

Li Min (李敏)¹, Wang Yong (王勇)², Cai Lijun (蔡立军)³

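1. School of Science, Northwestern Polytechnical University
2. School of Computer Science, Northwestern Polytechnical University
3. School of Science, Northwestern Polytechnical University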

Received: 2010-03-12 Revised: 2010-05-04 Online: 2010-09-03 Published: 2010-09-01
Corresponding author: Li Min (李敏)
Funding: National Natural Science Foundation of China

Abstract

Abstract: Concept drift and high dimensionality make feature selection on data streams more complex. Information gain is one of the most effective feature selection criteria, but it is expensive to compute. The authors substitute an equivalent expression for information gain and propose a hybrid incremental feature selection (IFS) algorithm based on this improved information gain. The algorithm first selects a candidate feature set using a classifier-independent evaluation function. It then applies the classifier to the candidate set and uses classification accuracy as the criterion for choosing the feature subset. When concept drift is encountered, the feature subset is selected again. Experiments on the moving hyperplane data set and UCI data sets show that a classifier based on IFS adapts quickly to concept drift and achieves higher accuracy than classification using all features.

Key words: data stream classification, information gain, incremental feature selection (IFS), concept drift
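
The three stages described in the abstract (classifier-independent filter, accuracy-driven wrapper, reselection on drift) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' IFS implementation: the abstract does not give the improved information gain formula, so the sketch uses standard information gain computed from running counts; the lookup-table classifier, the accuracy-drop drift test with threshold `drop`, and every name in the code are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product


class IncrementalInfoGain:
    """Running counts over discrete features; standard information gain on demand."""

    def __init__(self, n_features):
        self.n = 0
        self.class_counts = Counter()
        # joint[j][(value, label)] = number of instances seen so far
        self.joint = [Counter() for _ in range(n_features)]

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for j, v in enumerate(x):
            self.joint[j][(v, y)] += 1

    @staticmethod
    def _entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    def gain(self, j):
        by_value = defaultdict(list)
        for (v, _), c in self.joint[j].items():
            by_value[v].append(c)
        cond = sum(sum(cs) / self.n * self._entropy(cs) for cs in by_value.values())
        return self._entropy(self.class_counts.values()) - cond

    def top_k(self, k):
        """Filter stage: the k features with the highest information gain."""
        return sorted(range(len(self.joint)), key=self.gain, reverse=True)[:k]


def accuracy(train, test, features):
    """Accuracy of a lookup-table classifier (an assumed stand-in) on `features`."""
    table = defaultdict(Counter)
    for x, y in train:
        table[tuple(x[j] for j in features)][y] += 1
    default = Counter(y for _, y in train).most_common(1)[0][0]
    hits = 0
    for x, y in test:
        votes = table.get(tuple(x[j] for j in features))
        pred = votes.most_common(1)[0][0] if votes else default
        hits += pred == y
    return hits / len(test)


def wrapper_select(train, test, candidates):
    """Wrapper stage: greedily keep a candidate only if accuracy improves."""
    selected, best = [], accuracy(train, test, [])
    for j in candidates:
        score = accuracy(train, test, selected + [j])
        if score > best:
            selected, best = selected + [j], score
    return selected, best


def select_features(window, n_features, k):
    """Filter then wrapper on one window of (x, y) pairs."""
    ig = IncrementalInfoGain(n_features)
    for x, y in window:
        ig.update(x, y)
    half = len(window) // 2
    return wrapper_select(window[:half], window[half:], ig.top_k(k))


def run_stream(chunks, n_features, k, drop=0.2):
    """Reselect features when chunk accuracy falls more than `drop` (assumed test)."""
    features, baseline, prev, history = None, 0.0, None, []
    for chunk in chunks:
        if features is not None and baseline - accuracy(prev, chunk, features) > drop:
            features = None  # treat the accuracy drop as concept drift
        if features is None:
            features, baseline = select_features(chunk, n_features, k)
        history.append(list(features))
        prev = chunk
    return history


def make_window(target, repeats=4):
    """Toy data: 3 binary features, label equals feature `target`."""
    rows = list(product([0, 1], repeat=3)) * repeats
    return [(x, x[target]) for x in rows]


# The concept changes from "label = feature 0" to "label = feature 2".
chunks = [make_window(0), make_window(0), make_window(2), make_window(2)]
print(run_stream(chunks, n_features=3, k=2))  # [[0], [0], [2], [2]]
```

On this toy stream the selected subset switches from feature 0 to feature 2 at the third chunk, where the concept changes. One general observation: the class entropy H(C) is the same for every feature, so ranking features by information gain gives the same order as ranking them by lowest conditional entropy H(C|X). Whether this is the equivalent substitution the authors use is not stated in the abstract.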

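Li Min, Wang Yong, Cai Lijun. Incremental feature selection algorithm for data stream classification[J]. Journal of Computer Applications, 2010, 30(9): 2321-2323.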
