Journal of Computer Applications, 2015, Vol. 35, Issue (8): 2355-2359. DOI: 10.11772/j.issn.1001-9081.2015.08.2355

Fast unsupervised feature selection algorithm based on rough set theory

BAI Hexiang1, WANG Jian1, LI Deyu1,2, CHEN Qian1   

  1. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China;
    2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan Shanxi 030006, China
  • Received: 2015-03-01 Revised: 2015-05-08 Online: 2015-08-14 Published: 2015-08-10


白鹤翔1, 王健1, 李德玉1,2, 陈千1   

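  • Corresponding author: BAI Hexiang (born 1980), male, from Jinzhong, Shanxi; lecturer, Ph.D., CCF member. Main research interest: spatial data mining based on rough set theory.
  • About the authors: WANG Jian (born 1992), male, from Linfen, Shanxi; master's student. Main research interest: computer image pattern recognition. LI Deyu (born 1965), from Linfen, Shanxi; professor, Ph.D. Main research interests: intelligent computing, pattern recognition. CHEN Qian (born 1983), male, from Huanggang, Hubei; lecturer, Ph.D. Main research interests: text mining, topic detection.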
To address the problem that feature selection on the large-scale data sets commonly encountered in "big data" is too slow to meet practical requirements, a fast feature selection algorithm for unsupervised massive data sets was proposed, based on the incremental absolute reduction algorithm of traditional rough set theory. Firstly, the large-scale data set was regarded as a random sequence of objects, and the candidate reduct was initialized to the empty set. Secondly, objects were drawn at random from the data set one at a time, without replacement. Each drawn object was checked to see whether the candidate reduct could distinguish it from the objects in the current object set, and was then merged into that set; if the candidate reduct could not distinguish the new object, an attribute that could distinguish it was added to the candidate reduct. Finally, once I successive objects had all been distinguishable using the candidate reduct, the candidate reduct was taken as the reduct of the large-scale data set. Experiments on five unsupervised large-scale data sets demonstrated that a reduct able to distinguish no less than 95% of object pairs could be found within 1% of the time needed by the discernibility-matrix-based algorithm and the incremental absolute reduction algorithm. In the text topic mining experiment, the topics found from the reduced data set were consistent with those found from the original data set. The experimental results show that the proposed algorithm can obtain effective reducts for large-scale data sets in practical time.
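
The procedure described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fast_reduct`, the parameter `patience` (standing in for the threshold I), the rule of adding the first attribute that separates a clashing pair, and the linear scan over previously drawn objects are all choices made here for brevity. The abstract does not specify how the new attribute is selected.

```python
import random


def fast_reduct(objects, patience, seed=0):
    """Return a sorted list of attribute indices forming a candidate reduct.

    objects  : sequence of equal-length tuples (one tuple per object)
    patience : hypothetical name for the threshold I in the abstract
    seed     : seed for the random draw order (assumption, for repeatability)
    """
    rng = random.Random(seed)
    order = list(objects)
    rng.shuffle(order)  # a random permutation models drawing without replacement

    reduct = []   # candidate reduct, initially empty
    seen = set()  # distinct objects drawn so far (the current object set)
    streak = 0    # number of successive objects that needed no new attribute

    for x in order:
        changed = False
        while True:
            key = tuple(x[a] for a in reduct)
            # Look for an earlier object that differs from x on the full
            # attribute set but looks identical under the candidate reduct.
            clash = next(
                (y for y in seen
                 if y != x and tuple(y[a] for a in reduct) == key),
                None,
            )
            if clash is None:
                break
            # Assumption: add the first attribute on which the pair differs.
            new_attr = next(
                a for a in range(len(x))
                if x[a] != clash[a] and a not in reduct
            )
            reduct.append(new_attr)
            changed = True
        seen.add(x)
        streak = 0 if changed else streak + 1
        if streak >= patience:
            break  # I successive objects were distinguishable: stop early
    return sorted(reduct)


if __name__ == "__main__":
    data = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
    print(fast_reduct(data, patience=10))
```

With a small `patience` the loop may stop before every object has been drawn, which is what trades exactness for speed; the returned attribute set is then only guaranteed to distinguish the objects seen so far.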

Key words: massive data, absolute reduct, incremental algorithm, rough set, feature selection




