计算机应用 ›› 2015, Vol. 35 ›› Issue (8): 2355-2359.DOI: 10.11772/j.issn.1001-9081.2015.08.2355

• 数据技术 • 上一篇    下一篇

基于粗糙集的非监督快速属性选择算法

白鹤翔1, 王健1, 李德玉1,2, 陈千1   

  1. 1. 山西大学 计算机与信息技术学院, 太原 030006;
    2. 计算智能与中文信息处理教育部重点实验室(山西大学), 太原 030006
  • 收稿日期:2015-03-01 修回日期:2015-05-08 出版日期:2015-08-10 发布日期:2015-08-14
  • 通讯作者: 白鹤翔(1980-),男,山西晋中人,讲师,博士,CCF会员,主要研究方向:基于粗糙集理论的空间数据挖掘,baihx@sxu.edu.cn
  • 作者简介:王健(1992-),男,山西临汾人,硕士研究生,主要研究方向:计算机图像模式识别; 李德玉(1965-),山西临汾人,教授,博士,主要研究方向:智能计算、模式识别; 陈千(1983-),男,湖北黄冈人,讲师,博士,主要研究方向:文本挖掘、主题检测。
  • 基金资助:

    国家自然科学基金资助项目(41101440,61272095,61403238);山西省青年科技基金资助项目(2014021022-1);中国博士后科学基金资助项目(2013M530891)。

Fast unsupervised feature selection algorithm based on rough set theory

BAI Hexiang1, WANG Jian1, LI Deyu1,2, CHEN Qian1   

  1. 1. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China;
    2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan Shanxi 030006, China
  • Received:2015-03-01 Revised:2015-05-08 Online:2015-08-10 Published:2015-08-14

摘要:

针对"大数据"中常见的大规模无监督数据集中特征选择速度难以满足实际应用要求的问题,在经典粗糙集绝对约简增量式算法的基础上提出了一种快速的属性选择算法。首先,将大规模数据集看作一个随机到来的对象序列,并初始化候选约简为空集;然后每次都从大规模数据集中无放回地随机抽取一个对象,并且每次都判断使用当前候选约简能否区分这一对象和当前对象集中所有应当区分的对象,并将该对象放入到当前对象集中,如果不能区分则向候选约简中添加合适的属性;最后,如果连续I次都没有发现无法区分的对象,那么将候选约简作为大规模数据集的约简。在5个非监督大规模数据集上的实验表明,所求得的约简能够区分95%以上的对象对,并且求取该约简所需的时间不到基于区分矩阵的算法和增量式约简算法的1%;在文本主题挖掘的实验中,使用约简后的数据集挖掘出的文本主题同原始数据集挖掘出的主题基本一致。两组实验结果表明该方法能够有效快速对大规模数据集进行属性选择。

关键词: 海量数据, 绝对约简, 增量式算法, 粗糙集, 属性选择

Abstract:

Focusing on the issue that feature selection for the usually encountered large scale data sets in the "big data" is too slow to meet the practical requirements, a fast feature selection algorithm for unsupervised massive data sets was proposed based on the incremental absolute reduction algorithm in traditional rough set theory. Firstly, the large scale data set was regarded as a random object sequence and the candidate reduct was set empty. Secondly, random object was one by one drawn from the large scale data set without replacement; next, each random drawn object was checked if it could be distinguished with the other objects in the current object set and then merged with current object set, if the new object could not be distinguished using the candidate reduct, a new attribute that can distinguish the new object should be added into the candidate reduct. Finally, if successive I objects were distinguishable using the candidate reduct, the candidate reduct was used as the reduct of the large scale data set. Experiments on five unsupervised large-scale data sets demonstrated that a reduct which can distinguish no less than 95% object pairs could be found within 1% time needed by the discernibility matrix based algorithm and incremental absolute reduction algorithm. In the experiment of the text topic mining, the topic found by the reducted data set was consistent with that of the original data set. The experimental results show that the proposed algorithm can obtain effective reducts for large scale data set in practical time.

Key words: massive data, absolute reduct, incremental algorithm, rough set, feature selection

中图分类号: