Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (8): 2184-2187. DOI: 10.11772/j.issn.1001-9081.2014.08.2184

• Papers from the 5th China Conference on Data Mining (CCDM 2014) •

Unsupervised discretization algorithm based on ensemble learning

XU Yingying 1,2, ZHONG Caiming 1,2

  1. College of Science and Technology, Ningbo University, Ningbo, Zhejiang 315210, China
    2. College of Information Science and Engineering, Ningbo University, Ningbo, Zhejiang 315210, China
  • Received: 2014-04-30  Revised: 2014-05-08  Online: 2014-08-01  Published: 2014-08-10
  • Corresponding author: XU Yingying
  • About the authors: XU Yingying (1990-), female, born in Tongcheng, Anhui, M.S. candidate; her research interests include machine learning and pattern recognition. ZHONG Caiming (1970-), male, born in Ningbo, Zhejiang, associate professor, Ph.D.; his research interests include pattern recognition and machine learning.
  • Supported by:

    National Natural Science Foundation of China

Abstract:

Some algorithms in pattern recognition and machine learning can only handle discrete attribute values, whereas many real-world data sets consist of continuous values. To address this discretization problem, an unsupervised method was proposed. First, K-means was used to partition the data set into several subgroups so as to obtain class label information; then a supervised discretization algorithm was applied to the partitioned data. Repeating this process produced multiple discretization results, which were then combined with an ensemble technique. Finally, the resulting minimum sub-intervals were merged, where the dimension to merge first and the adjacent intervals to merge were chosen according to the neighbor relationships among the data; the number of sub-intervals was estimated automatically from these neighbor relationships so that the intrinsic structure of the data set was preserved as far as possible. The discretized data were then fed to clustering algorithms such as spectral clustering, and the clustering quality was evaluated. The experimental results show that the clustering accuracy of the proposed algorithm is about 33% higher on average than that of four other methods, which demonstrates its feasibility and effectiveness. The discretized data obtained by the algorithm can also be used in data mining algorithms such as the ID3 decision tree algorithm.
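
A minimal sketch of the pipeline described in the abstract is given below, assuming Python with scikit-learn. It is an illustration, not the authors' implementation: K-means supplies pseudo-labels, a shallow decision tree stands in for the supervised discretizer (the paper does not prescribe this particular discretizer), the union of cut points over several runs plays the role of the ensemble's minimum sub-intervals, and the final neighbor-based merging step is omitted. The function names and parameters (supervised_cut_points, ensemble_discretize, n_runs, k, max_bins) are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def supervised_cut_points(x, labels, max_bins=8):
    # Fit a shallow decision tree on a single continuous feature against the
    # K-means pseudo-labels and read its split thresholds as cut points
    # (a stand-in for any supervised discretizer).
    tree = DecisionTreeClassifier(max_leaf_nodes=max_bins, random_state=0)
    tree.fit(x.reshape(-1, 1), labels)
    t = tree.tree_
    return np.sort(t.threshold[t.feature == 0])  # internal split nodes only

def ensemble_discretize(X, n_runs=10, k=5):
    # Repeat: K-means pseudo-labels -> per-feature supervised cut points.
    # Pooling the cut points of all runs splits every feature into the
    # ensemble's minimum sub-intervals.
    n_features = X.shape[1]
    cuts = [set() for _ in range(n_features)]
    for run in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=10, random_state=run).fit_predict(X)
        for j in range(n_features):
            cuts[j].update(supervised_cut_points(X[:, j], labels).tolist())
    # Map each value to the index of the minimum sub-interval it falls into.
    return np.column_stack(
        [np.digitize(X[:, j], np.sort(list(cuts[j]))) for j in range(n_features)]
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))       # toy continuous data
    print(ensemble_discretize(X)[:5])   # first five discretized rows

In the paper itself, these minimum sub-intervals are subsequently merged, with the dimension and adjacent intervals to merge chosen from the neighbor relationships among the data, which also fixes the final number of intervals; the sketch stops at the pooled cut points.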

CLC number: