Journal of Computer Applications (计算机应用), 2011, Vol. 31, Issue (05): 1409-1412. DOI: 10.3724/SP.J.1087.2011.01409

• Database Technology •

Data discretization method based on statistical correlation coefficient

XIE Ya-ping

  1. Computer Center, Lanzhou Vocational and Technology College of Resources and Environment, Lanzhou Gansu 730021, China
  • Received: 2010-09-14 Revised: 2010-11-08 Online: 2011-05-01 Published: 2011-05-01
  • Corresponding author: XIE Ya-ping
  • About the author: XIE Ya-ping (born 1965), female, from Zhouzhi, Shaanxi; associate professor. Main research interests: data mining, pattern recognition, computer networks.

Abstract: Many data mining and inductive learning methods can handle only discrete attributes, so continuous attributes must be discretized first. This paper proposes a data discretization method based on the statistical correlation coefficient. Drawing on statistical correlation theory, the method captures the interdependence between the class and each attribute and uses it to select the best cut points. In addition, the Variable Precision Rough Set (VPRS) model is incorporated into the discretization process to control information loss. The method was applied to breast cancer diagnosis data and to data from other domains. The experimental results show that it significantly improves the classification accuracy of the See5 decision tree.

Key words: discretization, data mining, Class-Attribute Interdependence (CAI), Variable Precision Rough Set (VPRS), decision tree
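
The abstract does not state the correlation formula or the VPRS criterion, so the following Python sketch only illustrates the general idea and is not the author's algorithm. It assumes Cramér's V as a stand-in class-attribute correlation measure, assumes candidate cut points are midpoints between adjacent distinct values, and assumes the VPRS check is the share of objects lying in intervals whose majority-class proportion reaches a threshold `beta`. The names `cramers_v`, `best_cut`, and `vprs_quality` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from math import sqrt


def cramers_v(values, labels, cut):
    """Cramér's V between the binary split (value <= cut) and the class label.

    Assumption: a stand-in for the paper's correlation coefficient.
    """
    n = len(values)
    sides = [v <= cut for v in values]
    cell = Counter(zip(sides, labels))   # joint counts
    row = Counter(sides)                 # counts per side of the cut
    col = Counter(labels)                # counts per class
    if len(row) < 2 or len(col) < 2:
        return 0.0                       # degenerate table: no association
    chi2 = 0.0
    for s in row:
        for c in col:
            expected = row[s] * col[c] / n
            chi2 += (cell[(s, c)] - expected) ** 2 / expected
    # The split has two sides, so min(rows, cols) - 1 equals 1.
    return sqrt(chi2 / n)


def best_cut(values, labels):
    """Return the midpoint cut with the highest class-attribute correlation."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not candidates:
        return None
    return max(candidates, key=lambda cut: cramers_v(values, labels, cut))


def vprs_quality(values, labels, cuts, beta=0.8):
    """Share of objects in intervals whose majority class proportion >= beta.

    Assumption: beta in (0.5, 1] is used as the VPRS inclusion threshold.
    """
    cuts = sorted(cuts)
    intervals = {}
    for v, y in zip(values, labels):
        # bisect_left places v == cut in the left interval, matching v <= cut.
        intervals.setdefault(bisect_left(cuts, v), Counter())[y] += 1
    covered = sum(
        sum(counts.values())
        for counts in intervals.values()
        if max(counts.values()) / sum(counts.values()) >= beta
    )
    return covered / len(values)
```

In such a sketch, a discretizer could call `best_cut` repeatedly on each resulting interval and stop adding cut points once `vprs_quality` stays at or above a chosen level. A lower `beta` tolerates more class noise inside an interval, which is the sense in which a variable-precision criterion limits information loss without demanding perfectly pure intervals.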