Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (6): 1613-1617. DOI: 10.11772/j.issn.1001-9081.2014.06.1613

• Artificial Intelligence •

High-dimensional data visualization based on random forest

LYV Bing, WANG Huazhen

  1. College of Computer Science and Technology, Huaqiao University, Xiamen, Fujian 361021, China
  • Received: 2013-12-23 Revised: 2014-02-06 Online: 2014-06-01 Published: 2014-07-02
  • Corresponding author: LYV Bing
  • About the authors: LYV Bing (born 1990), male, from Tongcheng, Anhui, master's student; research interests: data mining, machine learning, software engineering. WANG Huazhen (born 1975), female, from Quanzhou, Fujian, lecturer, Ph.D., CCF member; research interests: machine learning, pattern recognition, data mining.
  • Supported by:

    Natural Science Foundation of Fujian Province; Scientific Research Start-up Fund for High-Level Talents of Huaqiao University

Abstract:

Most current methods for mining high-dimensional data rely on mathematical theory rather than visual intuition. To make high-dimensional data easier to analyze and evaluate visually, Random Forest (RF) is introduced to visualize such data. First, an RF is trained by supervised learning to obtain a proximity (similarity) measure between samples, and principal coordinate analysis is applied to this measure for dimension reduction, which maps the relationships among the high-dimensional samples into a low-dimensional space. Then, scatter plots are used to visualize the data in the low-dimensional space. Experiments on high-dimensional gene datasets show that visualization based on RF supervised dimension reduction reveals the class distribution of high-dimensional data well and outperforms visualization after traditional unsupervised dimension reduction.
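The two computational steps named in the abstract, the RF proximity measure and principal coordinate analysis, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. The function names `rf_proximity` and `pcoa` and the toy `leaves` table are hypothetical, and the choice of `1 - proximity` as the squared distance is an assumption, since the abstract does not state the conversion used. Training the forest is omitted: the sketch starts from a table of leaf indices, which in practice could come from a fitted forest (for example, scikit-learn's `RandomForestClassifier.apply`).

```python
import math
import random


def rf_proximity(leaves):
    """leaves[i][t] is the leaf reached by sample i in tree t.

    Entry (i, j) of the result is the fraction of trees in which
    samples i and j fall into the same leaf.
    """
    n_trees = len(leaves[0])
    return [[sum(a == b for a, b in zip(ri, rj)) / n_trees for rj in leaves]
            for ri in leaves]


def pcoa(prox, k=2, iters=500, seed=0):
    """Principal coordinate analysis on squared distances 1 - prox (assumption)."""
    n = len(prox)
    d2 = [[1.0 - p for p in row] for row in prox]
    row_mean = [sum(r) / n for r in d2]
    grand = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D2 * J, with J = I - (1/n) * ones.
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for c in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        lam = 0.0
        for _ in range(iters):  # power iteration for the leading eigenpair
            w = [sum(bij * vj for bij, vj in zip(b_row, v)) for b_row in b]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left along this axis
                lam = 0.0
                break
            v = [x / norm for x in w]
            lam = norm  # |B v| approaches the leading eigenvalue as v converges
        for i in range(n):
            coords[i][c] = math.sqrt(lam) * v[i]
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]  # deflate before the next axis
    return coords


# Hypothetical toy input: 6 samples x 4 trees, classes A (rows 0-2) and B (rows 3-5).
leaves = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 2, 1, 1],
    [1, 2, 1, 1],
    [1, 2, 2, 1],
]
prox = rf_proximity(leaves)
coords = pcoa(prox, k=2)
for label, (x, y) in zip("AAABBB", coords):
    print(label, round(x, 3), round(y, 3))
```

The printed pairs are the 2-D coordinates that a scatter plot would display, with each point colored by its class label. Power iteration is used only to keep the sketch dependency-free; a numerical library's symmetric eigensolver would be the usual choice for real data.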

CLC number: