Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (9): 2507-2510. DOI: 10.11772/j.issn.1001-9081.2018020460

• Data Science and Technology •

Big data correlation mining algorithm based on factorial design

TANG Xiaochuan, LUO Liang

  1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
  • Received: 2018-03-07 Revised: 2018-03-27 Online: 2018-09-10 Published: 2018-09-06
  • Contact: TANG Xiaochuan
  • About the authors: TANG Xiaochuan (born 1986), male, from Chengdu, Sichuan; Ph.D. candidate and CCF member; research interests: feature selection, machine learning, and big data analysis. LUO Liang (born 1980), male, from Hanzhong, Shaanxi; lecturer, Ph.D.; research interests: cloud computing reliability modeling and big data processing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61602094).

Abstract: To address dimensionality reduction for high-dimensional big data, a feature selection algorithm based on statistical factorial design, named Full Factorial Design (FFD), was proposed. Firstly, the factor effect of the factorial design was used as the measure of correlation between features and the target variable in filter feature selection. Secondly, a divide-and-conquer algorithm was proposed to search for the optimal factorial design for a given input dataset. Thirdly, because traditional experimental design requires experiments to be carried out manually, a data-driven approach was proposed to automatically search the input dataset for the response values of the factorial design. Finally, the factor effects were calculated from the design matrix and the average response values, and the features and interactions were ranked by their factor effects to obtain the significant ones. The experimental results show that the average classification error rate of FFD was 2.95, 3.33 and 6.62 percentage points lower than that of Mutual Information Maximisation (MIM), Joint Mutual Information Maximisation (JMIM) and ReliefF, respectively. Therefore, FFD can effectively identify significant features and interactions that are correlated with the target variable in real-world datasets.

Key words: big data, correlation, feature selection, interaction, factorial design
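
The last two steps of the abstract (reading average response values from the data, then computing and ranking factor effects) can be illustrated with a small sketch. This is not the authors' FFD implementation: it uses the textbook two-level full factorial effect estimate, omits the divide-and-conquer design search, and assumes that every feature is already coded as -1 or +1 and that every level combination occurs in the data. The function name `factor_effects` and its parameters are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def factor_effects(rows, targets, order=2):
    """Rank features and interactions by two-level factorial effects.

    rows    : list of tuples with entries coded -1 or +1 (one per sample)
    targets : list of numeric responses, same length as rows
    order   : largest interaction order to evaluate (1 = main effects only)
    """
    k = len(rows[0])

    # Data-driven responses: group samples by level combination and
    # average the target within each group.
    groups = {}
    for x, y in zip(rows, targets):
        groups.setdefault(tuple(x), []).append(y)

    # Full 2^k design matrix; this sketch requires every run to be observed.
    design = list(product((-1, 1), repeat=k))
    missing = [run for run in design if run not in groups]
    if missing:
        raise ValueError(f"no samples for runs: {missing}")
    response = [mean(groups[run]) for run in design]

    effects = {}
    for size in range(1, order + 1):
        for term in combinations(range(k), size):
            # Contrast column = elementwise product of the factor columns.
            contrast = 0.0
            for run, y in zip(design, response):
                sign = 1
                for j in term:
                    sign *= run[j]
                contrast += sign * y
            # Effect = mean response at +1 minus mean response at -1.
            effects[term] = contrast / (len(design) / 2)

    # Largest absolute effect first.
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy data (assumed): y depends on feature 0 and on the 1x2 interaction.
X = list(product((-1, 1), repeat=3)) * 2
y = [2 * a + 3 * b * c for a, b, c in X]
ranking = factor_effects(X, y)
print(ranking[:2])  # [((1, 2), 6.0), ((0,), 4.0)]
```

In this toy case the interaction between features 1 and 2 ranks first even though neither feature has a main effect of its own, which is the kind of relationship a purely univariate filter score would miss.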
