Big data correlation mining algorithm based on factorial design

Journal of Computer Applications, 2018, Vol. 38, Issue (9): 2507-2510. DOI: 10.11772/j.issn.1001-9081.2018020460

TANG Xiaochuan, LUO Liang

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 611731, China

Received: 2018-03-07 Revised: 2018-03-27 Online: 2018-09-10 Published: 2018-09-06
Contact: TANG Xiaochuan
About the authors: TANG Xiaochuan (born 1986), male, from Chengdu, Sichuan, Ph.D. candidate, CCF member; main research interests: feature selection, machine learning, big data analysis. LUO Liang (born 1980), male, from Hanzhong, Shaanxi, lecturer, Ph.D.; main research interests: cloud computing reliability modeling, big data processing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61602094).

Abstract

Abstract: To address dimensionality reduction for high-dimensional big data, a feature selection algorithm based on statistical factorial design, named Full Factorial Design (FFD), is proposed. First, the factor effects of a factorial design are used as the measure of correlation between features and the target variable in a filter-style feature selection algorithm. Second, a divide-and-conquer algorithm is proposed to search for the optimal factorial design for the input dataset. Third, because traditional experimental design requires experiments to be carried out manually, a data-driven approach is proposed that automatically searches the input dataset for the response values of the factorial design. Finally, the factor effects are calculated from the design matrix and the average response values, and features and interactions are ranked by their factor effects to obtain the significant ones. Experimental results show that the average classification error rate of FFD is 2.95 percentage points lower than that of Mutual Information Maximisation (MIM), 3.33 percentage points lower than that of Joint Mutual Information Maximisation (JMIM), and 6.62 percentage points lower than that of ReliefF. FFD can therefore effectively identify the features and interactions that are correlated with the target variable in real-world datasets.

Key words: big data, correlation, feature selection, interaction, factorial design
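The last two steps of the abstract (reading average response values from the data, then computing and ranking factor effects from the design matrix) can be illustrated with a small sketch. It is not the authors' implementation. It assumes that every feature is already coded to the two levels -1/+1, it uses a plain 2^k full factorial design instead of the paper's divide-and-conquer design search, and the function names `full_factorial`, `average_responses` and `factor_effects` are hypothetical.

```python
from itertools import combinations, product

import numpy as np


def full_factorial(k):
    """Return the 2^k design matrix with levels coded as -1/+1."""
    return np.array(list(product([-1, 1], repeat=k)))


def average_responses(X, y, design):
    """Average y over the samples whose coded features match each design run.

    Runs with no matching sample are left as NaN (assumption: such runs
    are simply ignored when effects are computed).
    """
    responses = np.full(len(design), np.nan)
    for i, run in enumerate(design):
        mask = np.all(X == run, axis=1)
        if mask.any():
            responses[i] = y[mask].mean()
    return responses


def factor_effects(design, responses, max_order=2):
    """Effect of a factor or interaction: mean response where its contrast
    column is +1 minus mean response where it is -1."""
    k = design.shape[1]
    effects = {}
    for order in range(1, max_order + 1):
        for cols in combinations(range(k), order):
            contrast = design[:, cols].prod(axis=1)
            high = np.nanmean(responses[contrast == 1])
            low = np.nanmean(responses[contrast == -1])
            effects[cols] = high - low
    return effects


# Toy data (made up for illustration): the target depends only on the
# interaction between features 0 and 1; feature 2 is irrelevant.
X = np.array(list(product([-1, 1], repeat=3)) * 5)
y = (X[:, 0] * X[:, 1] == 1).astype(float)

design = full_factorial(3)
responses = average_responses(X, y, design)
effects = factor_effects(design, responses, max_order=2)

# Rank features and interactions by the magnitude of their effect.
ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked[0])
```

In this toy dataset no single feature has an effect on its own, so a ranking based only on main effects would miss the structure, while the two-factor interaction `(0, 1)` comes out on top.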

CLC number: TP181

TANG Xiaochuan, LUO Liang. Big data correlation mining algorithm based on factorial design[J]. Journal of Computer Applications, 2018, 38(9): 2507-2510.
