Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (1): 74-80. DOI: 10.11772/j.issn.1001-9081.2020060982

Special Topic: The 8th China Conference on Data Mining (CCDM 2020)


Instance selection algorithm for big data based on random forest and voting mechanism

ZHOU Xiang1,2, ZHAI Junhai1,2, HUANG Yajie1,2, SHEN Ruicai1,2, HOU Yingzhen1,2

  1. College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China;
    2. Hebei Key Laboratory of Machine Learning and Computational Intelligence (Hebei University), Baoding, Hebei 071002, China
  • Received: 2020-05-31  Revised: 2020-08-04  Online: 2021-01-10  Published: 2020-09-02
  • Corresponding author: ZHAI Junhai
  • About the authors: ZHOU Xiang, born in 1995 in Baoding, Hebei, M. S. candidate. His research interests include cloud computing and big data processing. ZHAI Junhai, born in 1964 in Yixian, Hebei, Ph. D., professor, CCF member. His research interests include machine learning, cloud computing and big data processing, and deep learning. HUANG Yajie, born in 1996 in Tangshan, Hebei, M. S. candidate. Her research interests include cloud computing and big data processing. SHEN Ruicai, born in 1993 in Handan, Hebei, M. S. candidate. Her research interests include deep learning. HOU Yingzhen, born in 1995 in Tangshan, Hebei, M. S. candidate. Her research interests include deep learning.
  • Supported by:
    This work is partially supported by the Key Research and Development Program of Hebei Province (19210310D) and the Hebei University Graduate Innovation Funding Project (hbu2020ss045).


Abstract: To deal with the problem of instance selection for big data, an instance selection algorithm based on Random Forest (RF) and a voting mechanism was proposed. Firstly, the big dataset was divided into two subsets: a large first subset and a small or medium-sized second subset. Then, the large first subset was divided into q smaller subsets, which were deployed to q cloud computing nodes, while the small or medium-sized second subset was broadcast to all q nodes. Next, a random forest was trained at each node on its local data subset and used to select instances from the broadcast subset, and the instances selected at the different nodes were merged into the subset of instances selected in this round. The above process was repeated p times to obtain p subsets of selected instances. Finally, voting over these p subsets produced the final set of selected instances. The proposed algorithm was implemented on the two big data platforms Hadoop and Spark, and the implementation mechanisms of the two platforms were compared. In addition, the proposed algorithm was compared with the Condensed Nearest Neighbor (CNN) algorithm and the Reduced Nearest Neighbor (RNN) algorithm on six large datasets. Experimental results show that the larger the dataset, the higher the test accuracy and the shorter the time consumption of the proposed algorithm compared with these two algorithms. This demonstrates that the proposed algorithm has good generalization ability and high running efficiency in big data processing, and can effectively solve the instance selection problem for big data.
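A minimal single-machine sketch of this partition / per-node selection / voting workflow is given below, assuming NumPy arrays and scikit-learn's RandomForestClassifier for the per-node forests. The function name select_instances, the rule of keeping broadcast instances that the local forest misclassifies, and the majority-vote threshold are illustrative assumptions, since the abstract does not specify the exact selection criterion; the actual algorithm runs the per-node steps in parallel on Hadoop or Spark rather than in a loop.

# Sketch of the workflow described in the abstract; the "cloud nodes" are
# simulated by a loop, and the selection rule and voting threshold are
# placeholder assumptions, not the paper's exact criteria.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_instances(X, y, q=4, p=5, broadcast_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: split the data into a large first subset and a small/medium
    # second subset; the second one plays the role of the broadcast subset.
    idx = rng.permutation(n)
    m = int(n * broadcast_frac)
    bcast_idx, large_idx = idx[:m], idx[m:]
    X_b, y_b = X[bcast_idx], y[bcast_idx]

    votes = np.zeros(m, dtype=int)
    for round_id in range(p):                       # repeat the process p times
        # Step 2: partition the large subset into q chunks, one per "node".
        chunks = np.array_split(rng.permutation(large_idx), q)
        selected = np.zeros(m, dtype=bool)
        for chunk in chunks:
            # Step 3: train a local random forest on the node's data subset.
            rf = RandomForestClassifier(n_estimators=50, random_state=round_id)
            rf.fit(X[chunk], y[chunk])
            # Step 4 (assumed rule): keep broadcast instances the local forest
            # misclassifies, treating them as informative border points.
            selected |= rf.predict(X_b) != y_b
        votes += selected                           # merge the node selections
    # Step 5 (assumed threshold): keep instances chosen in a majority of rounds.
    return bcast_idx[votes > p // 2]

Calling select_instances(X, y) on NumPy arrays returns indices into X of the instances kept by the final vote; in the distributed setting the inner loop would run concurrently across the cloud computing nodes.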

Key words: big data, instance selection, decision tree, Random Forest (RF), voting mechanism

CLC Number: