Journal of Computer Applications

Two-stage data selection method for classifier with high performance and low energy consumption

  

  • Received: 2024-06-28 Revised: 2024-08-16 Online: 2024-11-08 Published: 2024-11-08

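Cui Shuangshuang, Wang Hongzhi, Zhu Jiahao, Wu Hao

  1. Harbin Institute of Technology
  • Corresponding author: Wang Hongzhi
  • Supported by:
    National Natural Science Foundation of China; National Natural Science Foundation of China; National Key Research and Development Program of China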

Abstract: To address the large training sets, long training times, and high carbon emissions involved in building classification models from massive data, a two-stage sample selection method, TSDS (Two-Stage Data Selection), was proposed to achieve low energy consumption and high classifier performance. First, cluster centers were determined using the adjusted cosine similarity, and the samples were partitioned by divisive hierarchical clustering based on dissimilar points. Then, the clustering results were adaptively sampled according to the data distribution to form a high-quality subset. Finally, the classification model was trained on this subset, which accelerated training while improving model accuracy. To verify the performance of TSDS, Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP) classifiers were built on six datasets, including Spambase, Bupa, and Phoneme. The experimental results show that TSDS improves classification accuracy by 3 to 10 percentage points even at a sample compression ratio of 85%, while also accelerating training: energy consumption for training is reduced by 93.76% for SVM classifiers and by 75.41% for MLP classifiers. In big-data classification tasks, TSDS thus shortens training time, reduces energy consumption, and improves classifier performance, contributing to the "carbon peaking and carbon neutrality" goal.
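
The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact similarity formula, the splitting rule, or the sampling rule, so the choices below are assumptions. Specifically, "adjusted cosine similarity" is taken to mean cosine similarity after subtracting each feature's mean, each cluster is split around its least similar pair of points, and the adaptive sampling step is approximated by drawing a fixed fraction per class within each cluster. The function names and the parameters `max_size` and `keep_ratio` are hypothetical.

```python
import numpy as np

def adjusted_cosine(X):
    """Pairwise cosine similarity after subtracting each feature's mean
    (an assumed reading of 'adjusted cosine similarity')."""
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    U = Xc / norms
    return U @ U.T

def divisive_clusters(X, max_size=50):
    """Stage 1 (assumed rule): repeatedly split a cluster around its two
    least similar points until every cluster has at most max_size points."""
    S = adjusted_cosine(X)
    pending, done = [np.arange(len(X))], []
    while pending:
        idx = pending.pop()
        if len(idx) <= max_size:
            done.append(idx)
            continue
        sub = S[np.ix_(idx, idx)]
        # The least similar pair acts as the two seeds of the split.
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        to_a = sub[:, a] >= sub[:, b]
        if to_a.all() or not to_a.any():
            done.append(idx)  # degenerate cluster: cannot be split further
            continue
        pending += [idx[to_a], idx[~to_a]]
    return done

def adaptive_sample(clusters, y, keep_ratio=0.15, seed=0):
    """Stage 2 (assumed rule): within each cluster, keep a fraction of the
    samples of every class, and at least one sample per class present."""
    rng = np.random.default_rng(seed)
    chosen = []
    for idx in clusters:
        for label in np.unique(y[idx]):
            members = idx[y[idx] == label]
            k = max(1, int(round(keep_ratio * len(members))))
            chosen.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(chosen))
```

Given a feature matrix `X` and labels `y`, calling `adaptive_sample(divisive_clusters(X), y)` returns the indices of the selected subset, and `X[subset], y[subset]` would then be passed to the classifier in place of the full training set. Note that this sketch builds the full pairwise similarity matrix, so it is only practical for moderately sized data.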

Key words: Classifier, Hierarchical clustering, Adaptive sampling, Data selection, Few-shot learning
