Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (6): 1703-1711. DOI: 10.11772/j.issn.1001-9081.2024060883

• The 12th CCF Conference on Big Data •

Two-stage data selection method for classifier with low energy consumption and high performance

Shuangshuang CUI, Hongzhi WANG, Jiahao ZHU, Hao WU

  1. Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  • Received: 2024-06-28 Revised: 2024-08-16 Accepted: 2024-08-20 Online: 2024-11-08 Published: 2025-06-10
  • Corresponding author: Hongzhi WANG
  • About the authors: CUI Shuangshuang, born in 1997, from Harbin, Heilongjiang, Ph. D. candidate, CCF member. Her research interests include databases and query optimization.
    WANG Hongzhi, born in 1978, from Harbin, Heilongjiang, Ph. D., professor, doctoral supervisor, CCF distinguished member. His research interests include databases, big data and data quality. wangzh@hit.edu.cn
    ZHU Jiahao, born in 2001, from Shanghai. His research interests include databases and query optimization.
    WU Hao, born in 2001, from Fuyang, Anhui, CCF member. His research interests include databases and query optimization.
  • Supported by:
    National Natural Science Foundation of China (62232005); National Key Research and Development Program of China (2021YFB3300502)

Abstract:

To address the large training sets, long training times and high carbon emissions involved in building classification models from massive data, a two-stage data selection method for low-energy, high-performance classifiers, TSDS (Two-Stage Data Selection), was proposed. Firstly, cluster centers were determined using adjusted cosine similarity, and the sample data were clustered by divisive hierarchical clustering based on dissimilar points. Secondly, the clustering results were sampled adaptively according to the data distribution to form a high-quality subset. Finally, the subset was used to train the classification model, which accelerated training and improved model accuracy at the same time. Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP) classification models were constructed on six datasets, including Spambase, Bupa and Phoneme, to verify the performance of TSDS. Experimental results show that, at a sample data compression ratio of 85.00%, TSDS improves classification accuracy by 3 to 10 percentage points while accelerating model training, reducing the energy consumption of training SVM classifiers by 93.76% on average and that of training MLP classifiers by 75.41% on average. TSDS can therefore shorten training time and reduce energy consumption while improving classifier performance on classification tasks in big data scenarios, thereby helping to achieve the “carbon peaking and carbon neutrality” goals.

Key words: classifier, hierarchical clustering, adaptive sampling, data selection, few-shot learning
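
The two stages named in the abstract (divisive clustering, then sampling per cluster) can be illustrated with a small standard-library sketch. This is not the authors' TSDS implementation: the abstract does not give the exact definitions, so the sketch assumes that "adjusted cosine" means cosine similarity after subtracting each feature's mean, that a cluster is split around the point with the lowest average similarity to the rest (a DIANA-style choice), and that the sampling budget is shared in proportion to cluster size. All function names are hypothetical.

```python
import math
import random

def center(rows):
    """Subtract each feature's mean (assumed form of the 'adjusted' cosine)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split(x, members):
    """Split one cluster around its most dissimilar point (assumed DIANA-style seed)."""
    sim = {i: {j: cosine(x[i], x[j]) for j in members} for i in members}
    avg = {i: (sum(sim[i].values()) - sim[i][i]) / (len(members) - 1) for i in members}
    seed = min(members, key=avg.get)  # lowest average similarity to the rest
    moved = [i for i in members if sim[i][seed] > avg[i]]
    stay = [i for i in members if sim[i][seed] <= avg[i]]
    return stay, moved

def divisive_clusters(rows, k):
    """Stage 1: repeatedly split the largest cluster until k clusters exist."""
    x = center(rows)
    clusters = [list(range(len(x)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:
            clusters.append(largest)
            break
        stay, moved = split(x, largest)
        if not stay or not moved:  # cluster cannot be separated further
            clusters.append(largest)
            break
        clusters += [stay, moved]
    return clusters

def sample_by_cluster(clusters, budget, rng):
    """Stage 2: give each cluster a share of the budget proportional to its size."""
    total = sum(len(c) for c in clusters)
    picked = []
    for c in clusters:
        quota = min(len(c), max(1, round(budget * len(c) / total)))
        picked += rng.sample(c, quota)
    return sorted(picked)  # rounding may make the total differ slightly from budget

# Toy data: two well-separated groups of three points each.
rows = [(10, 0), (12, 0), (11, 1), (0, 10), (0, 12), (1, 11)]
clusters = divisive_clusters(rows, k=2)
subset = sample_by_cluster(clusters, budget=2, rng=random.Random(0))
```

The indices in `subset` would then select the rows used to train a classifier such as an SVM or MLP. The pairwise similarity table makes each split quadratic in cluster size, which is acceptable for a toy example but would need a cheaper criterion on large datasets.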

CLC Number: