《计算机应用》唯一官方网站 ›› 2022, Vol. 42 ›› Issue (2): 389-394.DOI: 10.11772/j.issn.1001-9081.2021071188

• 人工智能 • 上一篇    

基于哈希学习的投票样例选择算法

黄雅婕1, 翟俊海1,2, 周翔1, 李艳1,2,3   

  1. 1.河北大学 数学与信息科学学院, 河北 保定 071002
    2.河北省机器学习与计算智能重点实验室(河北大学), 河北 保定 071002
    3.北京师范大学珠海校区 应用数学与交叉科学研究中心, 广东 珠海 519087
  • 收稿日期:2021-07-09 修回日期:2021-08-23 接受日期:2021-08-27 发布日期:2021-11-02 出版日期:2022-02-10
  • 作者简介:黄雅婕(1996—),女,河北唐山人,硕士研究生,主要研究方向:云计算、大数据处理;
    翟俊海(1964—),男,河北易县人,教授,博士,CCF会员,主要研究方向:机器学习、云计算、大数据处理、深度学习;
    周翔(1995—),男,河北保定人,硕士研究生,主要研究方向:云计算、大数据处理;
    李艳(1976—),女,河北衡水人,教授,博士,CCF会员,主要研究方向:机器学习、不确定性信息处理。
  • 基金资助:
    河北省科技计划项目重点研发专项(19210310D);河北省自然科学基金资助项目(F2018201096);河北大学研究生创新资助项目(hbu2019ss077)

Voting instance selection algorithm based on learning to hash

Yajie HUANG1, Junhai ZHAI1,2, Xiang ZHOU1, Yan LI1,2,3   

  1. 1.College of Mathematics and Information Science,Baoding Hebei 071002,China
    2.Key Laboratory of Machine Learning and Computational Intelligence (Hebei University),Baoding Hebei 071002,China
    3.Research Center for Applied Mathematics and Interdisciplinary Sciences,Beijing Normal University at Zhuhai,Zhuhai Guangzhou 519087,China
  • Received:2021-07-09 Revised:2021-08-23 Accepted:2021-08-27 Online:2021-11-02 Published:2022-02-10
  • Supported by:
    Key Research and Development Program of Science and Technology Project of Hebei Province(19210310D);Natural Science Foundation of Hebei Province(F2018201096);Graduate Innovation Foundation of Hebei University(hbu2019ss077)

摘要:

随着数据的海量型增长,如何存储并利用数据成为目前学术研究和工业应用等方面的热门问题。样例选择是解决此类问题的方法之一,它在原始数据中依据既定规则选出代表性的样例,从而有效地降低后续工作的难度。基于此,提出一种基于哈希学习的投票样例选择算法。首先通过主成分分析(PCA)方法将高维数据映射到低维空间;然后利用k-means算法结合矢量量化方法进行迭代运算,并将数据用聚类中心的哈希码表示;接着将分类后的数据按比例进行随机选择,在多次独立运行算法后投票选择出最终的样例。与压缩近邻(CNN)算法和大数据线性复杂度样例选择算法LSH-IS-F相比,所提算法在压缩比方面平均提升了19%。所提算法思想简单容易实现,能够通过调节参数自主控制压缩比。在7个数据集上的实验结果显示所提算法在测试精度相似的情况下在压缩比和运行时间方面较随机哈希有较大优势。

关键词: 样例选择, 哈希学习, 海明距离, 矢量量化, 投票方法

Abstract:

With the massive growth of data, how to store and use data has become a hot issue in academic research and industrial applications. As one of the methods to solve these problems, instance selection effectively reduces the difficulty of follow-up work by selecting representative instances from original data according to the established rules. Therefore, a voting instance selection algorithm based on learning to hash was proposed. Firstly, the Principal Component Analysis (PCA) method was used to map high-dimensional data to low-dimensional space. Secondly, the k-means algorithm was used to perform iterative operations by combining with the vector quantization method, and the hash codes of the cluster center were used to represent the data. After that, the classified data were randomly selected according to the proportion, and the final instances were selected by voting after several times independent running of the algorithm. Compared with the Compressed Nearest Neighbor (CNN) algorithm and the instance selection algorithm of linear complexity for big data named LSH-IS-F (Instance Selection algorithm by Hashing with two passes), the proposed algorithm has the compression ratio improved by an average of 19%. The idea of the proposed algorithm is simple and easy to implement, and the algorithm can control the compression ratio automatically by adjusting the parameters. Experimental results on 7 datasets show that the proposed algorithm has a great advantage compared to random hashing in terms of compression ratio and running time with similar test accuracy.

Key words: instance selection, learning to hash, Hamming distance, vector quantization, voting method

中图分类号: