Voting instance selection algorithm based on learning to hash

Journal of Computer Applications, 2022, Vol. 42, Issue (2): 389-394. DOI: 10.11772/j.issn.1001-9081.2021071188

Section: Artificial Intelligence

Yajie HUANG¹, Junhai ZHAI¹,², Xiang ZHOU¹, Yan LI¹,²,³

1. College of Mathematics and Information Science, Hebei University, Baoding Hebei 071002, China
2. Hebei Key Laboratory of Machine Learning and Computational Intelligence (Hebei University), Baoding Hebei 071002, China
3. Research Center for Applied Mathematics and Interdisciplinary Sciences, Beijing Normal University at Zhuhai, Zhuhai Guangdong 519087, China

Received: 2021-07-09 Revised: 2021-08-23 Accepted: 2021-08-27 Online: 2021-11-02 Published: 2022-02-10

About the authors:
Yajie HUANG (born 1996), female, from Tangshan, Hebei, M.S. candidate. Research interests: cloud computing, big data processing.
Junhai ZHAI (born 1964), male, from Yixian, Hebei, professor, Ph.D., CCF member. Research interests: machine learning, cloud computing, big data processing, deep learning.
Xiang ZHOU (born 1995), male, from Baoding, Hebei, M.S. candidate. Research interests: cloud computing, big data processing.
Yan LI (born 1976), female, from Hengshui, Hebei, professor, Ph.D., CCF member. Research interests: machine learning, uncertain information processing.

Supported by:
Key Research and Development Program of Science and Technology Project of Hebei Province (19210310D); Natural Science Foundation of Hebei Province (F2018201096); Graduate Innovation Foundation of Hebei University (hbu2019ss077)

Abstract

With the massive growth of data, how to store and make use of data has become a prominent problem in both academic research and industrial applications. Instance selection is one way to address it: representative instances are selected from the original data according to established rules, which effectively reduces the difficulty of subsequent work. On this basis, a voting instance selection algorithm based on learning to hash was proposed. Firstly, Principal Component Analysis (PCA) was used to map high-dimensional data to a low-dimensional space. Secondly, the k-means algorithm was combined with vector quantization in an iterative procedure, and each data point was represented by the hash code of its cluster center. Then, instances were randomly selected in proportion from the data grouped in this way, and the final instances were chosen by voting over several independent runs of the algorithm. Compared with the Condensed Nearest Neighbor (CNN) algorithm and LSH-IS-F (Instance Selection algorithm by Hashing with two passes), a linear-complexity instance selection algorithm for big data, the proposed algorithm improved the compression ratio by 19% on average. The proposed algorithm is simple in concept and easy to implement, and its compression ratio can be controlled by the user through parameter adjustment. Experimental results on 7 datasets show that, at similar test accuracy, the proposed algorithm has a clear advantage over random hashing in compression ratio and running time.

Key words: instance selection, learning to hash, Hamming distance, vector quantization, voting method
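
The last two steps described in the abstract (proportional random selection within each group of identical hash codes, followed by voting over several runs) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sample_buckets` and `vote_select`, the `ratio` and `min_votes` parameters, the rule of keeping at least one instance per bucket, and the toy hash codes are all assumptions made for illustration. The hash codes are taken as given, so the PCA and k-means quantization steps are not shown.

```python
import math
import random
from collections import Counter, defaultdict


def sample_buckets(codes, ratio, rng):
    """Randomly keep a fraction `ratio` of the instances in each hash bucket."""
    buckets = defaultdict(list)
    for idx, code in enumerate(codes):
        buckets[code].append(idx)
    chosen = set()
    for members in buckets.values():
        # Assumption: keep at least one instance so no bucket vanishes entirely.
        k = min(len(members), max(1, math.ceil(ratio * len(members))))
        chosen.update(rng.sample(members, k))
    return chosen


def vote_select(codes_per_run, ratio, min_votes, seed=0):
    """Keep the instances that were chosen in at least `min_votes` runs."""
    rng = random.Random(seed)
    votes = Counter()
    for codes in codes_per_run:
        votes.update(sample_buckets(codes, ratio, rng))
    return sorted(i for i, v in votes.items() if v >= min_votes)


# Toy usage: 8 instances and hypothetical hash codes from two hashing runs,
# repeated three times to give six independent sampling runs.
run_codes = [
    ["00", "00", "00", "00", "01", "01", "11", "11"],
    ["00", "00", "01", "01", "01", "01", "11", "11"],
]
selected = vote_select(run_codes * 3, ratio=0.5, min_votes=3)
print(selected)
```

In this sketch, lowering `ratio` or raising `min_votes` keeps fewer instances, which mirrors the abstract's remark that the compression ratio can be controlled through parameters.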

CLC number: TP181

Cite this article:
Yajie HUANG, Junhai ZHAI, Xiang ZHOU, Yan LI. Voting instance selection algorithm based on learning to hash[J]. Journal of Computer Applications, 2022, 42(2): 389-394.

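Figures/Tables (3)

Tab. 1 Corresponding probability distribution of three synthetic datasets

Dataset	Classes	Probability distribution
Gauss1	2	$p(x \mid w_1) \sim N\left(\begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix}, \begin{bmatrix} 0.6 & -0.2 \\ -0.2 & 0.6 \end{bmatrix}\right)$; $p(x \mid w_2) \sim N\left(\begin{bmatrix} 2.5 \\ 2.5 \end{bmatrix}, \begin{bmatrix} 0.2 & -0.1 \\ -0.1 & 0.2 \end{bmatrix}\right)$
Gauss2	3	$p(x \mid w_1) \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right)$; $p(x \mid w_2) \sim N\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right)$; $p(x \mid w_3) \sim \tfrac{1}{2} N\left(\begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) + \tfrac{1}{2} N\left(\begin{bmatrix} -0.5 \\ 0.5 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right)$
Gauss3	4	$p(x \mid w_1) \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\right)$; $p(x \mid w_2) \sim N\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 & 1 \\ 0 & 2 & 2 \\ 1 & 2 & 5 \end{bmatrix}\right)$; $p(x \mid w_3) \sim N\left(\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 1 \end{bmatrix}\right)$; $p(x \mid w_4) \sim N\left(\begin{bmatrix} 0 \\ 0.5 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 3 \end{bmatrix}\right)$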
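
As an illustration of how data following the Gauss1 distributions in Tab. 1 could be generated, the sketch below draws points from the two class-conditional Gaussians using a hand-written 2×2 Cholesky factor. The sample size of 1000 per class, the equal class sizes, the seed, and the helper name `sample_gaussian_2d` are assumptions for this example only; the page does not state how the authors generated or balanced the data.

```python
import math
import random


def sample_gaussian_2d(mean, cov, n, rng):
    """Draw n points from a 2-D Gaussian using the Cholesky factor of cov."""
    (a, b), (_, c) = cov
    # Lower-triangular factor L with L @ L.T == cov.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return points


# Gauss1 class-conditional parameters (mean, covariance) as listed in Tab. 1.
GAUSS1 = {
    1: ((1.0, 1.0), ((0.6, -0.2), (-0.2, 0.6))),
    2: ((2.5, 2.5), ((0.2, -0.1), (-0.1, 0.2))),
}

rng = random.Random(42)
data = [(x, label)
        for label, (mean, cov) in GAUSS1.items()
        for x in sample_gaussian_2d(mean, cov, 1000, rng)]
```

Each element of `data` is a `((x1, x2), label)` pair, so the list can be fed directly to a classifier or to an instance selection routine.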

Tab. 2 Basic information of 7 datasets used in experiments

Dataset	Instances	Attributes	Classes
Gauss1	1 000 000	2	2
Gauss2	1 200 000	2	3
Gauss3	1 000 000	3	4
Shuttle	58 000	9	7
Poker	1 000 000	10	10
Covtype	581 012	54	7
Skin	245 057	3	2

Tab. 3 Comparison of test accuracy, compression ratio and running time of 3 algorithms on 7 datasets

Dataset	Algorithm	Test accuracy	Compression ratio	Running time/s
Gauss1	LSH-IS-F	0.979	9.336	66.054 0
	CNN	0.978	7.169	352.465 0
	LH-VIS	0.982	9.163	50.822 0
Gauss2	LSH-IS-F	0.411	7.565	72.546 0
	CNN	0.418	6.966	142.632 0
	LH-VIS	0.435	6.433	60.273 0
Gauss3	LSH-IS-F	0.498	5.528	136.026 0
	CNN	0.414	7.049	121.667 0
	LH-VIS	0.513	4.046	61.782 0
Shuttle	LSH-IS-F	0.987	1.534	213.096 0
	CNN	0.998	3.348	290.342 0
	LH-VIS	0.988	1.162	106.793 0
Poker	LSH-IS-F	0.853	8.648	421.108 0
	CNN	0.902	6.333	897.323 5
	LH-VIS	0.903	7.954	268.186 0
Covtype	LSH-IS-F	0.920	2.233	63.785 0
	CNN	0.924	3.073	300.956 0
	LH-VIS	0.931	1.337	50.483 0
Skin	LSH-IS-F	0.962	3.883	49.887 0
	CNN	0.960	3.346	40.357 0
	LH-VIS	0.950	2.936	40.474 0


