Spherical embedding algorithm based on Kullback-Leibler divergence and distances between nearest neighbor points

Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (3): 680-683. DOI: 10.11772/j.issn.1001-9081.2017.03.680

• The 4th Big Data Academic Conference (CCF BIGDATA2016)

ZHANG Bianlan, LU Yonggang, ZHANG Haitao

School of Information Science and Engineering, Lanzhou University, Lanzhou Gansu 730000, China

Received: 2016-09-19 Revised: 2016-11-11 Online: 2017-03-10 Published: 2017-03-22
Corresponding author: LU Yonggang
About the authors: ZHANG Bianlan (born 1991), female, from Lüliang, Shanxi, M.S. candidate; research interest: pattern recognition. LU Yonggang (born 1974), male, from Longnan, Gansu, professor, Ph.D., CCF member; research interests: pattern recognition, artificial intelligence, bioinformatics. ZHANG Haitao (born 1986), male, from Lanzhou, Gansu, Ph.D.; research interests: pattern recognition, software engineering.
Supported by:
This work is partially supported by the General Program of the National Natural Science Foundation of China (61272213) and the Fundamental Research Funds for the Central Universities (lzujbky-2016-k07, lzujbky-2016-142).

Abstract

Abstract: Existing spherical embedding algorithms cannot embed data effectively into a low-dimensional space when the distances between non-neighboring points are inaccurate or missing. To address this problem, a new spherical embedding algorithm was proposed. Using only the distances between nearest neighbor points, it embeds high-dimensional data of any scale onto the unit sphere and, at the same time, estimates the sphere radius that fits the distribution of the original data. Starting from a randomly generated distribution on the sphere, the algorithm uses the Kullback-Leibler (KL) divergence to measure the difference between the normalized distances of each pair of neighboring points in the original space and in the spherical space, and constructs the objective function from this difference. Stochastic gradient descent with momentum is then used to optimize the distribution of the points on the sphere until the result is stable. To test the algorithm, two types of spherically distributed data were simulated: data following a spherical uniform distribution and data following a spherical normal distribution (the Kent distribution on the unit sphere). The experimental results show that uniformly distributed data can be embedded accurately in the spherical space even when the number of neighbors is very small: the Root Mean Square Error (RMSE) between the embedded and the original data distributions is below 0.00001, and the estimation error of the sphere radius is below 0.000001. For data following the spherical normal distribution, the algorithm can also embed the data fairly accurately when the number of neighbors is large. Therefore, even when the distances between non-neighboring points are missing, the proposed method can still embed data into a low-dimensional space fairly accurately, which is very helpful for data visualization.

Key words: spherical embedding, Kullback-Leibler (KL) divergence, stochastic gradient descent method, nearest neighbor
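The abstract gives the outline of the method but not its formulas, so the following is only a minimal sketch of how such a procedure could look, not the authors' implementation. It assumes that distances are normalized by dividing by their sum over all neighbor pairs, that distances on the unit sphere are great-circle (geodesic) distances, and that the radius is estimated as the ratio of summed original distances to summed geodesic distances. For simplicity it uses the full gradient over all neighbor pairs with momentum rather than stochastic sampling, and it keeps the best iterate seen. All function names, the step size `lr`, the momentum `mu` and the iteration count are hypothetical, untuned placeholders.

```python
import numpy as np

EPS = 1e-9  # keeps arccos and its derivative away from the singular points +-1

def sample_sphere(n, r, seed=0):
    """Uniform test data on a sphere of radius r: unit directions U, geodesic distances D."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, 3))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    D = r * np.arccos(np.clip(U @ U.T, -1.0, 1.0))
    np.fill_diagonal(D, 0.0)
    return U, D

def knn_pairs(D, k):
    """Index arrays (i, j) listing the k nearest neighbors j of every point i."""
    n = D.shape[0]
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    return np.repeat(np.arange(n), k), nbrs.ravel()

def geodesic(X, i, j):
    """Great-circle distances between unit vectors X[i] and X[j], plus their cosines."""
    c = np.clip(np.sum(X[i] * X[j], axis=1), -1.0 + EPS, 1.0 - EPS)
    return np.arccos(c), c

def kl_and_grad(X, i, j, p):
    """KL(p || q) and its gradient w.r.t. X; q are the normalized geodesic distances."""
    g, c = geodesic(X, i, j)
    q = g / g.sum()
    loss = np.sum(p * np.log(p / q))
    # Chain rule: dKL/dg = (q - p) / g and dg/dc = -1 / sqrt(1 - c^2)
    w = ((q - p) / g) * (-1.0 / np.sqrt(1.0 - c * c))
    grad = np.zeros_like(X)
    np.add.at(grad, i, w[:, None] * X[j])
    np.add.at(grad, j, w[:, None] * X[i])
    return loss, grad

def embed(d, i, j, n, dim=3, steps=500, lr=0.5, mu=0.9, seed=0):
    """Momentum gradient descent on the unit sphere, starting from a random configuration."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dim))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    p = d / d.sum()                      # normalized neighbor distances, original space
    V = np.zeros_like(X)
    best_X, best_loss, history = X.copy(), np.inf, []
    for _ in range(steps + 1):
        loss, grad = kl_and_grad(X, i, j, p)
        history.append(loss)
        if loss < best_loss:             # remember the best configuration seen so far
            best_X, best_loss = X.copy(), loss
        grad -= np.sum(grad * X, axis=1, keepdims=True) * X  # tangential component only
        V = mu * V - lr * grad
        X = X + V
        X /= np.linalg.norm(X, axis=1, keepdims=True)        # project back onto the sphere
    return best_X, best_loss, history

def estimate_radius(d, X, i, j):
    """Ratio estimate of the radius, assuming original distance = r * geodesic distance."""
    g, _ = geodesic(X, i, j)
    return d.sum() / g.sum()
```

A typical call would build the neighbor pairs with `knn_pairs(D, k)`, take `d = D[i, j]`, run `embed(d, i, j, n)`, and then pass the returned configuration to `estimate_radius`. Because the objective only sees normalized distances, it is invariant to the overall scale of the input, which is why the radius has to be recovered in a separate step.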

CLC Number: TP181

Cite this article:

ZHANG Bianlan, LU Yonggang, ZHANG Haitao. Spherical embedding algorithm based on Kullback-Leibler divergence and distances between nearest neighbor points[J]. Journal of Computer Applications, 2017, 37(3): 680-683.

