计算机应用 ›› 2017, Vol. 37 ›› Issue (3): 680-683.DOI: 10.11772/j.issn.1001-9081.2017.03.680

• 第四届大数据学术会议(CCF BIGDATA2016) • 上一篇    下一篇

基于KL散度和近邻点间距离的球面嵌入算法

张变兰, 路永钢, 张海涛   

  1. 兰州大学 信息科学与工程学院, 兰州 730000
  • 收稿日期:2016-09-19 修回日期:2016-11-11 出版日期:2017-03-10 发布日期:2017-03-22
  • 通讯作者: 路永钢
  • 作者简介:张变兰(1991-),女,山西吕梁人,硕士研究生,主要研究方向:模式识别;路永钢(1974-),男,甘肃陇南人,教授,博士,CCF会员,主要研究方向:模式识别、人工智能、生物信息;张海涛(1986-),男,甘肃兰州人,博士,主要研究方向:模式识别、软件工程。
  • 基金资助:
    国家自然科学基金面上项目(61272213);中央高校基本科研业务费专项资金资助项目(lzujbky-2016-k07,lzujbky-2016-142)。

Spherical embedding algorithm based on Kullback-Leibler divergence and distances between nearest neighbor points

ZHANG Bianlan, LU Yonggang, ZHANG Haitao   

  1. School of Information Science and Engineering, Lanzhou University, Lanzhou Gansu 730000, China
  • Received:2016-09-19 Revised:2016-11-11 Online:2017-03-10 Published:2017-03-22
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61272213), the Fundamental Research Funds for the Central Universities (lzujbky-2016-k07, lzujbky-2016-142).

摘要: 针对现有球面嵌入算法在非近邻点间的距离度量不准确或缺失的情况下,不能有效地进行低维嵌入的问题,提出了一种新的球面嵌入算法,它能够只利用近邻点间的距离,将任何尺度的高维数据嵌入到单位球面上,同时求出适合原始数据分布的球面半径。该算法从一个随机产生的球面分布开始,利用KL散度衡量每对近邻点间的归一化距离在原始空间和球面空间中的差异,并基于此差异构建出目标函数,然后再用带有动量的随机梯度下降法,不断优化球面上点的分布,直到结果稳定。为了测试算法,模拟产生了两类球面分布数据:分别是球面均匀分布和球面正态分布的数据。实验结果表明,对于球面均匀分布的数据,即使在近邻点个数很少的情况下,仍然能够将数据准确地嵌入球面空间,嵌入后的数据分布与原始数据分布的均方根误差(RMSE)低于0.00001,且球面半径的估算误差低于0.000001;而对于球面正态分布的数据,在近邻点个数较多的情况下,该算法也可以将数据较准确地嵌入球面空间。因此,在非近邻点间距离缺失的情况下,所提方法仍然可以较准确地对数据进行低维嵌入,这非常有利于数据的可视化研究。

关键词: 球面嵌入, KL散度, 随机梯度下降法, 最近邻

Abstract: Aiming at the problem that the existing spherical embedding algorithm cannot effectively embed the data into the low-dimensional space in the case that the distances between points far apart are inaccurate or absent, a new spherical embedding method was proposed, which can take the distances between the nearest neighbor points as input, and embeds high dimensional data of any scale onto the unit sphere, and then estimates the radius of the sphere which fit the distribution of the original data. Starting from a randomly generated spherical distribution, the Kullback-Leibler (KL) divergence was used to measure the difference of the normalized distance between each pair of neighboring points in the original space and the spherical space. Based on the difference, the objective function was constructed. Then, the stochastic gradient descent method with momentum was used to optimize the distribution of the points on the sphere until the result is stable. To test the algorithm, two types of spherical distribution data sets were simulated: which are spherical uniform distribution and Kent distribution on the unit sphere. The experimental results show that, for the uniformly distributed data, the data can be accurately embedded in the spherical space even if the number of neighbors is very small, the Root Mean Square Error (RMSE) of the embedded data distribution and the original data distribution is less than 0.00001, and the spherical radius of the estimated error is less than 0.000001; for spherical normal distribution data, the data can be embedded into the spherical space accurately when the number of neighbors is large. Therefore, in the case that the distance between points far apart are absent, the proposed method can still be quite accurate for low-dimensional data embedding, which is very helpful for the visualization of data.

Key words: spherical embedding, Kullback-Leibler (KL) divergence, stochastic gradient descent method, nearest neighbor

中图分类号: