Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 733-737.DOI: 10.11772/j.issn.1001-9081.2020060851

Special Issue: 数据科学与技术

• Data science and technology • Previous Articles     Next Articles

Method of dynamically constructing spatial topic R-tree based on k-means++

ZOU Zhiwen, QIN Cheng   

  1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang Jiangsu 212013, China
  • Received:2020-06-19 Revised:2020-10-21 Online:2021-03-10 Published:2021-01-15
  • Supported by:
    This work is partially supported by the Zhenjiang Key Research and Development Program (Industrial Prospect and Common Key Technology) (GY2017025).


邹志文, 秦程   

  1. 江苏大学 计算机科学与通信工程学院, 江苏 镇江 212013
  • 通讯作者: 秦程
  • 作者简介:邹志文(1968-),男,江西宜黄人,副教授,硕士,主要研究方向:数据库、分布式系统;秦程(1994-),男,安徽滁州人,硕士研究生,主要研究方向:空间数据。
  • 基金资助:

Abstract: The existing R-tree spatial clustering technology usually randomly designates or calculates the Euclidean distance between spatial data to select the cluster centers, without considering the topic relevance between spatial data, so that the clustering result is influenced by the initial value of k, and the association between spatial data is only based on geographic location. Aiming at this situation, a method of dynamically constructing spatial Topic R-tree (TR-tree) based on k-means++ was proposed. Firstly, in the traditional k-means++ algorithm, k clusters were dynamically determined by the clustering measure function, and Latent Dirichlet Allocation (LDA) model was introduced into the clustering measure function to calculate the topic probability of each spatial data text, as a result, the topic relevance between spatial data was strengthened. Secondly, the cluster center with the highest probability was selected through the topic probabilities. Finally, the TR-tree was constructed, and the spatial data were dynamically allocated during the construction. Experimental results show that with a slight increase of the R-tree construction time, this method has the indexing efficiency and correlation between nodes significantly improved compared to the algorithm of constructing R-tree based only on geographic location clustering.

Key words: R-tree, k-means++, clustering, search efficiency, Latent Dirichlet Allocation (LDA) model

摘要: 现有的R-树空间聚类技术在通常通过随机指定或者计算空间数据间的欧氏距离来选取聚类中心,而未考虑空间数据间的主题相关度。这些导致聚类结果受初始k值影响,空间数据间的关联仅仅是基于地理位置的。针对此种情况,提出了一种基于k-means++的动态构建空间主题R树(TR-tree)方法。首先,在传统的k-means++算法上,通过聚类测度函数动态地确定k个聚类簇,并在聚类测度函数中引入潜在狄利克雷分布(LDA)模型来计算每个空间数据文本的主题概率,从而加强空间数据间的主题关联度;其次,通过主题概率选取概率最大的聚类中心;最后,构建TR-tree,并且在构建时动态分配空间数据。实验结果表明:虽然构建R-树的时间略有增加,但该方法在索引效率及节点间关联度上较仅仅基于地理位置聚类构建R-树的算法有明显提升。

关键词: R-树, k-means++, 聚类, 索引效率, 潜在狄利克雷分布模型

CLC Number: