Journal of Computer Applications ›› 2005, Vol. 25 ›› Issue (09): 2037-2040. DOI: 10.3724/SP.J.1087.2005.02037

• Artificial Intelligence •

A K-means text clustering algorithm with optimized initial centers

赵万磊1,2,王永吉2,张学杰1,李娟2   

  1. Institute of Information, Yunnan University; 2. Institute of Software, Chinese Academy of Sciences
  • Online: 2011-04-11 Published: 2005-09-01
  • Funding:

    Supported by the National 863 Program of China (2001AA113180, 2002AA116080)

A variant of the K-means algorithm for document clustering: optimizing initial centers

ZHAO Wan-lei1,2, WANG Yong-ji2, ZHANG Xue-jie1, LI Juan2

  1. Institute of Information, Yunnan University, Kunming 650091, China; 2. Institute of Software, Chinese Academy of Sciences, Beijing 100080, China
  • Online: 2011-04-11 Published: 2005-09-01

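Abstract: Text clustering is widely applied in information filtering and web page classification, but it must cope with large data volumes and high feature dimensionality. Because the K-means algorithm is easy to implement and depends little on the data, it has been adopted for text clustering. However, traditional K-means and its variants produce clustering results that fluctuate considerably. This paper therefore improves K-means by optimizing the selection of the initial cluster centers, yielding an algorithm suited to clustering text data. Extensive experiments show that the algorithm produces high-quality results with little fluctuation in clustering quality.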
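
The abstract does not spell out how the initial centers are chosen, so the following is only a minimal sketch of the general idea: K-means that takes its initial centers from a pluggable seeding function instead of always sampling them at random. The farthest-first heuristic, the Euclidean distance, and all function names are assumptions made for illustration; they are not the authors' method.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_centers(points, k, rng):
    """Baseline seeding: pick k distinct points uniformly at random."""
    return rng.sample(points, k)


def farthest_first_centers(points, k):
    """Stand-in heuristic (NOT the paper's method): start from the point
    closest to the overall mean, then repeatedly add the point that is
    farthest from every center chosen so far. Deterministic per input."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    centers = [min(points, key=lambda p: dist(p, mean))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations from the given initial centers.
    Returns (labels, centers)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Calling `kmeans(points, farthest_first_centers(points, k))` returns the same partition on every run for a given input, whereas `kmeans(points, random_centers(points, k, random.Random()))` can return different partitions from run to run, which is the kind of fluctuation the abstract describes.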

Keywords: optimization, text clustering, K-means

Abstract: Document clustering has been employed in information filtering, web page classification, and similar tasks. K-means is one of the most widely used clustering techniques because of its simplicity and high scalability. Because the initial centers are selected at random, traditional K-means and its variants often give unstable results. Here, a technique for optimizing the initial cluster centers is proposed. Combined with incremental iteration, it produces clustering results with high purity, low entropy, and good stability.
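
Purity and entropy, the two quality measures named in the abstract, can be computed from cluster assignments and reference class labels. The sketch below uses their common textbook definitions; the function names are illustrative, and the exact variant used in the paper (for example, whether entropy is normalized by the logarithm of the number of classes) is not stated in the abstract, so treat the details as assumptions.

```python
import math
from collections import Counter


def _class_counts(cluster_labels, class_labels):
    """Map each cluster id to a Counter of the true classes it contains."""
    clusters = {}
    for c, t in zip(cluster_labels, class_labels):
        clusters.setdefault(c, Counter())[t] += 1
    return clusters


def purity(cluster_labels, class_labels):
    """Fraction of documents that belong to the majority class of their
    cluster. Higher is better; 1.0 means every cluster is pure."""
    clusters = _class_counts(cluster_labels, class_labels)
    majority = sum(max(cnt.values()) for cnt in clusters.values())
    return majority / len(class_labels)


def entropy(cluster_labels, class_labels):
    """Size-weighted average of the per-cluster class entropy, in bits
    (unnormalized). Lower is better; 0.0 means every cluster is pure."""
    clusters = _class_counts(cluster_labels, class_labels)
    n = len(class_labels)
    total = 0.0
    for cnt in clusters.values():
        size = sum(cnt.values())
        h = -sum((v / size) * math.log2(v / size) for v in cnt.values())
        total += (size / n) * h
    return total
```

Running both measures over repeated clustering runs and comparing their spread is one simple way to quantify the stability that the abstract refers to.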

Key words: optimization, document clustering, K-means
