Journal of Computer Applications (计算机应用)

• Data mining

Text clustering method based on a genetic algorithm and a self-organizing feature map network

Xiao QIN (覃晓)

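  1. Department of Mathematics and Computer Science, Guangxi Teachers Education University (广西师范学院)
  • Received: 2007-09-27 Revised: 2007-12-03 Online: 2008-03-01 Published: 2008-03-01
  • Corresponding author: Xiao QIN (覃晓)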

Text clustering method based on genetic algorithm and SOM network

Xiao QIN   

  • Received: 2007-09-27 Revised: 2007-12-03 Online: 2008-03-01 Published: 2008-03-01
  • Contact: Xiao QIN

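Abstract (from the Chinese original): The self-organizing map (SOM) algorithm, an unsupervised learning algorithm for clustering and high-dimensional visualization, provides a powerful tool for clustering Chinese Web documents. However, the SOM algorithm is inherently sensitive to the initial network weights, which degrades clustering quality. To address this, a genetic algorithm is introduced to optimize the SOM network, and a text clustering algorithm based on a GA-optimized SOM network (GSTCA) is proposed. Comparative experiments show that GSTCA achieves higher accuracy than the SOM algorithm in clustering Chinese Web documents, with the F-measure improved by 14% on average. The experiments also show that GSTCA is insensitive to the initial network weights, which improves the stability of the algorithm.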

Keywords (from the Chinese original): self-organizing feature map, genetic algorithm, text clustering

Abstract: As an unsupervised learning algorithm for clustering and high-dimensional visualization, the Self-Organizing Map (SOM) provides an effective means of clustering Chinese Web documents. However, the SOM algorithm is inherently sensitive to the initial weights of the network, which degrades the accuracy of the resulting clusters. To solve this problem, this paper applies a genetic algorithm to optimize the SOM. The paper makes the following contributions: 1) it proposes a GA-SOM-based Text Clustering Algorithm (GSTCA); 2) it reports comparison experiments. The results show that GSTCA achieves higher accuracy than the SOM algorithm in Chinese Web document clustering, with the average F-measure improved by 14.5% over the traditional method. The experiments also show that GSTCA is not sensitive to the initial weights of the network, thereby enhancing the stability of the algorithm.

Key words: Self-Organizing Map (SOM), Genetic Algorithm (GA), text clustering
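
The general idea summarized in the abstract, using a genetic algorithm to choose the initial SOM weights before ordinary SOM training, can be illustrated with a small sketch. The abstract does not give GSTCA's encoding, fitness function, or genetic operators, so everything below is an assumption made for illustration: a one-dimensional map, negative quantization error as fitness, truncation selection with elitism, unit-level uniform crossover, and the hypothetical function names `ga_initial_weights` and `train_som`. The toy vectors stand in for document feature vectors.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_unit(weights, x):
    """Index of the SOM unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j], x))

def quantization_error(weights, data):
    """Mean squared distance from each sample to its best-matching unit."""
    return sum(sq_dist(weights[best_unit(weights, x)], x) for x in data) / len(data)

def train_som(weights, data, epochs=20, lr0=0.5, radius0=1.0):
    """Train a 1-D SOM (units arranged on a line) from the given initial weights."""
    w = [list(u) for u in weights]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # linear decay of rate and radius
        lr, radius = lr0 * frac, radius0 * frac
        for x in data:
            b = best_unit(w, x)
            for j in range(len(w)):
                if abs(j - b) <= radius:     # update the winner and its neighbours
                    for d in range(len(x)):
                        w[j][d] += lr * (x[d] - w[j][d])
    return w

def ga_initial_weights(data, n_units, pop_size=20, generations=30,
                       mutation_rate=0.1, seed=0):
    """Search for initial SOM weights with a simple GA (illustrative operators)."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(x[d] for x in data) for d in range(dim)]
    hi = [max(x[d] for x in data) for d in range(dim)]

    def random_individual():
        return [[rng.uniform(lo[d], hi[d]) for d in range(dim)]
                for _ in range(n_units)]

    def fitness(ind):
        # Assumed fitness: lower quantization error means a fitter individual.
        return -quantization_error(ind, data)

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: pop_size // 2]      # truncation selection, elite kept as-is
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            # Uniform crossover: each unit is copied from one of the two parents.
            child = [list(rng.choice((u1, u2))) for u1, u2 in zip(p1, p2)]
            for unit in child:
                for d in range(dim):
                    if rng.random() < mutation_rate:
                        unit[d] = rng.uniform(lo[d], hi[d])
            children.append(child)
        population = elite + children
    return max(population, key=fitness)

# Toy 2-D vectors standing in for document feature vectors (not from the paper).
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.0, 1.0], [0.1, 0.9]]
init = ga_initial_weights(data, n_units=3)
som = train_som(init, data)
labels = [best_unit(som, x) for x in data]   # cluster index for each sample
print(labels)
```

Because the fitter half of each generation is carried over unchanged, the best fitness in this sketch never gets worse from one generation to the next. That property suits a search for a good starting point, though it says nothing about how GSTCA itself is constructed.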