Journal of Computer Applications, 2015, Vol. 35, Issue (11): 3130-3134. DOI: 10.11772/j.issn.1001-9081.2015.11.3130

• DPCS 2015 Paper •

Novel text clustering approach based on R-Grams

WANG Xianming1,2, GU Qiong3,4, HU Zhiwen5   

  1. Oujiang College, Wenzhou University, Wenzhou Zhejiang 325035, China;
  2. Network Research Institute of Wenzhou, Wenzhou Zhejiang 325035, China;
  3. School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang Hubei 441053, China;
  4. Institute of Logic and Intelligence, Southwest University, Chongqing 400715, China;
  5. College of New Media, Zhejiang University of Media and Communications, Hangzhou Zhejiang 310018, China
  • Received: 2015-06-17; Revised: 2015-07-15; Published: 2015-11-13

基于R-Grams的文本聚类方法

王贤明1,2, 谷琼3,4, 胡智文5   

  1. 温州大学 瓯江学院, 浙江 温州 325035;
  2. 温州信息化研究中心, 浙江 温州 325035;
  3. 湖北文理学院 数学与计算机科学学院, 湖北 襄阳 441053;
  4. 西南大学 逻辑与智能研究中心, 重庆 400715;
  5. 浙江传媒学院 新媒体学院, 杭州 310018
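  • Corresponding author: GU Qiong (born 1973), female, from Jingmen, Hubei; associate professor, Ph.D., CCF member. Research interests: Web data mining, online public opinion.
  • About the authors: WANG Xianming (born 1979), male, from Huanggang, Hubei; lecturer, M.S. Research interests: Web data mining, online public opinion. HU Zhiwen (born 1975), male, from Huanggang, Hubei; associate professor, Ph.D. Research interests: new media, Web data mining.
  • Supported by:
    Natural Science Foundation of Zhejiang Province (LY13F010005); Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH015); Soft Science Project of the Hubei Provincial Science and Technology Support Program (2015BDH109); Wenzhou Science and Technology Plan Project (R20130021).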

Abstract: To address the difficulty of balancing clustering accuracy and recall in traditional text clustering algorithms, a clustering approach based on the R-Grams text similarity algorithm was proposed. First, the documents to be clustered were sorted in descending order. Second, the R-Grams similarity algorithm was used to compute the similarity between texts; based on these similarities, the marker document of each cluster was identified and an initial clustering was obtained. Finally, the initial clusters were merged to produce the final clustering. The experimental results show that the clustering results can be flexibly regulated by adjusting the clustering threshold to satisfy different demands, and that the optimal threshold is about 15. As the clustering threshold increases, clustering accuracy increases, while recall first increases and then decreases. In addition, the approach avoids time-consuming processing such as word segmentation and feature extraction, and is easy to implement.

Key words: text, clustering, random, R-Grams

摘要: 针对传统文本聚类中存在着聚类准确率和召回率难以平衡等问题,提出了一种基于R-Grams文本相似度计算方法的文本聚类方法.该方法首先通过将待聚类文档降序排列,其次采用R-Grams文本相似度算法计算文本之间的相似度并根据相似度实现各聚类标志文档的确定并完成初始聚类,最后通过对初始聚类结果进行聚类合并完成最终聚类.实验结果表明:聚类结果可以通过聚类阈值灵活调整以适应不同的需求,最佳聚类阈值为15左右.随着聚类阈值的增大,各聚类准确率增大,召回率呈现先增后降的趋势.此外,该聚类方法避免了大量的分词、特征提取等繁琐处理,实现简单.

关键词: 文本, 聚类, 随机, R-Grams
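
The three steps named in the abstract (sort, initial clustering around marker documents, merge) can be sketched as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the abstract does not define the R-Grams similarity, the sort key, or the merge criterion. Here the hypothetical function shared_bigram_count stands in for the R-Grams similarity, document length is assumed as the sort key, and clusters are merged in a single greedy pass when any pair of their members reaches the threshold. All function and parameter names are invented for this sketch.

```python
from typing import Callable, List


def shared_bigram_count(a: str, b: str) -> int:
    """Stand-in similarity (NOT the R-Grams algorithm, which the abstract
    does not define): number of distinct character bigrams shared by a and b."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(grams_a & grams_b)


def cluster_texts(
    docs: List[str],
    threshold: float,
    similarity: Callable[[str, str], float] = shared_bigram_count,
) -> List[List[str]]:
    # Step 1: sort in descending order (length is an assumed sort key).
    ordered = sorted(docs, key=len, reverse=True)

    # Step 2: initial clustering. Element 0 of each cluster is its marker
    # document; a document joins the best-scoring cluster whose marker it
    # matches at or above the threshold, otherwise it becomes a new marker.
    clusters: List[List[str]] = []
    for doc in ordered:
        best, best_score = None, threshold
        for cluster in clusters:
            score = similarity(cluster[0], doc)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([doc])
        else:
            best.append(doc)

    # Step 3: merge initial clusters in one greedy pass (assumed criterion:
    # any cross-cluster pair of documents reaches the threshold).
    merged: List[List[str]] = []
    for cluster in clusters:
        target = None
        for existing in merged:
            if any(similarity(x, y) >= threshold
                   for x in existing for y in cluster):
                target = existing
                break
        if target is None:
            merged.append(list(cluster))
        else:
            target.extend(cluster)
    return merged


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps",
        "the quick brown fox leaps",
        "zzzz yyyy xxxx wwww",
    ]
    for group in cluster_texts(sample, threshold=5):
        print(group)
```

Because the similarity function is a parameter, an actual R-Grams implementation could be passed in place of the stand-in. The threshold scale depends on the similarity measure, so the value of about 15 reported in the abstract applies to the authors' measure and not to this stand-in.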
